Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Linux-HA: Users

lost node and split brain cluster

 

 

Linux-HA users RSS feed   Index | Next | Previous | View Threaded


frank at si

Dec 22, 2008, 12:00 AM

Post #1 of 4 (1485 views)
Permalink
lost node and split brain cluster

Hi,
we have a two nodes cluster with heartbeat 2.1.4 on a RHEL 5 Linux
We have OpenVZ on both nodes, a shared GFS disk with VE data, and
ManageVE resources, some on node1 on some on node2
We are experimenting a strange behavior. Sometimes VEs of one node
appear started on both nodes. If we monitors with hb_gui, one of the
nodes is showed stopped although heartbeat is started there. Logs in the
node with all resources started looks like this:

........................
heartbeat[24950]: 2008/12/19_18:07:54 info: time_longclock: clock_t wrapped around (uptime).
heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 42949663
06 ms (> 510 ms) before being called (GSource: 0x9744758)
cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
crmd[25148]: 2008/12/19_18:07:54 notice: crmd_ha_status_callback: Status update: Node parmenides.upc.es now has status [dead]
heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started at 4359079925 should have started at 64112728
cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: no mbr_track info
crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
heartbeat[24950]: 2008/12/19_18:07:54 WARN: node parmenides.upc.es: is dead
cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: no mbr_track info
heartbeat[24950]: 2008/12/19_18:07:54 info: Link parmenides.upc.es:eth1 dead.
cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: instance=41, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 42949665
46 ms (> 510 ms) before being called (GSource: 0x9744988)
cib[25144]: 2008/12/19_18:07:54 info: cib_ccm_msg_callback: LOST: parmenides.upc.es
crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: instance=41, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started at 4359079925 should have started at 64112704
cib[25144]: 2008/12/19_18:07:54 info: cib_ccm_msg_callback: PEER: heraclito.upc.es
crmd[25148]: 2008/12/19_18:07:54 info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=41)
heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch function for update msgfree count was delayed 42949
67216 ms (> 30000 ms) before being called (GSource: 0x9744a08)
ccm[25142]: 2008/12/19_18:07:54 info: Break tie for 2 nodes cluster
crmd[25148]: 2008/12/19_18:07:54 info: erase_node_from_join: Removed dead node parmenides.upc.es from join calculations: welcome
d=0 itegrated=0 finalized=0 confirmed=1
heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started at 4359079925 should have started at 64112637
crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail: NEW MEMBERSHIP: trans=41, nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=3
heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch function for client audit was delayed 4294962206 ms
(> 5000 ms) before being called (GSource: 0x9744908)
crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail: CURRENT: heraclito.upc.es [nodeid=0, born=41]
heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started at 4359079925 should have started at 64113138
crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail: LOST: parmenides.upc.es [nodeid=1, born=40]
heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch function for memory stats was delayed 4244815296 ms
(> 5000 ms) before being called (GSource: 0x9744888)
..................
....................
heartbeat[24950]: 2008/12/19_18:07:55 CRIT: Cluster node parmenides.upc.es returning after partition.
heartbeat[24950]: 2008/12/19_18:07:55 info: For information on cluster partitions, See URL: http://linux-ha.org/SplitBrain
heartbeat[24950]: 2008/12/19_18:07:55 WARN: Deadtime value may be too small.
heartbeat[24950]: 2008/12/19_18:07:55 info: See FAQ for information on tuning deadtime.
heartbeat[24950]: 2008/12/19_18:07:55 info: URL: http://linux-ha.org/FAQ#heavy_load
heartbeat[24950]: 2008/12/19_18:07:55 info: Link parmenides.upc.es:eth1 up.
heartbeat[24950]: 2008/12/19_18:07:55 info: Status update for node parmenides.upc.es: status active
crmd[25148]: 2008/12/19_18:07:55 notice: crmd_ha_status_callback: Status update: Node parmenides.upc.es now has status [active]
cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: no mbr_track info
crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: no mbr_track info
cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: instance=42, nodes=2, new=1, lost=0, n_idx=0, new_idx=2, old_idx=4
crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: instance=42, nodes=2, new=1, lost=0, n_idx=0, new_idx=2, old_idx=4
cib[25144]: 2008/12/19_18:07:56 info: cib_ccm_msg_callback: PEER: heraclito.upc.es
crmd[25148]: 2008/12/19_18:07:56 info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=42)
cib[25144]: 2008/12/19_18:07:56 info: cib_ccm_msg_callback: PEER: parmenides.upc.es
crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: NEW MEMBERSHIP: trans=42, nodes=2, new=1, lost=0 n_idx=0, new_idx=2, ol
d_idx=4
crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: CURRENT: heraclito.upc.es [nodeid=0, born=38]
crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: CURRENT: parmenides.upc.es [nodeid=1, born=42]
crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: NEW: parmenides.upc.es [nodeid=1, born=42]
.................


It seems like one of the nodes is lost and latter recovered, but the
cluster do not recover normality. Resources on recovered node keeps on
started on the
other. The only way I have found to solve the situation is to do manual
action on the cluster, for instance to stop a resource. If I do that,
then cluster seems
to realize the situation, detects that some resources are multi-homed,
and corrects the situation.

What is happening? Does heavy load affects as the log messages says? But
why cluster doesn't recover normal status?
Thanks for your help.

Frank




--
Aquest missatge ha estat analitzat per MailScanner
a la cerca de virus i d'altres continguts perillosos,
i es considera que està net.
For all your IT requirements visit: http://www.transtec.co.uk

_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


dejanmm at fastmail

Dec 22, 2008, 8:27 AM

Post #2 of 4 (1418 views)
Permalink
Re: lost node and split brain cluster [In reply to]

Hi,

On Mon, Dec 22, 2008 at 09:00:51AM +0100, Frank wrote:
> Hi,
> we have a two nodes cluster with heartbeat 2.1.4 on a RHEL 5 Linux
> We have OpenVZ on both nodes, a shared GFS disk with VE data, and ManageVE
> resources, some on node1 on some on node2
> We are experimenting a strange behavior. Sometimes VEs of one node appear
> started on both nodes. If we monitors with hb_gui, one of the
> nodes is showed stopped although heartbeat is started there. Logs in the
> node with all resources started looks like this:
>
> ........................
> heartbeat[24950]: 2008/12/19_18:07:54 info: time_longclock: clock_t wrapped around (uptime).

Strange, this should've been fixed a long time ago, definitely
before 2.1.4.

> heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 42949663
> 06 ms (> 510 ms) before being called (GSource: 0x9744758)
> cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> crmd[25148]: 2008/12/19_18:07:54 notice: crmd_ha_status_callback: Status update: Node parmenides.upc.es now has status [dead]
> heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started at 4359079925 should have started at 64112728
> cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: no mbr_track info
> crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> heartbeat[24950]: 2008/12/19_18:07:54 WARN: node parmenides.upc.es: is dead
> cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: no mbr_track info
> heartbeat[24950]: 2008/12/19_18:07:54 info: Link parmenides.upc.es:eth1 dead.
> cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: instance=41, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
> crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 42949665
> 46 ms (> 510 ms) before being called (GSource: 0x9744988)
> cib[25144]: 2008/12/19_18:07:54 info: cib_ccm_msg_callback: LOST: parmenides.upc.es
> crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: instance=41, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
> heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started at 4359079925 should have started at 64112704
> cib[25144]: 2008/12/19_18:07:54 info: cib_ccm_msg_callback: PEER: heraclito.upc.es
> crmd[25148]: 2008/12/19_18:07:54 info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=41)
> heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch function for update msgfree count was delayed 42949
> 67216 ms (> 30000 ms) before being called (GSource: 0x9744a08)
> ccm[25142]: 2008/12/19_18:07:54 info: Break tie for 2 nodes cluster
> crmd[25148]: 2008/12/19_18:07:54 info: erase_node_from_join: Removed dead node parmenides.upc.es from join calculations: welcome
> d=0 itegrated=0 finalized=0 confirmed=1
> heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started at 4359079925 should have started at 64112637
> crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail: NEW MEMBERSHIP: trans=41, nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=3
> heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch function for client audit was delayed 4294962206 ms
> (> 5000 ms) before being called (GSource: 0x9744908)
> crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail: CURRENT: heraclito.upc.es [nodeid=0, born=41]
> heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started at 4359079925 should have started at 64113138
> crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail: LOST: parmenides.upc.es [nodeid=1, born=40]
> heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch function for memory stats was delayed 4244815296 ms
> (> 5000 ms) before being called (GSource: 0x9744888)
> ..................
> ....................
> heartbeat[24950]: 2008/12/19_18:07:55 CRIT: Cluster node parmenides.upc.es returning after partition.
> heartbeat[24950]: 2008/12/19_18:07:55 info: For information on cluster partitions, See URL: http://linux-ha.org/SplitBrain
> heartbeat[24950]: 2008/12/19_18:07:55 WARN: Deadtime value may be too small.
> heartbeat[24950]: 2008/12/19_18:07:55 info: See FAQ for information on tuning deadtime.
> heartbeat[24950]: 2008/12/19_18:07:55 info: URL: http://linux-ha.org/FAQ#heavy_load
> heartbeat[24950]: 2008/12/19_18:07:55 info: Link parmenides.upc.es:eth1 up.
> heartbeat[24950]: 2008/12/19_18:07:55 info: Status update for node parmenides.upc.es: status active
> crmd[25148]: 2008/12/19_18:07:55 notice: crmd_ha_status_callback: Status update: Node parmenides.upc.es now has status [active]
> cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: no mbr_track info
> crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: no mbr_track info
> cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: instance=42, nodes=2, new=1, lost=0, n_idx=0, new_idx=2, old_idx=4
> crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: instance=42, nodes=2, new=1, lost=0, n_idx=0, new_idx=2, old_idx=4
> cib[25144]: 2008/12/19_18:07:56 info: cib_ccm_msg_callback: PEER: heraclito.upc.es
> crmd[25148]: 2008/12/19_18:07:56 info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=42)
> cib[25144]: 2008/12/19_18:07:56 info: cib_ccm_msg_callback: PEER: parmenides.upc.es
> crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: NEW MEMBERSHIP: trans=42, nodes=2, new=1, lost=0 n_idx=0, new_idx=2, ol
> d_idx=4
> crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: CURRENT: heraclito.upc.es [nodeid=0, born=38]
> crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: CURRENT: parmenides.upc.es [nodeid=1, born=42]
> crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: NEW: parmenides.upc.es [nodeid=1, born=42]
> .................
>
>
> It seems like one of the nodes is lost and latter recovered, but the
> cluster do not recover normality. Resources on recovered node keeps on
> started on the
> other. The only way I have found to solve the situation is to do manual
> action on the cluster, for instance to stop a resource. If I do that, then
> cluster seems
> to realize the situation, detects that some resources are multi-homed, and
> corrects the situation.
>
> What is happening? Does heavy load affects as the log messages says? But
> why cluster doesn't recover normal status?

Can you please produce a report with hb_report for the period
around the accident and file a bugzilla.

Thanks,

Dejan

> Thanks for your help.
>
> Frank
>
>
>
>
> --
> Aquest missatge ha estat analitzat per MailScanner
> a la cerca de virus i d'altres continguts perillosos,
> i es considera que est? net.
> For all your IT requirements visit: http://www.transtec.co.uk
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA [at] lists
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Simon.Tavanyar at stratus

Jun 30, 2009, 7:03 AM

Post #3 of 4 (1004 views)
Permalink
Re: clock_t wrapped around causing false resourcestart failure [In reply to]

Hi Dejan,

The bug looks like a one-off occurrence. We run hundreds of hours of
system stress tests in a week, moving resources between main and standby
systems, and we haven't seen this error in a couple years. (There was a
longclock error back in 2007 found by my colleague Simon Graham).

The longclock wrap occurred within 2:45 of a reboot.
The apparent coincidence seems to be that we were starting resources on
a back-up node around 165 seconds after the node had been rebooted and
hearbeat restarted. As I expect you know, somewhere between 160 and 175
seconds after a heartbeat start, the longclock is configured to wrap.

The rareness of this makes me think we hit a really obscure window...

- Simon.


-----Original Message-----
From: linux-ha-bounces [at] lists
[mailto:linux-ha-bounces [at] lists] On Behalf Of Dejan
Muhamedagic
Sent: Tuesday, June 30, 2009 5:26 AM
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] clock_t wrapped around causing false
resourcestart failure

Hi,

On Mon, Jun 29, 2009 at 02:53:33PM -0400, Tavanyar, Simon wrote:
> I'm running heartbeat 2.1.4
>
> I'm getting a false failure on a start of my ClusterAddr resource
> because in the same second that the resource starts, the clock_t wraps
> around.
> Has anyone else seen this behavior?

Can't recall. And that shouldn't have happened. The time wrap is
recognized (as the log message shows) and a wrap counter is added
to the high bits so that the time is still greater than the
previous timestamp.

Do you have any more information about this: Was it a one-off
occurrence? Did your system really had a long uptime? How long?

Thanks,

Dejan

> Jun 22 10:04:49 node0 crmd: [14913]: info: do_lrm_rsc_op: Performing
> op=ClusterAddr_start_0
key=8:14:0:59f9d23b-effd-4ec4-a766-17ed34a92b34)
> Jun 22 10:04:49 node0 lrmd: [14910]: info: rsc:ClusterAddr: start
> Jun 22 10:04:49 node0 SpineFilesystem: running
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> Jun 22 10:04:49 node0 lrmd: [14910]: info: time_longclock: clock_t
> wrapped around (uptime).
> Jun 22 10:04:49 node0 lrmd: [14910]: WARN: ClusterAddr:start process
> (PID 17282) timed out (try 1). Killing with signal SIGTERM (15).
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> Jun 22 10:04:49 node0 crmd: [14913]: info: process_lrm_event: LRM
> operation SharedFs_monitor_30000 (call=17, rc=0) complete
> Jun 22 10:04:49 node0 lrmd: [14910]: WARN: operation start[18] on
> ocf::IPaddr2::ClusterAddr for client 14913, its parameters:
> ip=[134.111.29.140] cidr_netmask=[21] broadcast=[134.111.31.255]
> CRM_meta_timeout=[20000] crm_feature_set=[2.0] nic=[biz0] : pid
[17282]
> timed out
> Jun 22 10:04:49 node0 crmd: [14913]: ERROR: process_lrm_event: LRM
> operation ClusterAddr_start_0 (18) Timed Out (timeout=20000ms)
> Jun 22 10:04:50 node0 lrmd: [14910]: info: rsc:ClusterAddr: stop
> Jun 22 10:04:50 node0 crmd: [14913]: info: do_lrm_rsc_op: Performing
> op=ClusterAddr_stop_0 key=2:15:0:59f9d23b-effd-4ec4-a766-17ed34a92b34)
> Jun 22 10:04:50 node0 crmd: [14913]: info: process_lrm_event: LRM
> operation ClusterAddr_stop_0 (call=19, rc=0) complete
>
>
> Thanks
> Simon
> _______________________________________________
> Linux-HA mailing list
> Linux-HA [at] lists
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


dejanmm at fastmail

Jun 30, 2009, 10:15 AM

Post #4 of 4 (992 views)
Permalink
Re: clock_t wrapped around causing false resourcestart failure [In reply to]

Hi Simon,

On Tue, Jun 30, 2009 at 10:03:29AM -0400, Tavanyar, Simon wrote:
> Hi Dejan,
>
> The bug looks like a one-off occurrence. We run hundreds of hours of
> system stress tests in a week, moving resources between main and standby
> systems, and we haven't seen this error in a couple years. (There was a
> longclock error back in 2007 found by my colleague Simon Graham).

OK, so you are well acquainted with the business. Probably better
than I am.

> The longclock wrap occurred within 2:45 of a reboot.
> The apparent coincidence seems to be that we were starting resources on
> a back-up node around 165 seconds after the node had been rebooted and
> hearbeat restarted. As I expect you know, somewhere between 160 and 175
> seconds after a heartbeat start, the longclock is configured to wrap.

No, I don't know and I couldn't find it.

> The rareness of this makes me think we hit a really obscure window...

Looks like it. But it should be thoroughly investigated. Though I
don't understand how it can happen if the timer is monotonously
increasing.

Thanks,

Dejan

> - Simon.
>
>
> -----Original Message-----
> From: linux-ha-bounces [at] lists
> [mailto:linux-ha-bounces [at] lists] On Behalf Of Dejan
> Muhamedagic
> Sent: Tuesday, June 30, 2009 5:26 AM
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] clock_t wrapped around causing false
> resourcestart failure
>
> Hi,
>
> On Mon, Jun 29, 2009 at 02:53:33PM -0400, Tavanyar, Simon wrote:
> > I'm running heartbeat 2.1.4
> >
> > I'm getting a false failure on a start of my ClusterAddr resource
> > because in the same second that the resource starts, the clock_t wraps
> > around.
> > Has anyone else seen this behavior?
>
> Can't recall. And that shouldn't have happened. The time wrap is
> recognized (as the log message shows) and a wrap counter is added
> to the high bits so that the time is still greater than the
> previous timestamp.
>
> Do you have any more information about this: Was it a one-off
> occurrence? Did your system really had a long uptime? How long?
>
> Thanks,
>
> Dejan
>
> > Jun 22 10:04:49 node0 crmd: [14913]: info: do_lrm_rsc_op: Performing
> > op=ClusterAddr_start_0
> key=8:14:0:59f9d23b-effd-4ec4-a766-17ed34a92b34)
> > Jun 22 10:04:49 node0 lrmd: [14910]: info: rsc:ClusterAddr: start
> > Jun 22 10:04:49 node0 SpineFilesystem: running
> > !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > Jun 22 10:04:49 node0 lrmd: [14910]: info: time_longclock: clock_t
> > wrapped around (uptime).
> > Jun 22 10:04:49 node0 lrmd: [14910]: WARN: ClusterAddr:start process
> > (PID 17282) timed out (try 1). Killing with signal SIGTERM (15).
> > !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > Jun 22 10:04:49 node0 crmd: [14913]: info: process_lrm_event: LRM
> > operation SharedFs_monitor_30000 (call=17, rc=0) complete
> > Jun 22 10:04:49 node0 lrmd: [14910]: WARN: operation start[18] on
> > ocf::IPaddr2::ClusterAddr for client 14913, its parameters:
> > ip=[134.111.29.140] cidr_netmask=[21] broadcast=[134.111.31.255]
> > CRM_meta_timeout=[20000] crm_feature_set=[2.0] nic=[biz0] : pid
> [17282]
> > timed out
> > Jun 22 10:04:49 node0 crmd: [14913]: ERROR: process_lrm_event: LRM
> > operation ClusterAddr_start_0 (18) Timed Out (timeout=20000ms)
> > Jun 22 10:04:50 node0 lrmd: [14910]: info: rsc:ClusterAddr: stop
> > Jun 22 10:04:50 node0 crmd: [14913]: info: do_lrm_rsc_op: Performing
> > op=ClusterAddr_stop_0 key=2:15:0:59f9d23b-effd-4ec4-a766-17ed34a92b34)
> > Jun 22 10:04:50 node0 crmd: [14913]: info: process_lrm_event: LRM
> > operation ClusterAddr_stop_0 (call=19, rc=0) complete
> >
> >
> > Thanks
> > Simon
> > _______________________________________________
> > Linux-HA mailing list
> > Linux-HA [at] lists
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> _______________________________________________
> Linux-HA mailing list
> Linux-HA [at] lists
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
> _______________________________________________
> Linux-HA mailing list
> Linux-HA [at] lists
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Linux-HA users RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.