
Mailing List Archive: Linux-HA: Pacemaker

Crash this afternoon, want to verify I'm understanding this configuration correctly



m+pacemaker at nerdish

Feb 23, 2012, 3:37 PM


Hello,

I have a pretty simple cluster running with three nodes: xen1, xen2, and
qnode (which runs in standby at all times and exists only for quorum).
This afternoon xen1 reset out of the blue. There is nothing in its logs;
in fact, there's a gap from 15:37 to 15:47:

Feb 23 15:36:18 xen1 lrmd: [5328]: WARN: G_SIG_dispatch: Dispatch function
for SIGCHLD was delayed 1000 ms (> 100 ms) before being called (GSource:
0x1d96080)
Feb 23 15:36:18 xen1 lrmd: [5328]: info: G_SIG_dispatch: started at
2571780644 should have started at 2571780544
Feb 23 15:37:39 xen1 lrmd: [5328]: WARN: G_SIG_dispatch: Dispatch function
for SIGCHLD was delayed 1000 ms (> 100 ms) before being called (GSource:
0x1d96080)
Feb 23 15:37:39 xen1 lrmd: [5328]: info: G_SIG_dispatch: started at
2571788778 should have started at 2571788678
Feb 23 15:47:53 xen1 xenstored: Checking store ...
Feb 23 15:47:54 xen1 acpid: starting up with netlink and the input layer
Feb 23 15:47:54 xen1 xenstored: Checking store complete.
....


I'm kicking myself for the lack of a serial console, so I really don't know
what happened prior to the reboot. The two surviving nodes logged very
similar events, so it appears they could both see each other fine while
neither could see anything from xen1:

Feb 23 15:32:26 qnode pengine: [2096]: info: process_pe_message: Transition
10271: PEngine Input stored in: /var/lib/pengine/pe-input-4479.bz2
Feb 23 15:36:51 qnode cib: [2093]: info: cib_stats: Processed 1 operations
(10000.00us average, 0% utilization) in the last 10min
Feb 23 15:42:02 qnode corosync[2081]: [pcmk ] notice: pcmk_peer_update:
Transitional membership event on ring 1772: memb=2, new=0, lost=1
Feb 23 15:42:02 qnode corosync[2081]: [pcmk ] info: pcmk_peer_update:
memb: xen2 1241622720
Feb 23 15:42:02 qnode corosync[2081]: [pcmk ] info: pcmk_peer_update:
memb: qnode 1308731584
Feb 23 15:42:02 qnode corosync[2081]: [pcmk ] info: pcmk_peer_update:
lost: xen1 1224845504
Feb 23 15:42:02 qnode corosync[2081]: [pcmk ] notice: pcmk_peer_update:
Stable membership event on ring 1772: memb=2, new=0, lost=0
Feb 23 15:42:02 qnode corosync[2081]: [pcmk ] info: pcmk_peer_update:
MEMB: xen2 1241622720
Feb 23 15:42:02 qnode corosync[2081]: [pcmk ] info: pcmk_peer_update:
MEMB: qnode 1308731584
Feb 23 15:42:02 qnode corosync[2081]: [pcmk ] info:
ais_mark_unseen_peer_dead: Node xen1 was not seen in the previous transition
Feb 23 15:42:02 qnode corosync[2081]: [pcmk ] info: update_member: Node
1224845504/xen1 is now: lost
Feb 23 15:42:02 qnode corosync[2081]: [pcmk ] info:
send_member_notification: Sending membership update 1772 to 2 children
Feb 23 15:42:02 qnode corosync[2081]: [TOTEM ] A processor joined or left
the membership and a new membership was formed.
Feb 23 15:42:02 qnode crmd: [2097]: info: ais_dispatch: Membership 1772:
quorum retained
Feb 23 15:42:02 qnode cib: [2093]: info: ais_dispatch: Membership 1772:
quorum retained
Feb 23 15:42:02 qnode crmd: [2097]: info: ais_status_callback: status: xen1
is now lost (was member)


Those are followed by very verbose logs covering fencing attempts (via SBD)
by both qnode and xen2, none of which succeeded, leading me to believe that
at this point xen1 had already fenced itself, either via SBD or softdog,
and was in the middle of POST/bootup. When xen1 finished booting it
rejoined the cluster just fine. There are no failure- or alert-type
messages in the logs on any of the three hosts prior to the 15:42:02
corosync entries above, where xen1 disappeared.

Based on these entries from corosync.conf:

# How long before declaring a token lost (ms)
token: 3000
# How many token retransmits before forming a new configuration
token_retransmits_before_loss_const: 10

... am I correct in thinking that the two healthy nodes went at least 30
seconds without seeing anything from xen1, then started their fencing
attempts? In the complete absence of any logs with clues as to what
happened, I'm just trying to build a likely scenario. My guess is that
xen1 had some kind of panic, or lost the network on all four interfaces
(two iSCSI interfaces to the SBD and storage device, two bonded interfaces
facing the service side of things, where corosync talks). The Xen servers
are PowerEdge R815s and OpenManage is running, but no events or errors
were logged there either; no hardware issues logged at all.
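For what it's worth, the back-of-the-envelope math above can be sketched
as follows (this assumes, per the discussion, that the detection window is
roughly token * token_retransmits_before_loss_const; I haven't verified
that against the corosync source):

```shell
# Rough corosync failure-detection window from corosync.conf values.
# Assumption: window ~ token * token_retransmits_before_loss_const.
token_ms=3000
retransmits=10
window_s=$(( token_ms * retransmits / 1000 ))
echo "worst-case detection window: ~${window_s}s"
```

With the values from my corosync.conf that comes out to roughly 30
seconds, which matches what I'm seeing in the timestamps.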

In any case, this isn't a complaint or issue report, simply an "am I
thinking clearly about what may have happened?" message. If this ever
happens again, hopefully a serial console will provide answers. (Side
note: for that, I'm thinking minicom connected to both nodes and writing
logs; is there something nicer that I just don't know about?)
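In case it helps anyone searching the archive later, the minicom setup I
have in mind would be something like the following, one instance per node
(the device path and baud rate are examples; adjust to your wiring):

```shell
# Attach to the node's serial console and append everything seen
# to a capture file (-D device, -b baud rate, -C capture file).
minicom -D /dev/ttyUSB0 -b 115200 -C xen1-console.log
```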

Regards,
Mark


andrew at beekhof

Feb 23, 2012, 3:49 PM

Re: Crash this afternoon, want to verify I'm understanding this configuration correctly

On Fri, Feb 24, 2012 at 10:37 AM, mark - pacemaker list
<m+pacemaker [at] nerdish> wrote:
> [...]
>
> Based on these entries from corosync.conf:
>
>         # How long before declaring a token lost (ms)
>         token: 3000
>         # How many token retransmits before forming a new configuration
>         token_retransmits_before_loss_const: 10
>
> ... am I correct in thinking that the two healthy nodes went at least 30
> seconds without seeing anything from xen1, then started their fencing
> attempts?

Certainly sounds reasonable. Not 100% sure if the timeout is a simple
"token * retransmits" calculation but it wouldn't be far off.


_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
