Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Linux-HA: Pacemaker

pacemaker strange behaviour

 

 

Linux-HA pacemaker RSS feed   Index | Next | Previous | View Threaded


sivann at inaccess

May 11, 2012, 8:05 AM

Post #1 of 3 (337 views)
Permalink
pacemaker strange behaviour

Hello all,
Corosync/pacemaker with 3 nodes(vserver1, vserver2, vserver3), 28 resources
defined, with quorum, stonith (via ipmi/ilo).
Most of the time all these work correctly but when we shutdown servers and
try to restart them, then strange things may happen:

1) all servers could be stuck in an unclean state (unclean online itself,
unclean offline the others, OR ok itself, unclean offline the others). The
only way to resolve this is to shutdown the severs and start them with a
big time interval between, several times.

2) in the last startup of the servers, two servers were up, quorum was
true, and then when the 3rd started to boot one of the two issued stonith
to the 3rd, and then to the other one, and then to itself (the latter
failed). This happened several times. resulting in downtime.
In the logs below vserver1 should stonith vserver3 (since it's down) but it
should'nt stonith vserver2 since it has quorum with it.

We are in the dark, please help if you can.

Best Regards,
Spiros Ioannou

(from hb_report's events.txt)

May 10 20:08:58 vserver1 corosync[3965]: [MAIN ] Corosync Cluster Engine
('1.2.1'): started and ready to provide service.
May 10 20:42:25 vserver1 corosync[3965]: [pcmk ] info: pcmk_peer_update:
memb: vserver1 218169610
May 10 20:42:25 vserver1 crmd: [3999]: info: crm_update_quorum: Updating
quorum status to false (call=36)
May 10 20:48:44 vserver1 corosync[3965]: [pcmk ] info: pcmk_peer_update:
memb: vserver1 218169610
May 10 20:48:44 vserver1 crmd: [3999]: notice: ais_dispatch: Membership
4804: quorum acquired
May 10 20:48:44 *vserver1* crmd: [3999]: info: crm_update_quorum:* Updating
quorum status to true* (call=87)
May 10 20:48:44 vserver2 corosync[4202]: [MAIN ] Corosync Cluster Engine
('1.2.1'): started and ready to provide service.
May 10 20:48:45 vserver2 crmd: [4233]: notice: ais_dispatch: Membership
4804: quorum acquired
May 10 20:48:46 *vserver1* crmd: [3999]: info: crm_update_quorum:
Updating *quorum
status to true* (call=102)
May 10 20:48:53 vserver1 crmd: [3999]: info: te_fence_node: Executing
poweroff fencing operation (50) on vserver3 (timeout=180000)
May 10 20:48:53 vserver1 stonithd: [3994]: info: client tengine [pid: 3999]
requests a STONITH operation POWEROFF on node vserver3
May 10 20:49:00 vserver1 stonithd: [3994]: info: Succeeded to STONITH the
node vserver3: optype=POWEROFF. whodoit: vserver1
May 10 20:50:59 vserver1 crmd: [3999]: info: te_fence_node: Executing
poweroff fencing operation (36) on vserver2 (timeout=180000)
May 10 20:50:59 vserver1 stonithd: [3994]: info: client tengine [pid: 3999]
requests a STONITH operation POWEROFF on node vserver2
May 10 20:51:06 vserver1 stonithd: [3994]: info: Succeeded to STONITH the
node vserver2: optype=POWEROFF. whodoit: vserver1
May 10 20:51:06 vserver1 crmd: [3999]: info: te_fence_node: Executing
poweroff fencing operation (19) on vserver1 (timeout=180000)
May 10 20:51:06 vserver1 stonithd: [3994]: info: client tengine [pid: 3999]
requests a STONITH operation POWEROFF on node vserver1
May 10 20:51:10 vserver1 corosync[3965]: [pcmk ] info: pcmk_peer_update:
memb: vserver1 218169610
May 10 20:51:10 vserver1 corosync[3965]: [pcmk ] info: pcmk_peer_update:
lost: vserver2 184615178
May 10 20:51:10 vserver1 crmd: [3999]: notice: ais_dispatch: Membership
4808: quorum lost
May 10 20:51:10 vserver1 crmd: [3999]: info: crm_update_quorum: Updating
quorum status to false (call=149)
May 10 20:54:06 vserver1 stonithd: [3994]: ERROR: Failed to STONITH the
node vserver1: optype=POWEROFF, op_result=TIMEOUT

AND (analysis.txt)

May 10 20:42:00 vserver1 crmd: [3999]: ERROR: crm_timer_popped: Integration
Timer (I_INTEGRATED) just popped!
May 10 20:42:56 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction:
vm-websrv:monitor process (PID 4191) will not die!
May 10 20:42:57 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction:
vm-alm:monitor process (PID 4267) will not die!
May 10 20:42:57 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction:
vm-be:monitor process (PID 4268) will not die!
May 10 20:42:57 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction:
vm-cam:monitor process (PID 4269) will not die!
May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation vm-websrv_monitor_0 (20) Timed Out (timeout=20000ms)
May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation vm-alm_monitor_0 (10) Timed Out (timeout=20000ms)
May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation vm-be_monitor_0 (11) Timed Out (timeout=20000ms)
May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation vm-cam_monitor_0 (12) Timed Out (timeout=20000ms)
May 10 20:49:20 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-cam_start_0 (44) Timed Out (timeout=20000ms)
May 10 20:49:20 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-devops_start_0 (46) Timed Out (timeout=20000ms)
May 10 20:49:26 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-router_start_0 (48) Timed Out (timeout=20000ms)
May 10 20:49:27 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-zabbix_start_0 (49) Timed Out (timeout=20000ms)
May 10 20:49:40 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-ipsecgw_start_0 (51) Timed Out (timeout=20000ms)
May 10 20:50:01 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-cam_stop_0 (56) Timed Out (timeout=20000ms)
May 10 20:50:01 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-devops_stop_0 (58) Timed Out (timeout=20000ms)
May 10 20:50:01 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-router_stop_0 (59) Timed Out (timeout=20000ms)
May 10 20:50:22 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-zabbix_stop_0 (61) Timed Out (timeout=20000ms)
May 10 20:50:23 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-ipsecgw_stop_0 (63) Timed Out (timeout=20000ms)
May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order:
Invalid inputs (nil).(nil) 0x127c090.(nil)
May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order:
Invalid inputs (nil).(nil) 0x127c880.(nil)
May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order:
Invalid inputs (nil).(nil) 0x127cd30.(nil)
May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order:
Invalid inputs (nil).(nil) 0x127d1e0.(nil)


dejanmm at fastmail

May 14, 2012, 8:36 AM

Post #2 of 3 (310 views)
Permalink
Re: pacemaker strange behaviour [In reply to]

Hi,

On Fri, May 11, 2012 at 06:05:27PM +0300, Spiros Ioannou wrote:
> Hello all,
> Corosync/pacemaker with 3 nodes(vserver1, vserver2, vserver3), 28 resources
> defined, with quorum, stonith (via ipmi/ilo).
> Most of the time all these work correctly but when we shutdown servers and
> try to restart them, then strange things may happen:
>
> 1) all servers could be stuck in an unclean state (unclean online itself,
> unclean offline the others, OR ok itself, unclean offline the others). The
> only way to resolve this is to shutdown the severs and start them with a
> big time interval between, several times.
>
> 2) in the last startup of the servers, two servers were up, quorum was
> true, and then when the 3rd started to boot one of the two issued stonith
> to the 3rd, and then to the other one, and then to itself (the latter
> failed).

A node cannot fence itself. That's prohibited.

> This happened several times. resulting in downtime.
> In the logs below vserver1 should stonith vserver3 (since it's down) but it
> should'nt stonith vserver2 since it has quorum with it.
>
> We are in the dark, please help if you can.

Is this new, i.e. was this cluster working fine for a while
earlier? I suppose that you did some testing before going live.

Below some notes, but in general cannot offer much help.

> Best Regards,
> Spiros Ioannou
>
> (from hb_report's events.txt)
>
> May 10 20:08:58 vserver1 corosync[3965]: [MAIN ] Corosync Cluster Engine
> ('1.2.1'): started and ready to provide service.
> May 10 20:42:25 vserver1 corosync[3965]: [pcmk ] info: pcmk_peer_update:
> memb: vserver1 218169610
> May 10 20:42:25 vserver1 crmd: [3999]: info: crm_update_quorum: Updating
> quorum status to false (call=36)
> May 10 20:48:44 vserver1 corosync[3965]: [pcmk ] info: pcmk_peer_update:
> memb: vserver1 218169610
> May 10 20:48:44 vserver1 crmd: [3999]: notice: ais_dispatch: Membership
> 4804: quorum acquired
> May 10 20:48:44 *vserver1* crmd: [3999]: info: crm_update_quorum:* Updating
> quorum status to true* (call=87)
> May 10 20:48:44 vserver2 corosync[4202]: [MAIN ] Corosync Cluster Engine
> ('1.2.1'): started and ready to provide service.
> May 10 20:48:45 vserver2 crmd: [4233]: notice: ais_dispatch: Membership
> 4804: quorum acquired
> May 10 20:48:46 *vserver1* crmd: [3999]: info: crm_update_quorum:
> Updating *quorum
> status to true* (call=102)
> May 10 20:48:53 vserver1 crmd: [3999]: info: te_fence_node: Executing
> poweroff fencing operation (50) on vserver3 (timeout=180000)
> May 10 20:48:53 vserver1 stonithd: [3994]: info: client tengine [pid: 3999]
> requests a STONITH operation POWEROFF on node vserver3
> May 10 20:49:00 vserver1 stonithd: [3994]: info: Succeeded to STONITH the
> node vserver3: optype=POWEROFF. whodoit: vserver1
> May 10 20:50:59 vserver1 crmd: [3999]: info: te_fence_node: Executing
> poweroff fencing operation (36) on vserver2 (timeout=180000)
> May 10 20:50:59 vserver1 stonithd: [3994]: info: client tengine [pid: 3999]
> requests a STONITH operation POWEROFF on node vserver2
> May 10 20:51:06 vserver1 stonithd: [3994]: info: Succeeded to STONITH the
> node vserver2: optype=POWEROFF. whodoit: vserver1
> May 10 20:51:06 vserver1 crmd: [3999]: info: te_fence_node: Executing
> poweroff fencing operation (19) on vserver1 (timeout=180000)
> May 10 20:51:06 vserver1 stonithd: [3994]: info: client tengine [pid: 3999]
> requests a STONITH operation POWEROFF on node vserver1
> May 10 20:51:10 vserver1 corosync[3965]: [pcmk ] info: pcmk_peer_update:
> memb: vserver1 218169610
> May 10 20:51:10 vserver1 corosync[3965]: [pcmk ] info: pcmk_peer_update:
> lost: vserver2 184615178
> May 10 20:51:10 vserver1 crmd: [3999]: notice: ais_dispatch: Membership
> 4808: quorum lost
> May 10 20:51:10 vserver1 crmd: [3999]: info: crm_update_quorum: Updating
> quorum status to false (call=149)
> May 10 20:54:06 vserver1 stonithd: [3994]: ERROR: Failed to STONITH the
> node vserver1: optype=POWEROFF, op_result=TIMEOUT
>
> AND (analysis.txt)
>
> May 10 20:42:00 vserver1 crmd: [3999]: ERROR: crm_timer_popped: Integration
> Timer (I_INTEGRATED) just popped!
> May 10 20:42:56 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction:
> vm-websrv:monitor process (PID 4191) will not die!
> May 10 20:42:57 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction:
> vm-alm:monitor process (PID 4267) will not die!
> May 10 20:42:57 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction:
> vm-be:monitor process (PID 4268) will not die!
> May 10 20:42:57 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction:
> vm-cam:monitor process (PID 4269) will not die!

This seems to point to processes in the D state (waiting for
device). Your storage hanging/temporarily gone?

> May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation vm-websrv_monitor_0 (20) Timed Out (timeout=20000ms)
> May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation vm-alm_monitor_0 (10) Timed Out (timeout=20000ms)
> May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation vm-be_monitor_0 (11) Timed Out (timeout=20000ms)
> May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation vm-cam_monitor_0 (12) Timed Out (timeout=20000ms)
> May 10 20:49:20 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation email-vm-cam_start_0 (44) Timed Out (timeout=20000ms)
> May 10 20:49:20 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation email-vm-devops_start_0 (46) Timed Out (timeout=20000ms)
> May 10 20:49:26 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation email-vm-router_start_0 (48) Timed Out (timeout=20000ms)
> May 10 20:49:27 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation email-vm-zabbix_start_0 (49) Timed Out (timeout=20000ms)
> May 10 20:49:40 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation email-vm-ipsecgw_start_0 (51) Timed Out (timeout=20000ms)
> May 10 20:50:01 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation email-vm-cam_stop_0 (56) Timed Out (timeout=20000ms)
> May 10 20:50:01 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation email-vm-devops_stop_0 (58) Timed Out (timeout=20000ms)
> May 10 20:50:01 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation email-vm-router_stop_0 (59) Timed Out (timeout=20000ms)
> May 10 20:50:22 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation email-vm-zabbix_stop_0 (61) Timed Out (timeout=20000ms)
> May 10 20:50:23 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation email-vm-ipsecgw_stop_0 (63) Timed Out (timeout=20000ms)
> May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order:
> Invalid inputs (nil).(nil) 0x127c090.(nil)
> May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order:
> Invalid inputs (nil).(nil) 0x127c880.(nil)
> May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order:
> Invalid inputs (nil).(nil) 0x127cd30.(nil)
> May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order:
> Invalid inputs (nil).(nil) 0x127d1e0.(nil)

Hmm, that's rather cryptic. Don't know what does it mean.
Perhaps search the ML archives?

Thanks,

Dejan

> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


sivann at inaccess

May 15, 2012, 1:45 AM

Post #3 of 3 (312 views)
Permalink
Re: pacemaker strange behaviour [In reply to]

On Mon, May 14, 2012 at 6:36 PM, Dejan Muhamedagic <dejanmm [at] fastmail> wrote:
> Hi,
>
Hi, thanks for your time to review this.

>
> A node cannot fence itself. That's prohibited.
yes, that is normal The question is why it tried to fence itself (and
the other).

> Is this new, i.e. was this cluster working fine for a while
> earlier? I suppose that you did some testing before going live.

It's not new, but it seems to happen in this setup. A similar setup in
another datacenter doesn't seem to have this behaviour.

> Below some notes, but in general cannot offer much help.
...
> This seems to point to processes in the D state (waiting for
> device). Your storage hanging/temporarily gone?
yes, they are VMs running on iscsi targets

>
> Hmm, that's rather cryptic. Don't know what does it mean.
> Perhaps search the ML archives?
no much luck there..

>
> Thanks,
>
> Dejan

Thanks,
-Spiros

_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Linux-HA pacemaker RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.