
Mailing List Archive: Linux-HA: Pacemaker

Sporadic problems of rejoin after split brain situation

 

 



roman.sidler at sidler-se

Mar 18, 2012, 4:53 PM

Post #1 of 5 (639 views)
Sporadic problems of rejoin after split brain situation

Hi all
No, we have no crossover cable between the nodes.
The two nodes are linked by a switched network, and this works fine except
in the case mentioned.

It's rather easy to reproduce.
1. Activate a 2-node cluster
2. Disconnect the network connection (e.g. by disconnecting the network adapter on the VM)
3. Wait until both nodes are active and each acts as DC
4. Reconnect the nodes
5. A new DC is elected

When step 4 encounters an unsteady network, step 5 is sometimes never reached and
both nodes stay DC, even though they are still sending and receiving heartbeat
status messages.

The situation can be simulated by repeatedly disconnecting and reconnecting the link.

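For illustration, a rough way to produce this kind of link flapping on one node (eth0,
the loop count and the sleep ranges are placeholders; adapt them to the real interface
and the configured deadtime):

  # flap the cluster interface to provoke the unsteady-network case (run as root)
  for i in $(seq 1 10); do
      ip link set eth0 down            # simulate pulling the network adapter
      sleep $((RANDOM % 10 + 5))       # stay partitioned long enough for both sides to become DC
      ip link set eth0 up              # reconnect
      sleep $((RANDOM % 5 + 1))        # short, jittery "up" phase to keep the rejoin unsteady
  done
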
Versions:
pacemaker 1.1.6 (and 1.0.5)
heartbeat 3.0.7 (and 3.0.0)

Are there any ideas?
Or does anybody have hints on how I could localize or debug this problem?

Thanks!
Kind regards
Roman



The logs are a little different every time; here are two examples:

*** pacemaker 1.0.5 ***
Mar 8 07:05:50 LAB19 heartbeat: [2979]: CRIT: Cluster node lab13 returning
after partition.
Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: For information on cluster
partitions, See URL: http://linux-ha.org/SplitBrain
Mar 8 07:05:50 LAB19 heartbeat: [2979]: WARN: Deadtime value may be too small.
Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: See FAQ for information on tuning
deadtime.
Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: URL: http://linux-
ha.org/FAQ#heavy_load
Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: Link lab13:eth0 up.
Mar 8 07:05:50 LAB19 heartbeat: [2979]: WARN: Late heartbeat: Node lab13:
interval 244130 ms
Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: Status update for node lab13:
status active
Mar 8 07:05:50 LAB19 crmd: [3083]: notice: crmd_ha_status_callback: Status
update: Node lab13 now has status [active] (DC=true)
Mar 8 07:05:50 LAB19 crmd: [3083]: info: crm_update_peer_proc: lab13.ais is now
online
Mar 8 07:05:50 LAB19 cib: [3079]: WARN: cib_peer_callback: Discarding
cib_apply_diff message (14a33) from lab13: not in our membership
Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: all clients are now paused
Mar 8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
Mar 8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: Got an event
OC_EV_MS_INVALID from ccm
Mar 8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: no mbr_track info
Mar 8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: Got an event
OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: instance=15,
nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar 8 07:05:51 LAB19 crmd: [3083]: info: crmd_ccm_msg_callback: Quorum
(re)attained after event=NEW MEMBERSHIP (id=15)
Mar 8 07:05:51 LAB19 crmd: [3083]: info: ccm_event_detail: NEW MEMBERSHIP:
trans=15, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
Mar 8 07:05:51 LAB19 crmd: [3083]: info: ccm_event_detail: CURRENT: lab19
[nodeid=1, born=15]
Mar 8 07:05:51 LAB19 crmd: [3083]: info: populate_cib_nodes_ha: Requesting the
list of configured nodes
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event
OC_EV_MS_INVALID from ccm
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event
OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=15, nodes=1,
new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar 8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM
event=NEW MEMBERSHIP (id=15)
Mar 8 07:05:51 LAB19 heartbeat: [2979]: info: all clients are now resumed
Mar 8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
Mar 8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event
OC_EV_MS_INVALID from ccm
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event
OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=16, nodes=1,
new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar 8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM
event=NEW MEMBERSHIP (id=16)
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event
OC_EV_MS_INVALID from ccm
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event
OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=17, nodes=1,
new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar 8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM
event=NEW MEMBERSHIP (id=17)
Mar 8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: Got an event
OC_EV_MS_INVALID from ccm
Mar 8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: no mbr_track info
Mar 8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: Got an event
OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: instance=16,
nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
:

*** pacemaker 1.1.6 ***
Mar 16 01:35:28 oan1 heartbeat: [7556]: CRIT: Cluster node oan3 returning after
partition.
Mar 16 01:35:28 oan1 heartbeat: [7556]: info: For information on cluster
partitions, See URL: http://linux-ha.org/wiki/Split_Brain
Mar 16 01:35:28 oan1 heartbeat: [7556]: WARN: Deadtime value may be too small.
Mar 16 01:35:28 oan1 heartbeat: [7556]: info: See FAQ for information on tuning
deadtime.
Mar 16 01:35:28 oan1 heartbeat: [7556]: info: URL: http://linux-
ha.org/wiki/FAQ#Heavy_Load
Mar 16 01:35:28 oan1 heartbeat: [7556]: info: Link oan3:eth0 up.
Mar 16 01:35:28 oan1 heartbeat: [7556]: WARN: Late heartbeat: Node oan3:
interval 73300 ms
Mar 16 01:35:28 oan1 heartbeat: [7556]: info: Status update for node oan3:
status active
Mar 16 01:35:28 oan1 crmd: [7601]: notice: crmd_ha_status_callback: Status
update: Node oan3 now has status [active]
Mar 16 01:35:28 oan1 crmd: [7601]: info: crm_update_peer_proc: oan3.ais is now
online
Mar 16 01:35:28 oan1 crmd: [7601]: WARN: match_down_event: No match for shutdown
action on a9dae67b-b98f-498b-ad1b-950fc73c2ab2
Mar 16 01:35:28 oan1 crmd: [7601]: info: te_update_diff: Stonith/shutdown of
a9dae67b-b98f-498b-ad1b-950fc73c2ab2 not matched
Mar 16 01:35:28 oan1 crmd: [7601]: info: abort_transition_graph:
te_update_diff:215 - Triggered transition abort (complete=1, tag=node_state,
id=a9dae67b-b98f-498b-ad1b-950fc73c2ab2, magic=NA, cib=0.210.50) : Node failure
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition
S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: All 1 cluster
nodes are eligible to run resources.
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_pe_invoke: Query 204: Requesting the
current CIB: S_POLICY_ENGINE
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_pe_invoke_callback: Invoking the PE:
query=204, ref=pe_calc-dc-1331858128-123, seq=13, quorate=1
Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave
rsc_clustercontrol (Started oan1)
Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave rscIpaddr2
(Started oan1)
Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave rsc_asterisk
(Started oan1)
Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave rsc_tomcat5
(Started oan1)
Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave
rsc_ntp_refresh (Started oan1)
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition
S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
origin=handle_response ]
Mar 16 01:35:28 oan1 crmd: [7601]: info: unpack_graph: Unpacked transition 16: 0
actions in 0 synapses
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_te_invoke: Processing graph 16
(ref=pe_calc-dc-1331858128-123) derived from /var/lib/pengine/pe-input-201.bz2
Mar 16 01:35:28 oan1 pengine: [17673]: notice: process_pe_message: Transition
16: PEngine Input stored in: /var/lib/pengine/pe-input-201.bz2
Mar 16 01:35:28 oan1 crmd: [7601]: info: run_graph:
====================================================
Mar 16 01:35:28 oan1 crmd: [7601]: notice: run_graph: Transition 16 (Complete=0,
Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-
201.bz2): Complete
Mar 16 01:35:28 oan1 crmd: [7601]: info: te_graph_trigger: Transition 16 is now
complete
Mar 16 01:35:28 oan1 crmd: [7601]: info: notify_crmd: Transition 16 status: done
- <null>
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL
origin=notify_crmd ]
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: Starting PEngine
Recheck Timer
Mar 16 01:35:30 oan1 cib: [7597]: info: cib_process_diff: Diff 0.210.56 ->
0.210.57 not applied to 0.210.50: current "num_updates" is less than required
Mar 16 01:35:30 oan1 cib: [7597]: WARN: cib_server_process_diff: Not requesting
full refresh in R/W mode
Mar 16 01:35:30 oan1 ccm: [7596]: info: Break tie for 2 nodes cluster
Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: Got an event
OC_EV_MS_INVALID from ccm
Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: no mbr_track info
Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: Got an event
OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: instance=14, nodes=1,
new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar 16 01:35:30 oan1 cib: [7597]: info: cib_ccm_msg_callback: Processing CCM
event=NEW MEMBERSHIP (id=14)
Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: Got an event
OC_EV_MS_INVALID from ccm
Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: no mbr_track info
Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: Got an event
OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: instance=14, nodes=1,
new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar 16 01:35:30 oan1 crmd: [7601]: info: crmd_ccm_msg_callback: Quorum
(re)attained after event=NEW MEMBERSHIP (id=14)
Mar 16 01:35:30 oan1 crmd: [7601]: info: ccm_event_detail: NEW MEMBERSHIP:
trans=14, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
Mar 16 01:35:30 oan1 crmd: [7601]: info: ccm_event_detail: CURRENT: oan1
[nodeid=0, born=14]
Mar 16 01:35:30 oan1 crmd: [7601]: info: populate_cib_nodes_ha: Requesting the
list of configured nodes
Mar 16 01:35:31 oan1 ccm: [7596]: info: Break tie for 2 nodes cluster
Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: Got an event
OC_EV_MS_INVALID from ccm
Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: no mbr_track info
Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: Got an event
OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: instance=15, nodes=1,
new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: Got an event
OC_EV_MS_INVALID from ccm
Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: no mbr_track info
Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: Got an event
OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: instance=15, nodes=1,
new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar 16 01:35:31 oan1 cib: [7597]: info: cib_ccm_msg_callback: Processing CCM
event=NEW MEMBERSHIP (id=15)
Mar 16 01:35:31 oan1 crmd: [7601]: info: crmd_ccm_msg_callback: Quorum
(re)attained after event=NEW MEMBERSHIP (id=15)
Mar 16 01:35:31 oan1 crmd: [7601]: info: ccm_event_detail: NEW MEMBERSHIP:
trans=15, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
Mar 16 01:35:31 oan1 crmd: [7601]: info: ccm_event_detail: CURRENT: oan1
[nodeid=0, born=15]
Mar 16 01:35:31 oan1 cib: [7597]: info: cib_process_request: Operation complete:
op cib_modify for section nodes (origin=local/crmd/205, version=0.210.51): ok
(rc=0)
Mar 16 01:35:31 oan1 crmd: [7601]: info: populate_cib_nodes_ha: Requesting the
list of configured nodes



_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


andrew at beekhof

Mar 18, 2012, 9:01 PM

Post #2 of 5 (606 views)
Re: Sporadic problems of rejoin after split brain situation [In reply to]

On Mon, Mar 19, 2012 at 10:53 AM, Roman Sidler
<roman.sidler [at] sidler-se> wrote:
> Hi all
> No, we have no crossover cable between the nodes.
> The two nodes are linked by a switched network, and this works fine except
> in the case mentioned.
>
> It's rather easy to reproduce.
> 1. Activate a 2-node cluster
> 2. Disconnect the network connection (e.g. by disconnecting the network adapter on the VM)
> 3. Wait until both nodes are active and each acts as DC
> 4. Reconnect the nodes
> 5. A new DC is elected
>
> When step 4 encounters an unsteady network, step 5 is sometimes never reached and
> both nodes stay DC, even though they are still sending and receiving heartbeat
> status messages.

The CCM (part of heartbeat) is known not to be particularly awesome in
such scenarios and it appears to be the case again here:

Mar 16 01:35:31 oan1 crmd: [7601]: info: ccm_event_detail: NEW MEMBERSHIP:
trans=15, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3

As far as pacemaker knows, it's still all alone.

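(For what it's worth, this is easy to confirm by running something like the following on
both nodes and comparing the output; a rough sketch, the peer node name is a placeholder.)

  crm_mon -1 | head -n 15       # one-shot status: each partition lists itself as DC with 1 member
  cl_status listnodes           # the nodes heartbeat knows about
  cl_status nodestatus lab13    # how heartbeat currently sees the peer
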
Perhaps give corosync 1.4.x a go instead of heartbeat?

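For reference, a minimal corosync 1.x configuration for a two-node cluster with the
pacemaker plugin might look roughly like the sketch below (subnet, multicast address
and port are placeholders, not a drop-in config):

  cat > /etc/corosync/corosync.conf <<'EOF'
  totem {
      version: 2
      secauth: off
      interface {
          ringnumber: 0
          bindnetaddr: 192.168.1.0   # network address of the cluster interface
          mcastaddr: 226.94.1.1
          mcastport: 5405
      }
  }
  service {
      name: pacemaker              # let the corosync plugin start pacemaker
      ver: 0
  }
  logging {
      to_syslog: yes
  }
  EOF
  /etc/init.d/corosync start
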
>
> [rest of quoted message and logs snipped]

_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


lars.ellenberg at linbit

Mar 20, 2012, 2:46 PM

Post #4 of 5 (599 views)
Re: Sporadic problems of rejoin after split brain situation [In reply to]

On Sun, Mar 18, 2012 at 11:53:11PM +0000, Roman Sidler wrote:
> Hi all
> No, we have no crossover cable between the nodes.
> The two nodes are linked by a switched network, and this works fine except
> in the case mentioned.
>
> It's rather easy to reproduce.
> 1. Activate a 2-node cluster
> 2. Disconnect the network connection (e.g. by disconnecting the network adapter on the VM)
> 3. Wait until both nodes are active and each acts as DC
> 4. Reconnect the nodes
> 5. A new DC is elected
>
> When step 4 encounters an unsteady network, step 5 is sometimes never reached and
> both nodes stay DC, even though they are still sending and receiving heartbeat
> status messages.
>
> The situation can be simulated by repeatedly disconnecting and reconnecting the link.
>
> Versions:
> pacemaker 1.1.6 (and 1.0.5)
> heartbeat 3.0.7 (and 3.0.0)

There is no heartbeat 3.0.7 (yet), so I guess you meant 2.0.7.
And no doubt that has a few problems.

You are supposed to use the latest heartbeat (now: 3.0.5):
http://hg.linux-ha.org/heartbeat-STABLE_3_0/
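
A quick way to double-check what is actually installed, and to fetch that tree for a
source build (package query depends on the distro; only a sketch):

  rpm -q heartbeat 2>/dev/null || dpkg -l '*heartbeat*'    # which heartbeat build each node really runs
  hg clone http://hg.linux-ha.org/heartbeat-STABLE_3_0/    # the STABLE_3_0 tree mentioned above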

"Unsteady network": I suppose that means at least packet loss.
There have been long standing bugs with retransmit and packet loss
in pacemaker which I only fixed in 3.0.5.

There is one more problem in that area I'm aware of, which is much less
likely to trigger, but if it does, you'll know, as heartbeat will start
spinning with high cpu load, and not even notice that a peer node is dead.

It was introduced in 2005 or so; it is a glib mainloop callback
priority inversion involving timeout-based events, only triggers
sometimes, and may even need special config parameters in ha.cf to
trigger at all (plus massive packet loss ...). It is fixed as well,
but as I have only had one complaint that remotely matched these
symptoms, the fix is still not in the repos.

Hopefully we find some time to clean that up
and release it with heartbeat 3.0.6.

As we finally noticed that some things still packaged with heartbeat
actually belong in glue, we seem to have to cut a release anyway,
"soon".

BTW, I've seen similar behaviour with corosync as well.
In fact, for me, this exact scenario (node returning after having been
declared dead) typically does NOT work with pacemaker on corosync,
while it always worked with pacemaker on heartbeat...

Still, Andrew is right: for new clusters, corosync is the way to go.

Not so long ago, I posted some corosync vs heartbeat summary
from my current perspective:
http://permalink.gmane.org/gmane.linux.highavailability.user/36903

> *** pacemaker 1.0.5 ***

That's not the most recent, either.
iirc, there have been some fixes in pacemaker as well,
in the area of rejoining after partition.

But really: you need to fix your "unsteady" network,
and probably should implement stonith.

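For the stonith part, a minimal sketch with the crm shell and an IPMI-based fence device
(the agent, node names, addresses and credentials below are placeholders for whatever the
hardware actually offers):

  crm configure primitive st-lab13 stonith:external/ipmi \
      params hostname=lab13 ipaddr=10.0.0.13 userid=admin passwd=secret
  crm configure location l-st-lab13 st-lab13 -inf: lab13   # never run a node's fence device on itself
  crm configure property stonith-enabled=true
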
> [quoted logs snipped]

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


andrew at beekhof

Mar 20, 2012, 3:02 PM

Post #5 of 5 (602 views)
Re: Sporadic problems of rejoin after split brain situation [In reply to]

On Wed, Mar 21, 2012 at 8:46 AM, Lars Ellenberg
<lars.ellenberg [at] linbit> wrote:
> On Sun, Mar 18, 2012 at 11:53:11PM +0000, Roman Sidler wrote:
>> Hi all
>> No, we have no crossover cable between the nodes.
>> The two nodes are linked by a switched network, and this works fine except
>> in the case mentioned.
>>
>> It's rather easy to reproduce.
>> 1. Activate a 2-node cluster
>> 2. Disconnect the network connection (e.g. by disconnecting the network adapter on the VM)
>> 3. Wait until both nodes are active and each acts as DC
>> 4. Reconnect the nodes
>> 5. A new DC is elected
>>
>> When step 4 encounters an unsteady network, step 5 is sometimes never reached and
>> both nodes stay DC, even though they are still sending and receiving heartbeat
>> status messages.
>>
>> The situation can be simulated by repeatedly disconnecting and reconnecting the link.
>>
>> Versions:
>> pacemaker 1.1.6 (and 1.0.5)
>> heartbeat 3.0.7 (and 3.0.0)
>
> There is no heartbeat 3.0.7 (yet), so I guess you meant 2.0.7.
> And no doubt that has a few problems.
>
> You are supposed to use the latest heartbeat (currently 3.0.5)
> http://hg.linux-ha.org/heartbeat-STABLE_3_0/
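If it is unclear what is actually installed, the packaged version and a build
from that tree can be checked/obtained roughly like this; the package-manager
commands and the ./bootstrap step are assumptions, adjust them to your distro
and to what the checkout actually ships:

    # what is currently installed?
    rpm -q heartbeat pacemaker       # or: dpkg -l heartbeat pacemaker
    # fetch and build the current STABLE_3_0 tree
    hg clone http://hg.linux-ha.org/heartbeat-STABLE_3_0/
    cd heartbeat-STABLE_3_0
    ./bootstrap && ./configure && make && make install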
>
> "Unsteady network": I suppose that means at least packet loss.
> There have been long-standing bugs with retransmit and packet loss
> in heartbeat which I only fixed in 3.0.5.
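For reproducing those conditions on purpose, a lossy link can be emulated on
one node's interface with netem; the interface name, loss rate and delay below
are only placeholders:

    # make eth0 drop and delay traffic for the duration of the test
    tc qdisc add dev eth0 root netem loss 30% delay 200ms
    # restore the interface afterwards
    tc qdisc del dev eth0 root netem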
>
> There is one more problem in that area I'm aware of, which is much less
> likely to trigger, but if it does, you'll know: heartbeat will start
> spinning with high cpu load and not even notice that a peer node is dead.
>
> It was introduced in 2005 or so, and is a glib mainloop callback
> priority inversion involving timeout-based events. It only triggers
> sometimes, may even need special config parameters in ha.cf to trigger
> at all (and needs massive packet loss ...). It has been fixed as
> well, but as I only had one complaint that remotely matched these
> symptoms, the fix is still not in the repos.
>
> Hopefully we find some time to clean that up
> and release it with heartbeat 3.0.6.
>
> As we finally noticed that some things still packaged with heartbeat
> actually belong in glue, it seems we will have to cut a release anyway,
> "soon".
>
> BTW, I've seen similar behaviour with corosync as well.
> In fact, for me, this exact scenario (node returning after having been
> declared dead) typically does NOT work with pacemaker on corosync,
> while it always worked with pacemaker on heartbeat...

I believe things are much improved with more recent releases.
Hence the recommendation of corosync 1.4.x :-)
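For reference, once a node is on the corosync stack, the running versions can
be confirmed with:

    corosync -v
    crm_mon --version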

>
> Still, Andrew is right: for new clusters, corosync is the way to go.
>
> Not so long ago, I posted some corosync vs heartbeat summary
> from my current perspective:
> http://permalink.gmane.org/gmane.linux.highavailability.user/36903
>
>> *** pacemaker 1.0.5 ***
>
> That's not the most recent, either.
> IIRC, there have been some fixes in pacemaker as well,
> in the area of rejoining after a partition.
>
> But really: you need to fix your "unsteady" network,
> and probably should implement stonith.
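As an illustration of the fencing part, on this stack that usually means one
stonith primitive per node plus enabling the cluster property; the agent,
address and credentials below are placeholders only, with external/ipmi picked
as just one common choice:

    crm configure primitive fence-lab13 stonith:external/ipmi \
            params hostname=lab13 ipaddr=192.0.2.13 userid=admin \
            passwd=secret interface=lan
    crm configure location fence-lab13-not-on-lab13 fence-lab13 -inf: lab13
    crm configure property stonith-enabled=true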
>
>> Mar  8 07:05:50 LAB19 heartbeat: [2979]: CRIT: Cluster node lab13 returning
>> after partition.
>> Mar  8 07:05:50 LAB19 heartbeat: [2979]: info: For information on cluster
>> partitions, See URL: http://linux-ha.org/SplitBrain
>> Mar  8 07:05:50 LAB19 heartbeat: [2979]: WARN: Deadtime value may be too small.
>> Mar  8 07:05:50 LAB19 heartbeat: [2979]: info: See FAQ for information on tuning
>> deadtime.
>> Mar  8 07:05:50 LAB19 heartbeat: [2979]: info: URL: http://linux-
>> ha.org/FAQ#heavy_load
>> Mar  8 07:05:50 LAB19 heartbeat: [2979]: info: Link lab13:eth0 up.
>> Mar  8 07:05:50 LAB19 heartbeat: [2979]: WARN: Late heartbeat: Node lab13:
>> interval 244130 ms
>> Mar  8 07:05:50 LAB19 heartbeat: [2979]: info: Status update for node lab13:
>> status active
>> Mar  8 07:05:50 LAB19 crmd: [3083]: notice: crmd_ha_status_callback: Status
>> update: Node lab13 now has status [active] (DC=true)
>> Mar  8 07:05:50 LAB19 crmd: [3083]: info: crm_update_peer_proc: lab13.ais is now
>> online
>> Mar  8 07:05:50 LAB19 cib: [3079]: WARN: cib_peer_callback: Discarding
>> cib_apply_diff message (14a33) from lab13: not in our membership
>> Mar  8 07:05:50 LAB19 heartbeat: [2979]: info: all clients are now paused
>> Mar  8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
>> Mar  8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: Got an event
>> OC_EV_MS_INVALID from ccm
>> Mar  8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: no mbr_track info
>> Mar  8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: Got an event
>> OC_EV_MS_NEW_MEMBERSHIP from ccm
>> Mar  8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: instance=15,
>> nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
>> Mar  8 07:05:51 LAB19 crmd: [3083]: info: crmd_ccm_msg_callback: Quorum
>> (re)attained after event=NEW MEMBERSHIP (id=15)
>> Mar  8 07:05:51 LAB19 crmd: [3083]: info: ccm_event_detail: NEW MEMBERSHIP:
>> trans=15, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
>> Mar  8 07:05:51 LAB19 crmd: [3083]: info: ccm_event_detail: CURRENT: lab19
>> [nodeid=1, born=15]
>> Mar  8 07:05:51 LAB19 crmd: [3083]: info: populate_cib_nodes_ha: Requesting the
>> list of configured nodes
>> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event
>> OC_EV_MS_INVALID from ccm
>> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
>> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event
>> OC_EV_MS_NEW_MEMBERSHIP from ccm
>> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=15, nodes=1,
>> new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
>> Mar  8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM
>> event=NEW MEMBERSHIP (id=15)
>> Mar  8 07:05:51 LAB19 heartbeat: [2979]: info: all clients are now resumed
>> Mar  8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
>> Mar  8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
>> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event
>> OC_EV_MS_INVALID from ccm
>> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
>> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event
>> OC_EV_MS_NEW_MEMBERSHIP from ccm
>> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=16, nodes=1,
>> new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
>> Mar  8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM
>> event=NEW MEMBERSHIP (id=16)
>> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event
>> OC_EV_MS_INVALID from ccm
>> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
>> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event
>> OC_EV_MS_NEW_MEMBERSHIP from ccm
>> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=17, nodes=1,
>> new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
>> Mar  8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM
>> event=NEW MEMBERSHIP (id=17)
>> Mar  8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: Got an event
>> OC_EV_MS_INVALID from ccm
>> Mar  8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: no mbr_track info
>> Mar  8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: Got an event
>> OC_EV_MS_NEW_MEMBERSHIP from ccm
>> Mar  8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: instance=16,
>> nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
>> :
>>
>> *** pacemaker 1.1.6 ***
>> Mar 16 01:35:28 oan1 heartbeat: [7556]: CRIT: Cluster node oan3 returning after
>> partition.
>> Mar 16 01:35:28 oan1 heartbeat: [7556]: info: For information on cluster
>> partitions, See URL: http://linux-ha.org/wiki/Split_Brain
>> Mar 16 01:35:28 oan1 heartbeat: [7556]: WARN: Deadtime value may be too small.
>> Mar 16 01:35:28 oan1 heartbeat: [7556]: info: See FAQ for information on tuning
>> deadtime.
>> Mar 16 01:35:28 oan1 heartbeat: [7556]: info: URL: http://linux-
>> ha.org/wiki/FAQ#Heavy_Load
>> Mar 16 01:35:28 oan1 heartbeat: [7556]: info: Link oan3:eth0 up.
>> Mar 16 01:35:28 oan1 heartbeat: [7556]: WARN: Late heartbeat: Node oan3:
>> interval 73300 ms
>> Mar 16 01:35:28 oan1 heartbeat: [7556]: info: Status update for node oan3:
>> status active
>> Mar 16 01:35:28 oan1 crmd: [7601]: notice: crmd_ha_status_callback: Status
>> update: Node oan3 now has status [active]
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: crm_update_peer_proc: oan3.ais is now
>> online
>> Mar 16 01:35:28 oan1 crmd: [7601]: WARN: match_down_event: No match for shutdown
>> action on a9dae67b-b98f-498b-ad1b-950fc73c2ab2
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: te_update_diff: Stonith/shutdown of
>> a9dae67b-b98f-498b-ad1b-950fc73c2ab2 not matched
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: abort_transition_graph:
>> te_update_diff:215 - Triggered transition abort (complete=1, tag=node_state,
>> id=a9dae67b-b98f-498b-ad1b-950fc73c2ab2, magic=NA, cib=0.210.50) : Node failure
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition
>> S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
>> origin=abort_transition_graph ]
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: All 1 cluster
>> nodes are eligible to run resources.
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_pe_invoke: Query 204: Requesting the
>> current CIB: S_POLICY_ENGINE
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_pe_invoke_callback: Invoking the PE:
>> query=204, ref=pe_calc-dc-1331858128-123, seq=13, quorate=1
>> Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave
>> rsc_clustercontrol    (Started oan1)
>> Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave   rscIpaddr2
>> (Started oan1)
>> Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave   rsc_asterisk
>> (Started oan1)
>> Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave   rsc_tomcat5
>> (Started oan1)
>> Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave
>> rsc_ntp_refresh       (Started oan1)
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition
>> S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
>> origin=handle_response ]
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: unpack_graph: Unpacked transition 16: 0
>> actions in 0 synapses
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_te_invoke: Processing graph 16
>> (ref=pe_calc-dc-1331858128-123) derived from /var/lib/pengine/pe-input-201.bz2
>> Mar 16 01:35:28 oan1 pengine: [17673]: notice: process_pe_message: Transition
>> 16: PEngine Input stored in: /var/lib/pengine/pe-input-201.bz2
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: run_graph:
>> ====================================================
>> Mar 16 01:35:28 oan1 crmd: [7601]: notice: run_graph: Transition 16 (Complete=0,
>> Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-
>> 201.bz2): Complete
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: te_graph_trigger: Transition 16 is now
>> complete
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: notify_crmd: Transition 16 status: done
>> - <null>
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition
>> S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL
>> origin=notify_crmd ]
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: Starting PEngine
>> Recheck Timer
>> Mar 16 01:35:30 oan1 cib: [7597]: info: cib_process_diff: Diff 0.210.56 ->
>> 0.210.57 not applied to 0.210.50: current "num_updates" is less than required
>> Mar 16 01:35:30 oan1 cib: [7597]: WARN: cib_server_process_diff: Not requesting
>> full refresh in R/W mode
>> Mar 16 01:35:30 oan1 ccm: [7596]: info: Break tie for 2 nodes cluster
>> Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: Got an event
>> OC_EV_MS_INVALID from ccm
>> Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: no mbr_track info
>> Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: Got an event
>> OC_EV_MS_NEW_MEMBERSHIP from ccm
>> Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: instance=14, nodes=1,
>> new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
>> Mar 16 01:35:30 oan1 cib: [7597]: info: cib_ccm_msg_callback: Processing CCM
>> event=NEW MEMBERSHIP (id=14)
>> Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: Got an event
>> OC_EV_MS_INVALID from ccm
>> Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: no mbr_track info
>> Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: Got an event
>> OC_EV_MS_NEW_MEMBERSHIP from ccm
>> Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: instance=14, nodes=1,
>> new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
>> Mar 16 01:35:30 oan1 crmd: [7601]: info: crmd_ccm_msg_callback: Quorum
>> (re)attained after event=NEW MEMBERSHIP (id=14)
>> Mar 16 01:35:30 oan1 crmd: [7601]: info: ccm_event_detail: NEW MEMBERSHIP:
>> trans=14, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
>> Mar 16 01:35:30 oan1 crmd: [7601]: info: ccm_event_detail:    CURRENT: oan1
>> [nodeid=0, born=14]
>> Mar 16 01:35:30 oan1 crmd: [7601]: info: populate_cib_nodes_ha: Requesting the
>> list of configured nodes
>> Mar 16 01:35:31 oan1 ccm: [7596]: info: Break tie for 2 nodes cluster
>> Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: Got an event
>> OC_EV_MS_INVALID from ccm
>> Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: no mbr_track info
>> Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: Got an event
>> OC_EV_MS_NEW_MEMBERSHIP from ccm
>> Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: instance=15, nodes=1,
>> new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
>> Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: Got an event
>> OC_EV_MS_INVALID from ccm
>> Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: no mbr_track info
>> Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: Got an event
>> OC_EV_MS_NEW_MEMBERSHIP from ccm
>> Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: instance=15, nodes=1,
>> new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
>> Mar 16 01:35:31 oan1 cib: [7597]: info: cib_ccm_msg_callback: Processing CCM
>> event=NEW MEMBERSHIP (id=15)
>> Mar 16 01:35:31 oan1 crmd: [7601]: info: crmd_ccm_msg_callback: Quorum
>> (re)attained after event=NEW MEMBERSHIP (id=15)
>> Mar 16 01:35:31 oan1 crmd: [7601]: info: ccm_event_detail: NEW MEMBERSHIP:
>> trans=15, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
>> Mar 16 01:35:31 oan1 crmd: [7601]: info: ccm_event_detail:    CURRENT: oan1
>> [nodeid=0, born=15]
>> Mar 16 01:35:31 oan1 cib: [7597]: info: cib_process_request: Operation complete:
>> op cib_modify for section nodes (origin=local/crmd/205, version=0.210.51): ok
>> (rc=0)
>> Mar 16 01:35:31 oan1 crmd: [7601]: info: populate_cib_nodes_ha: Requesting the
>> list of configured nodes
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
