Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Linux-HA: Pacemaker

Nodes will not promote DRBD resources to master on failover

 

 

Linux-HA pacemaker RSS feed   Index | Next | Previous | View Threaded


amartin at xes-inc

Jan 25, 2012, 11:58 AM

Post #1 of 16 (1915 views)
Permalink
Nodes will not promote DRBD resources to master on failover

Hello,


Recently I finished configuring a two-node cluster with pacemaker 1.1.6 and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster includes the following resources:
- primitives for DRBD storage devices
- primitives for mounting the filesystem on the DRBD storage
- primitives for some mount binds
- primitive for starting apache
- primitives for starting samba and nfs servers (following instructions here )
- primitives for exporting nfs shares (ocf:heartbeat:exportfs)


Perhaps this is best described through the output of crm_mon:

Online: [ node1 node2 ]


Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
p_drbd_mount1:0 (ocf::linbit:drbd): Started node2 (unmanaged)
p_drbd_mount1:1 (ocf::linbit:drbd): Started node1 (unmanaged) FAILED
Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
p_drbd_mount2:0 (ocf::linbit:drbd): Master node1 (unmanaged) FAILED
Slaves: [ node2 ]
Resource Group: g_core
p_fs_mount1 (ocf::heartbeat:Filesystem): Started node1
p_fs_mount2 (ocf::heartbeat:Filesystem): Started node1
p_ip_nfs (ocf::heartbeat:IPaddr2): Started node1
Resource Group: g_apache
p_fs_mountbind1 (ocf::heartbeat:Filesystem): Started node1
p_fs_mountbind2 (ocf::heartbeat:Filesystem): Started node1
p_fs_mountbind3 (ocf::heartbeat:Filesystem): Started node1
p_fs_varwww (ocf::heartbeat:Filesystem): Started node1
p_apache (ocf::heartbeat:apache): Started node1
Resource Group: g_fileservers
p_lsb_smb (lsb:smbd): Started node1
p_lsb_nmb (lsb:nmbd): Started node1
p_lsb_nfsserver (lsb:nfs-kernel-server): Started node1
p_exportfs_mount1 (ocf::heartbeat:exportfs): Started node1
p_exportfs_mount2 (ocf::heartbeat:exportfs): Started node1


I have read through the Pacemaker Explained documentation, however could not find a way to further debug these problems. First, I put node1 into standby mode to attempt failover to the other node (node2). Node2 appeared to start the transition to master, however it failed to promote the DRBD resources to master (the first step). I have attached a copy of this session in commands.log and additional excerpts from /var/log/syslog during important steps. I have attempted everything I can think of to try and start the DRBD resource (e.g. start/stop/promote/manage/cleanup under crm resource , restarting heartbeat) but cannot bring it out of the slave state. However, if I set it to unmanaged and then run drbdadm primary all in the terminal, pacemaker is satisfied and continues starting the rest of the resources. It then failed when attempting to mount the filesystem for mount2, the p_fs_mount2 resource. I attempted to mount the filesystem myself and was successful. I then unmounted it and ran cleanup on p_fs_mount2 and then it mounted. The rest of the resources started as expected until the p_exportfs_mount2 resource, which failed as follows:
p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2 (unmanaged) FAILED


I ran cleanup on this and it started, however when running this test earlier today no command could successfully start this exportfs resource.


How can I configure pacemaker to better resolve these problems and be able to bring the node up successfully on its own? What can I check to determine why these failures are occuring? /var/log/syslog did not seem to contain very much useful information regarding why the failures occurred.


Thanks,


Andrew
Attachments: commands.log (3.53 KB)
  cleanup-ms_drbd_mount1.log (28.2 KB)
  cleanup-p_fs_mount2.log (36.4 KB)
  cleanup-p_exportfs_mount2.log (19.1 KB)


dejanmm at fastmail

Jan 26, 2012, 5:01 AM

Post #2 of 16 (1855 views)
Permalink
Re: Nodes will not promote DRBD resources to master on failover [In reply to]

Hi,

On Wed, Jan 25, 2012 at 01:58:44PM -0600, Andrew Martin wrote:
> Hello,
>
>
> Recently I finished configuring a two-node cluster with pacemaker 1.1.6 and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster includes the following resources:
> - primitives for DRBD storage devices
> - primitives for mounting the filesystem on the DRBD storage
> - primitives for some mount binds
> - primitive for starting apache
> - primitives for starting samba and nfs servers (following instructions here )
> - primitives for exporting nfs shares (ocf:heartbeat:exportfs)
>
>
> Perhaps this is best described through the output of crm_mon:
>
> Online: [ node1 node2 ]
>
>
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
> p_drbd_mount1:0 (ocf::linbit:drbd): Started node2 (unmanaged)
> p_drbd_mount1:1 (ocf::linbit:drbd): Started node1 (unmanaged) FAILED
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> p_drbd_mount2:0 (ocf::linbit:drbd): Master node1 (unmanaged) FAILED
> Slaves: [ node2 ]
> Resource Group: g_core
> p_fs_mount1 (ocf::heartbeat:Filesystem): Started node1
> p_fs_mount2 (ocf::heartbeat:Filesystem): Started node1
> p_ip_nfs (ocf::heartbeat:IPaddr2): Started node1
> Resource Group: g_apache
> p_fs_mountbind1 (ocf::heartbeat:Filesystem): Started node1
> p_fs_mountbind2 (ocf::heartbeat:Filesystem): Started node1
> p_fs_mountbind3 (ocf::heartbeat:Filesystem): Started node1
> p_fs_varwww (ocf::heartbeat:Filesystem): Started node1
> p_apache (ocf::heartbeat:apache): Started node1
> Resource Group: g_fileservers
> p_lsb_smb (lsb:smbd): Started node1
> p_lsb_nmb (lsb:nmbd): Started node1
> p_lsb_nfsserver (lsb:nfs-kernel-server): Started node1
> p_exportfs_mount1 (ocf::heartbeat:exportfs): Started node1
> p_exportfs_mount2 (ocf::heartbeat:exportfs): Started node1
>
>
> I have read through the Pacemaker Explained documentation, however could not find a way to further debug these problems. First, I put node1 into standby mode to attempt failover to the other node (node2). Node2 appeared to start the transition to master, however it failed to promote the DRBD resources to master (the first step). I have attached a copy of this session in commands.log and additional excerpts from /var/log/syslog during important steps. I have attempted everything I can think of to try and start the DRBD resource (e.g. start/stop/promote/manage/cleanup under crm resource , restarting heartbeat) but cannot bring it out of the slave state. However, if I set it to unmanaged and then run drbdadm primary all in the terminal, pacemaker is satisfied and continues starting the rest of the resources. It then failed when attempting to mount the filesystem for mount2, the p_fs_mount2 resource. I attempted to mount the filesystem myself and was successful. I then unmounted it and ran cleanup on p_fs_mount2 and then it mounted. The rest of the resources started as expected until the p_exportfs_mount2 resource, which failed as follows:
> p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2 (unmanaged) FAILED
>
>
> I ran cleanup on this and it started, however when running this test earlier today no command could successfully start this exportfs resource.
>
>
> How can I configure pacemaker to better resolve these problems and be able to bring the node up successfully on its own? What can I check to determine why these failures are occuring? /var/log/syslog did not seem to contain very much useful information regarding why the failures occurred.

All of the issues you encountered seemed to be specific to
resources you want to run. Best to stop the cluster and use
ocf-tester to see where could be a problem. If the problems are
intermittent, then you'll need to dig through the logs---look for
messages coming from lrmd and the RA itself.

Thanks,

Dejan


>
> Thanks,
>
>
> Andrew
>
>
>

> Node node1 (d6fbcffe-ff23-4031-b108-428eabe19b7b): standby
> Online: [ node2 ]
>
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
> Slaves: [ node2 ]
> Stopped: [ p_drbd_mount1:1 ]
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> Slaves: [ node2 ]
> Stopped: [ p_drbd_mount2:1 ]
>
> Failed actions:
> p_fs_mount2_start_0 (node=node2, call=78, rc=1, status=complete): unknown error
> p_drbd_mount2:1_monitor_30000 (node=node2, call=86, rc=7, status=complete): not running
>
>
> # mount2 resource cleanup ms_drbd_mount1
> ---> see cleanup-ms_drbd_mount1.log
>
> # mount2 resource unmanage p_drbd_mount1
> # mount2_mon -1
> Node node1 (d6fbcffe-ff23-4031-b108-428eabe19b7b): standby
> Online: [ node2 ]
>
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
> p_drbd_mount1:0 (ocf::linbit:drbd): Master node2 (unmanaged)
> Stopped: [ p_drbd_mount1:1 ]
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> Slaves: [ node2 ]
> Stopped: [ p_drbd_mount2:1 ]
>
> Failed actions:
> p_fs_mount2_start_0 (node=node2, call=78, rc=1, status=complete): unknown error
> p_drbd_mount2:1_monitor_30000 (node=node2, call=86, rc=7, status=complete): not running
>
>
> # mount2 resource unmanage p_drbd_mount2
> # drbdadm primary all
> # mount2_mon -1
> Node node1 (d6fbcffe-ff23-4031-b108-428eabe19b7b): standby
> Online: [ node2 ]
>
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
> p_drbd_mount1:0 (ocf::linbit:drbd): Master node2 (unmanaged)
> Stopped: [ p_drbd_mount1:1 ]
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2] (unmanaged)
> p_drbd_mount2:0 (ocf::linbit:drbd): Master node2 (unmanaged)
> Stopped: [ p_drbd_mount2:1 ]
> Resource Group: g_core
> p_fs_mount1 (ocf::heartbeat:Filesystem): Started node2
> p_fs_mount2 (ocf::heartbeat:Filesystem): Stopped
> p_ip_nfs (ocf::heartbeat:IPaddr2): Stopped
>
> Failed actions:
> p_fs_mount2_start_0 (node=node2, call=78, rc=1, status=complete): unknown error
> p_drbd_mount2:1_monitor_30000 (node=node2, call=86, rc=7, status=complete): not running
>
> # mount2 resource cleanup p_fs_mount2
> ---> see cleanup-p_fs_mount2.log
> # mount2_mon -1
> Node node1 (d6fbcffe-ff23-4031-b108-428eabe19b7b): standby
> Online: [ node2 ]
>
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
> p_drbd_mount1:0 (ocf::linbit:drbd): Master node2 (unmanaged)
> Stopped: [ p_drbd_mount1:1 ]
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2] (unmanaged)
> p_drbd_mount2:0 (ocf::linbit:drbd): Master node2 (unmanaged)
> Stopped: [ p_drbd_mount2:1 ]
> Resource Group: g_core
> p_fs_mount1 (ocf::heartbeat:Filesystem): Started node2
> p_fs_mount2 (ocf::heartbeat:Filesystem): Started node2
> p_ip_nfs (ocf::heartbeat:IPaddr2): Started node2
> Resource Group: g_apache
> p_fs_varsupport (ocf::heartbeat:Filesystem): Started node2
> p_fs_webdataeco (ocf::heartbeat:Filesystem): Started node2
> p_fs_webdatactl (ocf::heartbeat:Filesystem): Started node2
> p_fs_varwww (ocf::heartbeat:Filesystem): Started node2
> p_apache (ocf::heartbeat:apache): Started node2
> Resource Group: g_fileservers
> p_lsb_smb (lsb:smbd): Started node2
> p_lsb_nmb (lsb:nmbd): Started node2
> p_lsb_nfsserver (lsb:nfs-kernel-server): Started node2
> p_exportfs_mount1 (ocf::heartbeat:exportfs): Started node2
> p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2 (unmanaged) FAILED
>
> # mount2 resource cleanup p_exportfs_mount2
> ---> see cleanup-p_exportfs_mount2.log
>
>

> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node2']//lrm_resource[@id='p_drbd_mount1:0'] (origin=local/crmd/212, version=1.186.4): ok (rc=0)
> crmd: [6852]: info: delete_resource: Removing resource p_drbd_mount1:0 for 29626_mount2_resource (internal) on node2
> crmd: [6852]: info: lrm_remove_deleted_op: Removing op p_drbd_mount1:0_monitor_30000:151 for deleted resource p_drbd_mount1:0
> crmd: [6852]: info: notify_deleted: Notifying 29626_mount2_resource on node2 that p_drbd_mount1:0 was deleted
> crmd: [6852]: WARN: decode_transition_key: Bad UUID (mount2-resource-29626) in sscanf result (3) for 0:0:mount2-resource-29626
> crmd: [6852]: info: send_direct_ack: ACK'ing resource op p_drbd_mount1:0_delete_60000 from 0:0:mount2-resource-29626: lrm_invoke-lrmd-1327518060-185
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node2']//lrm_resource[@id='p_drbd_mount1:0'] (origin=local/crmd/213, version=1.186.5): ok (rc=0)
> crmd: [6852]: notice: do_lrm_invoke: Not creating resource for a delete event: (null)
> crmd: [6852]: info: notify_deleted: Notifying 29626_mount2_resource on node2 that p_drbd_mount1:1 was deleted
> crmd: [6852]: WARN: decode_transition_key: Bad UUID (mount2-resource-29626) in sscanf result (3) for 0:0:mount2-resource-29626
> crmd: [6852]: info: send_direct_ack: ACK'ing resource op p_drbd_mount1:1_delete_60000 from 0:0:mount2-resource-29626: lrm_invoke-lrmd-1327518060-186
> cib: [6848]: info: cib:diff: - <cib admin_epoch="1" epoch="186" num_updates="5" >
> cib: [6848]: info: cib:diff: - <configuration >
> cib: [6848]: info: cib:diff: - <mount2_config >
> cib: [6848]: info: cib:diff: - <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: - <nvpair value="1327518038" id="cib-bootstrap-options-last-lrm-refresh" />
> cib: [6848]: info: cib:diff: - </cluster_property_set>
> cib: [6848]: info: cib:diff: - </mount2_config>
> cib: [6848]: info: cib:diff: - </configuration>
> cib: [6848]: info: cib:diff: - </cib>
> cib: [6848]: info: cib:diff: + <cib epoch="187" num_updates="1" admin_epoch="1" validate-with="pacemaker-1.2" mount2_feature_set="3.0.5" update-origin="node1" update-client="crmd" cib-last-written="Wed Jan 25 13:00:39 2012" have-quorum="1" dc-uuid="039e53da-dce8-4fd7-84bc-7261682529e8" >
> cib: [6848]: info: cib:diff: + <configuration >
> cib: [6848]: info: cib:diff: + <mount2_config >
> cib: [6848]: info: cib:diff: + <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: + <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1327518060" />
> cib: [6848]: info: cib:diff: + </cluster_property_set>
> cib: [6848]: info: cib:diff: + </mount2_config>
> cib: [6848]: info: cib:diff: + </configuration>
> cib: [6848]: info: cib:diff: + </cib>
> cib: [6848]: info: cib_process_request: Operation complete: op cib_modify for section mount2_config (origin=local/crmd/215, version=1.187.1): ok (rc=0)
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node2']//lrm_resource[@id='p_drbd_mount1:1'] (origin=local/crmd/216, version=1.187.2): ok (rc=0)
> crmd: [6852]: WARN: decode_transition_key: Bad UUID (mount2-resource-29626) in sscanf result (3) for 0:0:mount2-resource-29626
> crmd: [6852]: info: send_direct_ack: ACK'ing resource op p_drbd_mount1:1_delete_60000 from 0:0:mount2-resource-29626: lrm_invoke-lrmd-1327518060-187
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=p_drbd_mount1:0_last_failure_0, magic=0:0;6:10:7:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.186.5) : Resource op removal
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=p_drbd_mount1:0_last_failure_0, magic=0:0;6:10:7:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.186.5) : Resource op removal
> crmd: [6852]: info: abort_transition_graph: te_update_diff:124 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=1.187.1) : Non-status change
> crmd: [6852]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> crmd: [6852]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> crmd: [6852]: info: do_pe_invoke: Query 219: Requesting the current CIB: S_POLICY_ENGINE
> crmd: [6852]: info: do_pe_invoke: Query 220: Requesting the current CIB: S_POLICY_ENGINE
> crmd: [6852]: info: do_pe_invoke: Query 221: Requesting the current CIB: S_POLICY_ENGINE
> cib: [6848]: info: cib_process_request: Operation complete: op cib_modify for section mount2_config (origin=local/crmd/218, version=1.187.3): ok (rc=0)
> crmd: [6852]: info: do_pe_invoke_callback: Invoking the PE: query=221, ref=pe_calc-dc-1327518060-188, seq=4, quorate=1
> lrmd: [6849]: info: rsc:p_drbd_mount1:0 probe[152] (pid 29627)
> crmd: [6852]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> pengine: [29654]: notice: unpack_config: On loss of CCM Quorum: Ignore
> crmd: [6852]: info: unpack_graph: Unpacked transition 14: 12 actions in 12 synapses
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node1
> crmd: [6852]: info: do_te_invoke: Processing graph 14 (ref=pe_calc-dc-1327518060-188) derived from /var/lib/pengine/pe-input-996.bz2
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node1
> crmd: [6852]: info: te_rsc_command: Initiating action 6: monitor p_drbd_mount1:0_monitor_0 on node2 (local)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node2
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=6:14:7:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_drbd_mount1:0_monitor_0 )
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_fs_mount2_last_failure_0 on node2: unknown error (1)
> crmd: [6852]: info: te_pseudo_action: Pseudo action 11 fired and confirmed
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node2
> crmd: [6852]: info: te_pseudo_action: Pseudo action 12 fired and confirmed
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_drbd_mount2:1_monitor_30000 on node2: not running (7)
> crmd: [6852]: info: te_pseudo_action: Pseudo action 9 fired and confirmed
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount2 can fail 999998 more times on node2 before being forced off
> pengine: last message repeated 2 times
> pengine: [29654]: WARN: common_apply_stickiness: Forcing p_fs_mount2 away from node2 after 1000000 failures (max=1000000)
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_drbd_mount1:0 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_drbd_mount1:0 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_varsupport on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind2 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind3 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_varwww on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (40s) for p_apache on node2
> pengine: [29654]: notice: LogActions: Start p_drbd_mount1:0#011(node2)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:0#011(Slave node2)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount2#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_ip_nfs#011(Stopped)
> pengine: [29654]: notice: LogActions: Start p_fs_varsupport#011(node2)
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind2#011(node2)
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind3#011(node2)
> pengine: [29654]: notice: LogActions: Start p_fs_varwww#011(node2)
> pengine: [29654]: notice: LogActions: Start p_apache#011(node2)
> pengine: [29654]: notice: LogActions: Leave p_lsb_smb#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_lsb_nmb#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_lsb_nfsserver#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount2#011(Stopped)
> crmd: [6852]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
> crmd: [6852]: info: config_query_callback: Checking for expired actions every 900000ms
> lrmd: [6849]: info: operation monitor[152] on p_drbd_mount1:0 for client 6852: pid 29627 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_drbd_mount1:0_monitor_0 (call=152, rc=0, cib-update=223, confirmed=true) ok
> crmd: [6852]: WARN: status_from_rc: Action 6 (p_drbd_mount1:0_monitor_0) on node2 failed (target: 7 vs. rc: 0): Error
> crmd: [6852]: info: abort_transition_graph: match_graph_event:277 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=p_drbd_mount1:0_last_failure_0, magic=0:0;6:14:7:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.187.4) : Event failed
> crmd: [6852]: info: update_abort_priority: Abort priority upgraded from 0 to 1
> crmd: [6852]: info: update_abort_priority: Abort action done superceeded by restart
> crmd: [6852]: info: match_graph_event: Action p_drbd_mount1:0_monitor_0 (6) confirmed on node2 (rc=4)
> crmd: [6852]: info: te_rsc_command: Initiating action 5: probe_complete probe_complete on node2 (local) - no waiting
> crmd: [6852]: info: run_graph: ====================================================
> crmd: [6852]: notice: run_graph: Transition 14 (Complete=5, Pending=0, Fired=0, Skipped=3, Incomplete=4, Source=/var/lib/pengine/pe-input-996.bz2): Stopped
> crmd: [6852]: info: te_graph_trigger: Transition 14 is now complete
> crmd: [6852]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
> crmd: [6852]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> crmd: [6852]: info: do_pe_invoke: Query 224: Requesting the current CIB: S_POLICY_ENGINE
> crmd: [6852]: info: do_pe_invoke_callback: Invoking the PE: query=224, ref=pe_calc-dc-1327518060-191, seq=4, quorate=1
> pengine: [29654]: notice: process_pe_message: Transition 14: PEngine Input stored in: /var/lib/pengine/pe-input-996.bz2
> pengine: [29654]: notice: unpack_config: On loss of CCM Quorum: Ignore
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node2
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_fs_mount2_last_failure_0 on node2: unknown error (1)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node2
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_drbd_mount2:1_monitor_30000 on node2: not running (7)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_drbd_mount1:0_last_failure_0 found resource p_drbd_mount1:0 active on node2
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount2 can fail 999998 more times on node2 before being forced off
> pengine: last message repeated 2 times
> pengine: [29654]: WARN: common_apply_stickiness: Forcing p_fs_mount2 away from node2 after 1000000 failures (max=1000000)
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_drbd_mount1:0 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_drbd_mount1:0 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_varsupport on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind2 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind3 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_varwww on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (40s) for p_apache on node2
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:0#011(Slave node2)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:0#011(Slave node2)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount2#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_ip_nfs#011(Stopped)
> crmd: [6852]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> pengine: [29654]: notice: LogActions: Start p_fs_varsupport#011(node2)
> crmd: [6852]: info: unpack_graph: Unpacked transition 15: 1 actions in 1 synapses
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind2#011(node2)
> crmd: [6852]: info: do_te_invoke: Processing graph 15 (ref=pe_calc-dc-1327518060-191) derived from /var/lib/pengine/pe-input-997.bz2
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind3#011(node2)
> lrmd: [6849]: info: rsc:p_drbd_mount1:0 monitor[153] (pid 29656)
> crmd: [6852]: info: te_rsc_command: Initiating action 9: monitor p_drbd_mount1:0_monitor_30000 on node2 (local)
> pengine: [29654]: notice: LogActions: Start p_fs_varwww#011(node2)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=9:15:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_drbd_mount1:0_monitor_30000 )
> pengine: [29654]: notice: LogActions: Start p_apache#011(node2)
> pengine: [29654]: notice: LogActions: Leave p_lsb_smb#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_lsb_nmb#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_lsb_nfsserver#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount2#011(Stopped)
> heartbeat: [6835]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status took too long to execute: 60 ms (> 50 ms) (GSource: 0x225f400)
> pengine: [29654]: notice: process_pe_message: Transition 15: PEngine Input stored in: /var/lib/pengine/pe-input-997.bz2
> lrmd: [6849]: info: operation monitor[153] on p_drbd_mount1:0 for client 6852: pid 29656 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_drbd_mount1:0_monitor_30000 (call=153, rc=0, cib-update=225, confirmed=false) ok
> crmd: [6852]: info: match_graph_event: Action p_drbd_mount1:0_monitor_30000 (9) confirmed on node2 (rc=0)
> crmd: [6852]: info: run_graph: ====================================================
> crmd: [6852]: notice: run_graph: Transition 15 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-997.bz2): Complete
> crmd: [6852]: info: te_graph_trigger: Transition 15 is now complete
> crmd: [6852]: info: notify_crmd: Transition 15 status: done - <null>
> crmd: [6852]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> crmd: [6852]: info: do_state_transition: Starting PEngine Recheck Timer
> cib: [6848]: info: cib_process_request: Operation complete: op cib_sync_one for section 'all' (origin=node1/node1/(null), version=1.187.5): ok (rc=0)
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node1']//lrm_resource[@id='p_drbd_mount1:0'] (origin=node1/crmd/34, version=1.187.5): ok (rc=0)
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=p_drbd_mount1:0_last_0, magic=0:7;6:13:7:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.187.6) : Resource op removal
> crmd: [6852]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> crmd: [6852]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> crmd: [6852]: info: do_pe_invoke: Query 226: Requesting the current CIB: S_POLICY_ENGINE
> pengine: [29654]: notice: unpack_config: On loss of CCM Quorum: Ignore
> crmd: [6852]: info: do_pe_invoke_callback: Invoking the PE: query=226, ref=pe_calc-dc-1327518061-193, seq=4, quorate=1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node2
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_fs_mount2_last_failure_0 on node2: unknown error (1)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node2
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_drbd_mount2:1_monitor_30000 on node2: not running (7)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_drbd_mount1:0_last_failure_0 found resource p_drbd_mount1:0 active on node2
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount2 can fail 999998 more times on node2 before being forced off
> pengine: last message repeated 2 times
> pengine: [29654]: WARN: common_apply_stickiness: Forcing p_fs_mount2 away from node2 after 1000000 failures (max=1000000)
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_varsupport on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind2 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind3 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_varwww on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (40s) for p_apache on node2
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:0#011(Slave node2)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:0#011(Slave node2)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount2#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_ip_nfs#011(Stopped)
> crmd: [6852]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> pengine: [29654]: notice: LogActions: Start p_fs_varsupport#011(node2)
> crmd: [6852]: info: unpack_graph: Unpacked transition 16: 0 actions in 0 synapses
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind2#011(node2)
> crmd: [6852]: info: do_te_invoke: Processing graph 16 (ref=pe_calc-dc-1327518061-193) derived from /var/lib/pengine/pe-input-998.bz2
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind3#011(node2)
> crmd: [6852]: info: run_graph: ====================================================
> pengine: [29654]: notice: LogActions: Start p_fs_varwww#011(node2)
> crmd: [6852]: notice: run_graph: Transition 16 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-998.bz2): Complete
> pengine: [29654]: notice: LogActions: Start p_apache#011(node2)
> crmd: [6852]: info: te_graph_trigger: Transition 16 is now complete
> pengine: [29654]: notice: LogActions: Leave p_lsb_smb#011(Stopped)
> crmd: [6852]: info: notify_crmd: Transition 16 status: done - <null>
> pengine: [29654]: notice: LogActions: Leave p_lsb_nmb#011(Stopped)
> crmd: [6852]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> pengine: [29654]: notice: LogActions: Leave p_lsb_nfsserver#011(Stopped)
> crmd: [6852]: info: do_state_transition: Starting PEngine Recheck Timer
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount2#011(Stopped)
> pengine: [29654]: notice: process_pe_message: Transition 16: PEngine Input stored in: /var/lib/pengine/pe-input-998.bz2
> cib: [6848]: info: cib_process_request: Operation complete: op cib_sync_one for section 'all' (origin=node1/node1/(null), version=1.187.5): ok (rc=0)
> cib: [6848]: info: cib_process_request: Operation complete: op cib_sync_one for section 'all' (origin=node1/node1/(null), version=1.187.5): ok (rc=0)
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=p_drbd_mount1:0_last_0, magic=0:7;6:13:7:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.187.7) : Resource op removal
> crmd: [6852]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> crmd: [6852]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> crmd: [6852]: info: do_pe_invoke: Query 227: Requesting the current CIB: S_POLICY_ENGINE
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node1']//lrm_resource[@id='p_drbd_mount1:0'] (origin=node1/crmd/35, version=1.187.7): ok (rc=0)
> crmd: [6852]: info: abort_transition_graph: te_update_diff:124 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=1.188.1) : Non-status change
> cib: [6848]: info: cib:diff: - <cib admin_epoch="1" epoch="187" num_updates="7" >
> crmd: [6852]: info: do_pe_invoke: Query 228: Requesting the current CIB: S_POLICY_ENGINE
> cib: [6848]: info: cib:diff: - <configuration >
> cib: [6848]: info: cib:diff: - <mount2_config >
> cib: [6848]: info: cib:diff: - <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: - <nvpair value="1327518060" id="cib-bootstrap-options-last-lrm-refresh" />
> cib: [6848]: info: cib:diff: - </cluster_property_set>
> cib: [6848]: info: cib:diff: - </mount2_config>
> cib: [6848]: info: cib:diff: - </configuration>
> cib: [6848]: info: cib:diff: - </cib>
> cib: [6848]: info: cib:diff: + <cib epoch="188" num_updates="1" admin_epoch="1" validate-with="pacemaker-1.2" mount2_feature_set="3.0.5" update-origin="node2" update-client="crmd" cib-last-written="Wed Jan 25 13:01:00 2012" have-quorum="1" dc-uuid="039e53da-dce8-4fd7-84bc-7261682529e8" >
> cib: [6848]: info: cib:diff: + <configuration >
> cib: [6848]: info: cib:diff: + <mount2_config >
> cib: [6848]: info: cib:diff: + <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: + <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1327518061" />
> cib: [6848]: info: cib:diff: + </cluster_property_set>
> cib: [6848]: info: cib:diff: + </mount2_config>
> cib: [6848]: info: cib:diff: + </configuration>
> cib: [6848]: info: cib:diff: + </cib>
> cib: [6848]: info: cib_process_request: Operation complete: op cib_modify for section mount2_config (origin=node1/crmd/37, version=1.188.1): ok (rc=0)
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node1']//lrm_resource[@id='p_drbd_mount1:1'] (origin=node1/crmd/38, version=1.188.2): ok (rc=0)
> cib: [6848]: info: cib_process_request: Operation complete: op cib_modify for section mount2_config (origin=node1/crmd/40, version=1.188.3): ok (rc=0)
> crmd: [6852]: info: do_pe_invoke_callback: Invoking the PE: query=228, ref=pe_calc-dc-1327518062-194, seq=4, quorate=1
> crmd: [6852]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
> crmd: [6852]: info: config_query_callback: Checking for expired actions every 900000ms
> pengine: [29654]: notice: unpack_config: On loss of CCM Quorum: Ignore
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node2
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_fs_mount2_last_failure_0 on node2: unknown error (1)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node2
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_drbd_mount2:1_monitor_30000 on node2: not running (7)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_drbd_mount1:0_last_failure_0 found resource p_drbd_mount1:0 active on node2
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount2 can fail 999998 more times on node2 before being forced off
> pengine: last message repeated 2 times
> pengine: [29654]: WARN: common_apply_stickiness: Forcing p_fs_mount2 away from node2 after 1000000 failures (max=1000000)
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_varsupport on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind2 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind3 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_varwww on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (40s) for p_apache on node2
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:0#011(Slave node2)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:0#011(Slave node2)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount2#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_ip_nfs#011(Stopped)
> pengine: [29654]: notice: LogActions: Start p_fs_varsupport#011(node2)
> crmd: [6852]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind2#011(node2)
> crmd: [6852]: info: unpack_graph: Unpacked transition 17: 3 actions in 3 synapses
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind3#011(node2)
> crmd: [6852]: info: do_te_invoke: Processing graph 17 (ref=pe_calc-dc-1327518062-194) derived from /var/lib/pengine/pe-input-999.bz2
> pengine: [29654]: notice: LogActions: Start p_fs_varwww#011(node2)
> crmd: [6852]: info: te_rsc_command: Initiating action 6: monitor p_drbd_mount1:0_monitor_0 on node1
> pengine: [29654]: notice: LogActions: Start p_apache#011(node2)
> pengine: [29654]: notice: LogActions: Leave p_lsb_smb#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_lsb_nmb#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_lsb_nfsserver#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount2#011(Stopped)
> pengine: [29654]: notice: process_pe_message: Transition 17: PEngine Input stored in: /var/lib/pengine/pe-input-999.bz2
> crmd: [6852]: info: match_graph_event: Action p_drbd_mount1:0_monitor_0 (6) confirmed on node1 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 5: probe_complete probe_complete on node1 - no waiting
> crmd: [6852]: info: te_pseudo_action: Pseudo action 4 fired and confirmed
> crmd: [6852]: info: run_graph: ====================================================
> crmd: [6852]: notice: run_graph: Transition 17 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-999.bz2): Complete
> crmd: [6852]: info: te_graph_trigger: Transition 17 is now complete
> crmd: [6852]: info: notify_crmd: Transition 17 status: done - <null>
> crmd: [6852]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> crmd: [6852]: info: do_state_transition: Starting PEngine Recheck Timer

> attrd: [6851]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-p_fs_mount2 (<null>)
> attrd: [6851]: notice: attrd_perform_update: Sent delete 631: node=039e53da-dce8-4fd7-84bc-7261682529e8, attr=fail-count-p_fs_mount2, id=<n/a>, set=(null), section=status
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node2']//lrm_resource[@id='p_fs_mount2'] (origin=local/crmd/249, version=1.191.2): ok (rc=0)
> crmd: [6852]: info: delete_resource: Removing resource p_fs_mount2 for 1045_mount2_resource (internal) on node2
> crmd: [6852]: info: notify_deleted: Notifying 1045_mount2_resource on node2 that p_fs_mount2 was deleted
> crmd: [6852]: WARN: decode_transition_key: Bad UUID (mount2-resource-1045) in sscanf result (3) for 0:0:mount2-resource-1045
> crmd: [6852]: info: send_direct_ack: ACK'ing resource op p_fs_mount2_delete_60000 from 0:0:mount2-resource-1045: lrm_invoke-lrmd-1327519062-213
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node2']//lrm_resource[@id='p_fs_mount2'] (origin=local/crmd/250, version=1.191.3): ok (rc=0)
> crmd: [6852]: info: abort_transition_graph: te_update_diff:178 - Triggered transition abort (complete=1, tag=transient_attributes, id=039e53da-dce8-4fd7-84bc-7261682529e8, magic=NA, cib=1.191.2) : Transient attribute: removal
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=p_fs_mount2_last_0, magic=0:0;4:66:0:7c970883-1991-4dfe-bcef-ea0b59dff8ed, cib=1.191.3) : Resource op removal
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=p_fs_mount2_last_0, magic=0:0;4:66:0:7c970883-1991-4dfe-bcef-ea0b59dff8ed, cib=1.191.3) : Resource op removal
> crmd: [6852]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> crmd: [6852]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> crmd: [6852]: info: do_pe_invoke: Query 253: Requesting the current CIB: S_POLICY_ENGINE
> crmd: [6852]: info: do_pe_invoke: Query 254: Requesting the current CIB: S_POLICY_ENGINE
> crmd: [6852]: info: do_pe_invoke: Query 255: Requesting the current CIB: S_POLICY_ENGINE
> cib: [6848]: info: cib:diff: - <cib admin_epoch="1" epoch="191" num_updates="3" >
> crmd: [6852]: info: abort_transition_graph: te_update_diff:124 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=1.192.1) : Non-status change
> cib: [6848]: info: cib:diff: - <configuration >
> crmd: [6852]: info: do_pe_invoke: Query 256: Requesting the current CIB: S_POLICY_ENGINE
> cib: [6848]: info: cib:diff: - <mount2_config >
> cib: [6848]: info: cib:diff: - <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: - <nvpair value="1327518061" id="cib-bootstrap-options-last-lrm-refresh" />
> cib: [6848]: info: cib:diff: - </cluster_property_set>
> cib: [6848]: info: cib:diff: - </mount2_config>
> cib: [6848]: info: cib:diff: - </configuration>
> cib: [6848]: info: cib:diff: - </cib>
> cib: [6848]: info: cib:diff: + <cib epoch="192" num_updates="1" admin_epoch="1" validate-with="pacemaker-1.2" mount2_feature_set="3.0.5" update-origin="node2" update-client="cibadmin" cib-last-written="Wed Jan 25 13:17:01 2012" have-quorum="1" dc-uuid="039e53da-dce8-4fd7-84bc-7261682529e8" >
> cib: [6848]: info: cib:diff: + <configuration >
> cib: [6848]: info: cib:diff: + <mount2_config >
> cib: [6848]: info: cib:diff: + <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: + <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1327519062" />
> cib: [6848]: info: cib:diff: + </cluster_property_set>
> cib: [6848]: info: cib:diff: + </mount2_config>
> cib: [6848]: info: cib:diff: + </configuration>
> cib: [6848]: info: cib:diff: + </cib>
> cib: [6848]: info: cib_process_request: Operation complete: op cib_modify for section mount2_config (origin=local/crmd/252, version=1.192.1): ok (rc=0)
> crmd: [6852]: info: do_pe_invoke_callback: Invoking the PE: query=256, ref=pe_calc-dc-1327519062-214, seq=4, quorate=1
> crmd: [6852]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
> crmd: [6852]: info: config_query_callback: Checking for expired actions every 900000ms
> pengine: [29654]: notice: unpack_config: On loss of CCM Quorum: Ignore
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node2
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node2
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_drbd_mount2:1_monitor_30000 on node2: not running (7)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_drbd_mount1:0_last_failure_0 found resource p_drbd_mount1:0 active on node2
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount1 can fail 999999 more times on node2 before being forced off
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount1 can fail 999999 more times on node2 before being forced off
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount2 can fail 999997 more times on node2 before being forced off
> pengine: last message repeated 2 times
> pengine: [29654]: notice: can_be_master: Forcing unmanaged master p_drbd_mount2:0 to remain promoted on node2
> pengine: [29654]: notice: can_be_master: Forcing unmanaged master p_drbd_mount1:0 to remain promoted on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mount2 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_ip_nfs on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind1 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind2 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind3 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_varwww on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (40s) for p_apache on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_lsb_smb on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_lsb_nmb on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_lsb_nfsserver on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_exportfs_mount1 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_exportfs_mount2 on node2
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:0#011(Master unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:1#011(Stopped unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:0#011(Master unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:1#011(Stopped unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount1#011(Started node2)
> pengine: [29654]: notice: LogActions: Start p_fs_mount2#011(node2)
> pengine: [29654]: notice: LogActions: Start p_ip_nfs#011(node2)
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind1#011(node2)
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind2#011(node2)
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind3#011(node2)
> pengine: [29654]: notice: LogActions: Start p_fs_varwww#011(node2)
> pengine: [29654]: notice: LogActions: Start p_apache#011(node2)
> pengine: [29654]: notice: LogActions: Start p_lsb_smb#011(node2)
> pengine: [29654]: notice: LogActions: Start p_lsb_nmb#011(node2)
> pengine: [29654]: notice: LogActions: Start p_lsb_nfsserver#011(node2)
> pengine: [29654]: notice: LogActions: Start p_exportfs_mount1#011(node2)
> pengine: [29654]: notice: LogActions: Start p_exportfs_mount2#011(node2)
> crmd: [6852]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> crmd: [6852]: info: unpack_graph: Unpacked transition 25: 33 actions in 33 synapses
> crmd: [6852]: info: do_te_invoke: Processing graph 25 (ref=pe_calc-dc-1327519062-214) derived from /var/lib/pengine/pe-input-1007.bz2
> crmd: [6852]: info: te_pseudo_action: Pseudo action 74 fired and confirmed
> crmd: [6852]: info: te_rsc_command: Initiating action 8: monitor p_fs_mount2_monitor_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=8:25:7:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_mount2_monitor_0 )
> lrmd: [6849]: info: rsc:p_fs_mount2 probe[158] (pid 1047)
> lrmd: [6849]: info: operation monitor[158] on p_fs_mount2 for client 6852: pid 1047 exited with return code 7
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_mount2_monitor_0 (call=158, rc=7, cib-update=258, confirmed=true) not running
> crmd: [6852]: info: match_graph_event: Action p_fs_mount2_monitor_0 (8) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 7: probe_complete probe_complete on node2 (local) - no waiting
> crmd: [6852]: info: te_pseudo_action: Pseudo action 5 fired and confirmed
> crmd: [6852]: info: te_rsc_command: Initiating action 70: start p_fs_mount2_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=70:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_mount2_start_0 )
> lrmd: [6849]: info: rsc:p_fs_mount2 start[159] (pid 1079)
> pengine: [29654]: notice: process_pe_message: Transition 25: PEngine Input stored in: /var/lib/pengine/pe-input-1007.bz2
> Filesystem[1079]: [1108]: INFO: Running start for /dev/drbd1 on /mnt/storage/Mount1
> lrmd: [6849]: info: RA output: (p_fs_mount2:start:stderr) FATAL: Module scsi_hostadapter not found.
> kernel: [433319.639211] kjournald starting. Commit interval 5 seconds
> kernel: [433319.639229] EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
> kernel: [433319.639705] EXT3 FS on drbd1, internal journal
> kernel: [433319.639716] EXT3-fs: mounted filesystem with ordered data mode.
> lrmd: [6849]: info: operation start[159] on p_fs_mount2 for client 6852: pid 1079 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_mount2_start_0 (call=159, rc=0, cib-update=259, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_fs_mount2_start_0 (70) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 71: monitor p_fs_mount2_monitor_10000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=71:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_mount2_monitor_10000 )
> lrmd: [6849]: info: rsc:p_fs_mount2 monitor[160] (pid 1133)
> crmd: [6852]: info: te_rsc_command: Initiating action 72: start p_ip_nfs_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=72:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_ip_nfs_start_0 )
> lrmd: [6849]: info: rsc:p_ip_nfs start[161] (pid 1134)
> lrmd: [6849]: info: operation monitor[160] on p_fs_mount2 for client 6852: pid 1133 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_mount2_monitor_10000 (call=160, rc=0, cib-update=260, confirmed=false) ok
> IPaddr2[1134]: [1197]: INFO: ip -f inet addr add 10.52.4.8/16 brd 10.52.255.255 dev eth0
> crmd: [6852]: info: match_graph_event: Action p_fs_mount2_monitor_10000 (71) confirmed on node2 (rc=0)
> IPaddr2[1134]: [1200]: INFO: ip link set eth0 up
> IPaddr2[1134]: [1203]: INFO: /usr/lib/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.52.4.8 eth0 10.52.4.8 auto not_used not_used
> lrmd: [6849]: info: operation start[161] on p_ip_nfs for client 6852: pid 1134 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_ip_nfs_start_0 (call=161, rc=0, cib-update=261, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_ip_nfs_start_0 (72) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_pseudo_action: Pseudo action 75 fired and confirmed
> crmd: [6852]: info: te_rsc_command: Initiating action 73: monitor p_ip_nfs_monitor_30000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=73:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_ip_nfs_monitor_30000 )
> lrmd: [6849]: info: rsc:p_ip_nfs monitor[162] (pid 1212)
> crmd: [6852]: info: te_pseudo_action: Pseudo action 88 fired and confirmed
> crmd: [6852]: info: te_rsc_command: Initiating action 78: start p_fs_mountbind1_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=78:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_mountbind1_start_0 )
> lrmd: [6849]: info: rsc:p_fs_mountbind1 start[163] (pid 1214)
> lrmd: [6849]: info: operation monitor[162] on p_ip_nfs for client 6852: pid 1212 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_ip_nfs_monitor_30000 (call=162, rc=0, cib-update=262, confirmed=false) ok
> crmd: [6852]: info: match_graph_event: Action p_ip_nfs_monitor_30000 (73) confirmed on node2 (rc=0)
> Filesystem[1214]: [1270]: INFO: Running start for /mnt/storage/Mount2/Support on /var/support
> lrmd: [6849]: info: RA output: (p_fs_mountbind1:start:stderr) FATAL: Module scsi_hostadapter not found.
> lrmd: [6849]: info: operation start[163] on p_fs_mountbind1 for client 6852: pid 1214 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_mountbind1_start_0 (call=163, rc=0, cib-update=263, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_fs_mountbind1_start_0 (78) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 79: monitor p_fs_mountbind1_monitor_10000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=79:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_mountbind1_monitor_10000 )
> lrmd: [6849]: info: rsc:p_fs_mountbind1 monitor[164] (pid 1286)
> crmd: [6852]: info: te_rsc_command: Initiating action 80: start p_fs_mountbind2_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=80:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_mountbind2_start_0 )
> lrmd: [6849]: info: rsc:p_fs_mountbind2 start[165] (pid 1287)
> lrmd: [6849]: info: operation monitor[164] on p_fs_mountbind1 for client 6852: pid 1286 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_mountbind1_monitor_10000 (call=164, rc=0, cib-update=264, confirmed=false) ok
> crmd: [6852]: info: match_graph_event: Action p_fs_mountbind1_monitor_10000 (79) confirmed on node2 (rc=0)
> Filesystem[1287]: [1347]: INFO: Running start for /mnt/storage/Mount2/Products/eco on /var/support/web/data/eco
> lrmd: [6849]: info: RA output: (p_fs_mountbind2:start:stderr) FATAL: Module scsi_hostadapter not found.
> lrmd: [6849]: info: operation start[165] on p_fs_mountbind2 for client 6852: pid 1287 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_mountbind2_start_0 (call=165, rc=0, cib-update=265, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_fs_mountbind2_start_0 (80) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 81: monitor p_fs_mountbind2_monitor_10000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=81:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_mountbind2_monitor_10000 )
> lrmd: [6849]: info: rsc:p_fs_mountbind2 monitor[166] (pid 1363)
> crmd: [6852]: info: te_rsc_command: Initiating action 82: start p_fs_mountbind3_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=82:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_mountbind3_start_0 )
> lrmd: [6849]: info: rsc:p_fs_mountbind3 start[167] (pid 1364)
> lrmd: [6849]: info: operation monitor[166] on p_fs_mountbind2 for client 6852: pid 1363 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_mountbind2_monitor_10000 (call=166, rc=0, cib-update=266, confirmed=false) ok
> crmd: [6852]: info: match_graph_event: Action p_fs_mountbind2_monitor_10000 (81) confirmed on node2 (rc=0)
> Filesystem[1364]: [1423]: INFO: Running start for /mnt/storage/Mount2/ctl on /var/support/web/data/ctl
> lrmd: [6849]: info: RA output: (p_fs_mountbind3:start:stderr) FATAL: Module scsi_hostadapter not found.
> lrmd: [6849]: info: operation start[167] on p_fs_mountbind3 for client 6852: pid 1364 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_mountbind3_start_0 (call=167, rc=0, cib-update=267, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_fs_mountbind3_start_0 (82) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 83: monitor p_fs_mountbind3_monitor_10000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=83:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_mountbind3_monitor_10000 )
> lrmd: [6849]: info: rsc:p_fs_mountbind3 monitor[168] (pid 1461)
> crmd: [6852]: info: te_rsc_command: Initiating action 84: start p_fs_varwww_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=84:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_varwww_start_0 )
> lrmd: [6849]: info: rsc:p_fs_varwww start[169] (pid 1463)
> lrmd: [6849]: info: operation monitor[168] on p_fs_mountbind3 for client 6852: pid 1461 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_mountbind3_monitor_10000 (call=168, rc=0, cib-update=268, confirmed=false) ok
> crmd: [6852]: info: match_graph_event: Action p_fs_mountbind3_monitor_10000 (83) confirmed on node2 (rc=0)
> Filesystem[1463]: [1534]: INFO: Running start for /mnt/storage/Mount2/var/www on /var/www
> lrmd: [6849]: info: RA output: (p_fs_varwww:start:stderr) FATAL: Module scsi_hostadapter not found.
> lrmd: [6849]: info: operation start[169] on p_fs_varwww for client 6852: pid 1463 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_varwww_start_0 (call=169, rc=0, cib-update=269, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_fs_varwww_start_0 (84) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 85: monitor p_fs_varwww_monitor_10000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=85:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_varwww_monitor_10000 )
> lrmd: [6849]: info: rsc:p_fs_varwww monitor[170] (pid 1562)
> crmd: [6852]: info: te_rsc_command: Initiating action 86: start p_apache_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=86:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_apache_start_0 )
> lrmd: [6849]: info: rsc:p_apache start[171] (pid 1566)
> lrmd: [6849]: info: operation monitor[170] on p_fs_varwww for client 6852: pid 1562 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_varwww_monitor_10000 (call=170, rc=0, cib-update=270, confirmed=false) ok
> crmd: [6852]: info: match_graph_event: Action p_fs_varwww_monitor_10000 (85) confirmed on node2 (rc=0)
> apache[1566]: [1643]: INFO: apache not running
> apache[1566]: [1646]: INFO: waiting for apache /etc/apache2/apache2.conf to come up
> cib: [6848]: info: cib_process_request: Operation complete: op cib_sync_one for section 'all' (origin=node1/node1/(null), version=1.192.14): ok (rc=0)
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node1']//lrm_resource[@id='p_fs_mount2'] (origin=node1/crmd/44, version=1.192.14): ok (rc=0)
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=p_fs_mount2_last_0, magic=0:7;9:8:7:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.192.15) : Resource op removal
> crmd: [6852]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
> crmd: [6852]: info: update_abort_priority: Abort action done superceeded by restart
> cib: [6848]: info: cib_process_request: Operation complete: op cib_sync_one for section 'all' (origin=node1/node1/(null), version=1.192.14): ok (rc=0)
> cib: last message repeated 2 times
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node1']//lrm_resource[@id='p_fs_mount2'] (origin=node1/crmd/45, version=1.192.17): ok (rc=0)
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=p_fs_mount2_last_0, magic=0:7;9:8:7:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.192.17) : Resource op removal
> crmd: [6852]: info: abort_transition_graph: te_update_diff:124 - Triggered transition abort (complete=0, tag=diff, id=(null), magic=NA, cib=1.193.1) : Non-status change
> cib: [6848]: info: cib:diff: - <cib admin_epoch="1" epoch="192" num_updates="17" >
> cib: [6848]: info: cib:diff: - <configuration >
> cib: [6848]: info: cib:diff: - <mount2_config >
> cib: [6848]: info: cib:diff: - <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: - <nvpair value="1327519062" id="cib-bootstrap-options-last-lrm-refresh" />
> cib: [6848]: info: cib:diff: - </cluster_property_set>
> cib: [6848]: info: cib:diff: - </mount2_config>
> cib: [6848]: info: cib:diff: - </configuration>
> cib: [6848]: info: cib:diff: - </cib>
> cib: [6848]: info: cib:diff: + <cib epoch="193" num_updates="1" admin_epoch="1" validate-with="pacemaker-1.2" mount2_feature_set="3.0.5" update-origin="node2" update-client="crmd" cib-last-written="Wed Jan 25 13:17:42 2012" have-quorum="1" dc-uuid="039e53da-dce8-4fd7-84bc-7261682529e8" >
> cib: [6848]: info: cib:diff: + <configuration >
> cib: [6848]: info: cib:diff: + <mount2_config >
> cib: [6848]: info: cib:diff: + <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: + <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1327519064" />
> cib: [6848]: info: cib:diff: + </cluster_property_set>
> cib: [6848]: info: cib:diff: + </mount2_config>
> cib: [6848]: info: cib:diff: + </configuration>
> cib: [6848]: info: cib:diff: + </cib>
> cib: [6848]: info: cib_process_request: Operation complete: op cib_modify for section mount2_config (origin=node1/crmd/47, version=1.193.1): ok (rc=0)
> crmd: [6852]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
> crmd: [6852]: info: config_query_callback: Checking for expired actions every 900000ms
> lrmd: [6849]: info: operation start[171] on p_apache for client 6852: pid 1566 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_apache_start_0 (call=171, rc=0, cib-update=272, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_apache_start_0 (86) confirmed on node2 (rc=0)
> crmd: [6852]: info: run_graph: ====================================================
> crmd: [6852]: notice: run_graph: Transition 25 (Complete=19, Pending=0, Fired=0, Skipped=14, Incomplete=0, Source=/var/lib/pengine/pe-input-1007.bz2): Stopped
> crmd: [6852]: info: te_graph_trigger: Transition 25 is now complete
> crmd: [6852]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
> crmd: [6852]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> crmd: [6852]: info: do_pe_invoke: Query 273: Requesting the current CIB: S_POLICY_ENGINE
> crmd: [6852]: info: do_pe_invoke_callback: Invoking the PE: query=273, ref=pe_calc-dc-1327519065-230, seq=4, quorate=1
> pengine: [29654]: notice: unpack_config: On loss of CCM Quorum: Ignore
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node2
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node2
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_drbd_mount2:1_monitor_30000 on node2: not running (7)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_drbd_mount1:0_last_failure_0 found resource p_drbd_mount1:0 active on node2
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount1 can fail 999999 more times on node2 before being forced off
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount1 can fail 999999 more times on node2 before being forced off
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount2 can fail 999997 more times on node2 before being forced off
> pengine: last message repeated 2 times
> pengine: [29654]: notice: can_be_master: Forcing unmanaged master p_drbd_mount2:0 to remain promoted on node2
> pengine: [29654]: notice: can_be_master: Forcing unmanaged master p_drbd_mount1:0 to remain promoted on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (40s) for p_apache on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_lsb_smb on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_lsb_nmb on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_lsb_nfsserver on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_exportfs_mount1 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_exportfs_mount2 on node2
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:0#011(Master unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:1#011(Stopped unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:0#011(Master unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:1#011(Stopped unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount1#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount2#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_ip_nfs#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mountbind1#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mountbind2#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mountbind3#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_varwww#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_apache#011(Started node2)
> pengine: [29654]: notice: LogActions: Start p_lsb_smb#011(node2)
> pengine: [29654]: notice: LogActions: Start p_lsb_nmb#011(node2)
> pengine: [29654]: notice: LogActions: Start p_lsb_nfsserver#011(node2)
> pengine: [29654]: notice: LogActions: Start p_exportfs_mount1#011(node2)
> pengine: [29654]: notice: LogActions: Start p_exportfs_mount2#011(node2)
> crmd: [6852]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> crmd: [6852]: info: unpack_graph: Unpacked transition 26: 16 actions in 16 synapses
> crmd: [6852]: info: do_te_invoke: Processing graph 26 (ref=pe_calc-dc-1327519065-230) derived from /var/lib/pengine/pe-input-1008.bz2
> crmd: [6852]: info: te_rsc_command: Initiating action 13: monitor p_fs_mount2_monitor_0 on node1
> crmd: [6852]: info: te_rsc_command: Initiating action 94: monitor p_apache_monitor_40000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=94:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_apache_monitor_40000 )
> lrmd: [6849]: info: rsc:p_apache monitor[172] (pid 1669)
> crmd: [6852]: info: te_pseudo_action: Pseudo action 109 fired and confirmed
> crmd: [6852]: info: te_rsc_command: Initiating action 99: start p_lsb_smb_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=99:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_lsb_smb_start_0 )
> lrmd: [6849]: info: rsc:p_lsb_smb start[173] (pid 1670)
> lrmd: [1670]: WARN: For LSB init script, no additional parameters are needed.
> lrmd: [6849]: info: RA output: (p_lsb_smb:start:stdout) Rather than invoking init scripts through /etc/init.d, use the service(8)
> lrmd: [6849]: info: RA output: (p_lsb_smb:start:stdout) utility, e.g. service smbd start
> lrmd: [6849]: info: RA output: (p_lsb_smb:start:stdout)
> lrmd: [6849]: info: RA output: (p_lsb_smb:start:stdout) Since the script you are attempting to invoke has been converted to an
> lrmd: [6849]: info: RA output: (p_lsb_smb:start:stdout) Upstart job, you may also use the start(8) utility, e.g. start smbd
> lrmd: [6849]: info: RA output: (p_lsb_smb:start:stdout) smbd start/running, process 1690
> lrmd: [6849]: info: operation start[173] on p_lsb_smb for client 6852: pid 1670 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_lsb_smb_start_0 (call=173, rc=0, cib-update=274, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_lsb_smb_start_0 (99) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 100: monitor p_lsb_smb_monitor_30000 on node2 (local)
> lrmd: [6849]: info: rsc:p_lsb_smb monitor[174] (pid 1697)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=100:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_lsb_smb_monitor_30000 )
> crmd: [6852]: info: te_rsc_command: Initiating action 101: start p_lsb_nmb_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=101:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_lsb_nmb_start_0 )
> lrmd: [6849]: info: rsc:p_lsb_nmb start[175] (pid 1699)
> lrmd: [1699]: WARN: For LSB init script, no additional parameters are needed.
> lrmd: [6849]: info: operation monitor[174] on p_lsb_smb for client 6852: pid 1697 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_lsb_smb_monitor_30000 (call=174, rc=0, cib-update=275, confirmed=false) ok
> crmd: [6852]: info: match_graph_event: Action p_lsb_smb_monitor_30000 (100) confirmed on node2 (rc=0)
> lrmd: [6849]: info: operation monitor[172] on p_apache for client 6852: pid 1669 exited with return code 0
> lrmd: [6849]: info: RA output: (p_lsb_nmb:start:stdout) Rather than invoking init scripts through /etc/init.d, use the service(8)#012utility, e.g. service nmbd start#012#012Since the script you are attempting to invoke has been converted to an#012Upstart job, you may also use the start(8) utility, e.g. start nmbd
> crmd: [6852]: info: process_lrm_event: LRM operation p_apache_monitor_40000 (call=172, rc=0, cib-update=276, confirmed=false) ok
> crmd: [6852]: info: match_graph_event: Action p_apache_monitor_40000 (94) confirmed on node2 (rc=0)
> pengine: [29654]: notice: process_pe_message: Transition 26: PEngine Input stored in: /var/lib/pengine/pe-input-1008.bz2
> smbd[1690]: [2012/01/25 13:17:45, 0] smbd/server.c:1115(main)
> lrmd: [6849]: info: RA output: (p_lsb_nmb:start:stdout) nmbd start/running, process 1730
> lrmd: [6849]: info: operation start[175] on p_lsb_nmb for client 6852: pid 1699 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_lsb_nmb_start_0 (call=175, rc=0, cib-update=277, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_lsb_nmb_start_0 (101) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 102: monitor p_lsb_nmb_monitor_30000 on node2 (local)
> lrmd: [6849]: info: rsc:p_lsb_nmb monitor[176] (pid 1731)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=102:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_lsb_nmb_monitor_30000 )
> crmd: [6852]: info: te_rsc_command: Initiating action 103: start p_lsb_nfsserver_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=103:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_lsb_nfsserver_start_0 )
> smbd[1690]: standard input is not a socket, assuming -D option
> lrmd: [6849]: info: rsc:p_lsb_nfsserver start[177] (pid 1732)
> lrmd: [1732]: WARN: For LSB init script, no additional parameters are needed.
> lrmd: [6849]: info: operation monitor[176] on p_lsb_nmb for client 6852: pid 1731 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_lsb_nmb_monitor_30000 (call=176, rc=0, cib-update=278, confirmed=false) ok
> crmd: [6852]: info: match_graph_event: Action p_lsb_nmb_monitor_30000 (102) confirmed on node2 (rc=0)
> lrmd: [6849]: info: RA output: (p_lsb_nfsserver:start:stdout) * Exporting directories for NFS kernel daemon...
> lrmd: [6849]: info: RA output: (p_lsb_nfsserver:start:stdout) ...done.#012 * Starting NFS kernel daemon
> kernel: [433321.967109] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
> kernel: [433321.967129] NFSD: unable to find recovery directory /var/lib/nfs/v4recovery
> kernel: [433321.967142] NFSD: starting 10-second grace period
> lrmd: [6849]: info: RA output: (p_lsb_nfsserver:start:stdout) ...done.
> lrmd: [6849]: info: operation start[177] on p_lsb_nfsserver for client 6852: pid 1732 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_lsb_nfsserver_start_0 (call=177, rc=0, cib-update=279, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_lsb_nfsserver_start_0 (103) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 104: monitor p_lsb_nfsserver_monitor_30000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=104:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_lsb_nfsserver_monitor_30000 )
> lrmd: [6849]: info: rsc:p_lsb_nfsserver monitor[178] (pid 1755)
> crmd: [6852]: info: te_rsc_command: Initiating action 105: start p_exportfs_mount1_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=105:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_exportfs_mount1_start_0 )
> lrmd: [6849]: info: rsc:p_exportfs_mount1 start[179] (pid 1757)
> exportfs[1757]: [1762]: INFO: Directory /mnt/storage/Mount2 is not exported to 192.168.0.0/255.255.255.0 (stopped).
> lrmd: [6849]: info: operation monitor[178] on p_lsb_nfsserver for client 6852: pid 1755 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_lsb_nfsserver_monitor_30000 (call=178, rc=0, cib-update=280, confirmed=false) ok
> exportfs[1757]: [1764]: INFO: Exporting file system ...
> crmd: [6852]: info: match_graph_event: Action p_lsb_nfsserver_monitor_30000 (104) confirmed on node2 (rc=0)
> exportfs[1757]: [1771]: INFO: exporting 192.168.0.0/255.255.255.0:/mnt/storage/Mount2
> exportfs[1757]: [1775]: INFO: File system exported
> lrmd: [6849]: info: operation start[179] on p_exportfs_mount1 for client 6852: pid 1757 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_exportfs_mount1_start_0 (call=179, rc=0, cib-update=281, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_exportfs_mount1_start_0 (105) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 106: monitor p_exportfs_mount1_monitor_10000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=106:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_exportfs_mount1_monitor_10000 )
> lrmd: [6849]: info: rsc:p_exportfs_mount1 monitor[180] (pid 1780)
> crmd: [6852]: info: te_rsc_command: Initiating action 107: start p_exportfs_mount2_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=107:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_exportfs_mount2_start_0 )
> lrmd: [6849]: info: rsc:p_exportfs_mount2 start[181] (pid 1781)
> exportfs[1781]: [1791]: INFO: Directory /mnt/storage/Mount1 is not exported to 192.168.0.0/255.255.255.0 (stopped).
> exportfs[1780]: [1790]: INFO: Directory /mnt/storage/Mount2 is exported to 192.168.0.0/255.255.255.0 (started).
> exportfs[1781]: [1793]: INFO: Exporting file system ...
> exportfs[1781]: [1801]: INFO: exporting 192.168.0.0/255.255.255.0:/mnt/storage/Mount1
> exportfs[1781]: [1805]: INFO: File system exported
> lrmd: [6849]: info: operation start[181] on p_exportfs_mount2 for client 6852: pid 1781 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_exportfs_mount2_start_0 (call=181, rc=0, cib-update=282, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_exportfs_mount2_start_0 (107) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_pseudo_action: Pseudo action 110 fired and confirmed
> crmd: [6852]: info: te_rsc_command: Initiating action 108: monitor p_exportfs_mount2_monitor_10000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=108:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_exportfs_mount2_monitor_10000 )
> lrmd: [6849]: info: rsc:p_exportfs_mount2 monitor[182] (pid 1806)
> exportfs[1806]: [1811]: INFO: Directory /mnt/storage/Mount1 is exported to 192.168.0.0/255.255.255.0 (started).
> crmd: [6852]: info: match_graph_event: Action p_fs_mount2_monitor_0 (13) confirmed on node1 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 12: probe_complete probe_complete on node1 - no waiting
> crmd: [6852]: info: te_pseudo_action: Pseudo action 11 fired and confirmed

> attrd: [6851]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-p_exportfs_mount2 (<null>)
> attrd: [6851]: notice: attrd_perform_update: Sent delete 648: node=039e53da-dce8-4fd7-84bc-7261682529e8, attr=fail-count-p_exportfs_mount2, id=<n/a>, set=(null), section=status
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node2']//lrm_resource[@id='p_exportfs_mount2'] (origin=local/crmd/294, version=1.193.21): ok (rc=0)
> crmd: [6852]: info: delete_resource: Removing resource p_exportfs_mount2 for 5885_mount2_resource (internal) on node2
> crmd: [6852]: info: notify_deleted: Notifying 5885_mount2_resource on node2 that p_exportfs_mount2 was deleted
> crmd: [6852]: WARN: decode_transition_key: Bad UUID (mount2-resource-5885) in sscanf result (3) for 0:0:mount2-resource-5885
> crmd: [6852]: info: send_direct_ack: ACK'ing resource op p_exportfs_mount2_delete_60000 from 0:0:mount2-resource-5885: lrm_invoke-lrmd-1327519232-253
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node2']//lrm_resource[@id='p_exportfs_mount2'] (origin=local/crmd/295, version=1.193.22): ok (rc=0)
> crmd: [6852]: info: abort_transition_graph: te_update_diff:178 - Triggered transition abort (complete=1, tag=transient_attributes, id=039e53da-dce8-4fd7-84bc-7261682529e8, magic=NA, cib=1.193.21) : Transient attribute: removal
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=p_exportfs_mount2_last_0, magic=0:0;107:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.193.22) : Resource op removal
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=p_exportfs_mount2_last_0, magic=0:0;107:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.193.22) : Resource op removal
> crmd: [6852]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> crmd: [6852]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> crmd: [6852]: info: do_pe_invoke: Query 298: Requesting the current CIB: S_POLICY_ENGINE
> crmd: [6852]: info: do_pe_invoke: Query 299: Requesting the current CIB: S_POLICY_ENGINE
> crmd: [6852]: info: do_pe_invoke: Query 300: Requesting the current CIB: S_POLICY_ENGINE
> cib: [6848]: info: cib:diff: - <cib admin_epoch="1" epoch="193" num_updates="22" >
> cib: [6848]: info: cib:diff: - <configuration >
> crmd: [6852]: info: abort_transition_graph: te_update_diff:124 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=1.194.1) : Non-status change
> cib: [6848]: info: cib:diff: - <mount2_config >
> crmd: [6852]: info: do_pe_invoke: Query 301: Requesting the current CIB: S_POLICY_ENGINE
> cib: [6848]: info: cib:diff: - <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: - <nvpair value="1327519064" id="cib-bootstrap-options-last-lrm-refresh" />
> cib: [6848]: info: cib:diff: - </cluster_property_set>
> cib: [6848]: info: cib:diff: - </mount2_config>
> cib: [6848]: info: cib:diff: - </configuration>
> cib: [6848]: info: cib:diff: - </cib>
> cib: [6848]: info: cib:diff: + <cib epoch="194" num_updates="1" admin_epoch="1" validate-with="pacemaker-1.2" mount2_feature_set="3.0.5" update-origin="node1" update-client="crmd" cib-last-written="Wed Jan 25 13:17:45 2012" have-quorum="1" dc-uuid="039e53da-dce8-4fd7-84bc-7261682529e8" >
> cib: [6848]: info: cib:diff: + <configuration >
> cib: [6848]: info: cib:diff: + <mount2_config >
> cib: [6848]: info: cib:diff: + <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: + <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1327519232" />
> cib: [6848]: info: cib:diff: + </cluster_property_set>
> cib: [6848]: info: cib:diff: + </mount2_config>
> cib: [6848]: info: cib:diff: + </configuration>
> cib: [6848]: info: cib:diff: + </cib>
> cib: [6848]: info: cib_process_request: Operation complete: op cib_modify for section mount2_config (origin=local/crmd/297, version=1.194.1): ok (rc=0)
> pengine: [29654]: notice: unpack_config: On loss of CCM Quorum: Ignore
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node2
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node2
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_drbd_mount2:1_monitor_30000 on node2: not running (7)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_drbd_mount1:0_last_failure_0 found resource p_drbd_mount1:0 active on node2
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount1 can fail 999999 more times on node2 before being forced off
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount1 can fail 999999 more times on node2 before being forced off
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount2 can fail 999997 more times on node2 before being forced off
> pengine: last message repeated 2 times
> pengine: [29654]: notice: common_apply_stickiness: p_exportfs_mount1 can fail 999999 more times on node2 before being forced off
> pengine: [29654]: notice: can_be_master: Forcing unmanaged master p_drbd_mount2:0 to remain promoted on node2
> crmd: [6852]: info: do_pe_invoke_callback: Invoking the PE: query=301, ref=pe_calc-dc-1327519232-254, seq=4, quorate=1
> crmd: [6852]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
> pengine: [29654]: notice: can_be_master: Forcing unmanaged master p_drbd_mount1:0 to remain promoted on node2
> crmd: [6852]: info: config_query_callback: Checking for expired actions every 900000ms
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_exportfs_mount2 on node2
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:0#011(Master unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:1#011(Stopped unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:0#011(Master unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:1#011(Stopped unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount1#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount2#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_ip_nfs#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mountbind1#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mountbind2#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mountbind3#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_varwww#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_apache#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_lsb_smb#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_lsb_nmb#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_lsb_nfsserver#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount1#011(Started node2)
> pengine: [29654]: notice: LogActions: Start p_exportfs_mount2#011(node2)
> crmd: [6852]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> crmd: [6852]: info: unpack_graph: Unpacked transition 34: 7 actions in 7 synapses
> crmd: [6852]: info: do_te_invoke: Processing graph 34 (ref=pe_calc-dc-1327519232-254) derived from /var/lib/pengine/pe-input-1016.bz2
> crmd: [6852]: info: te_pseudo_action: Pseudo action 113 fired and confirmed
> crmd: [6852]: info: te_rsc_command: Initiating action 19: monitor p_exportfs_mount2_monitor_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=19:34:7:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_exportfs_mount2_monitor_0 )
> lrmd: [6849]: info: rsc:p_exportfs_mount2 probe[184] (pid 5887)
> exportfs[5887]: [5892]: INFO: Directory /mnt/storage/Mount1 is exported to 192.168.0.0/255.255.255.0s (started).
> pengine: [29654]: notice: process_pe_message: Transition 34: PEngine Input stored in: /var/lib/pengine/pe-input-1016.bz2
> cib: [6848]: info: cib_process_request: Operation complete: op cib_sync_one for section 'all' (origin=node1/node1/(null), version=1.194.1): ok (rc=0)
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=p_exportfs_mount2_last_0, magic=0:7;20:8:7:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.194.2) : Resource op removal
> crmd: [6852]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
> crmd: [6852]: info: update_abort_priority: Abort action done superceeded by restart
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node1']//lrm_resource[@id='p_exportfs_mount2'] (origin=node1/crmd/51, version=1.194.1): ok (rc=0)
> exportfs[5894]: [5899]: INFO: Directory /mnt/storage/Mount2 is exported to 192.168.0.0/255.255.255.0 (started).
> cib: [6848]: info: cib_process_request: Operation complete: op cib_sync_one for section 'all' (origin=node1/node1/(null), version=1.194.1): ok (rc=0)
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=p_exportfs_mount2_last_0, magic=0:7;20:8:7:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.194.2) : Resource op removal
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node1']//lrm_resource[@id='p_exportfs_mount2'] (origin=node1/crmd/52, version=1.194.2): ok (rc=0)
> crmd: [6852]: info: abort_transition_graph: te_update_diff:124 - Triggered transition abort (complete=0, tag=diff, id=(null), magic=NA, cib=1.195.1) : Non-status change
> cib: [6848]: info: cib:diff: - <cib admin_epoch="1" epoch="194" num_updates="3" >
> cib: [6848]: info: cib:diff: - <configuration >
> cib: [6848]: info: cib:diff: - <mount2_config >
> cib: [6848]: info: cib:diff: - <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: - <nvpair value="1327519232" id="cib-bootstrap-options-last-lrm-refresh" />
> cib: [6848]: info: cib:diff: - </cluster_property_set>
> cib: [6848]: info: cib:diff: - </mount2_config>
> cib: [6848]: info: cib:diff: - </configuration>
> cib: [6848]: info: cib:diff: - </cib>
> cib: [6848]: info: cib:diff: + <cib epoch="195" num_updates="1" admin_epoch="1" validate-with="pacemaker-1.2" mount2_feature_set="3.0.5" update-origin="node2" update-client="crmd" cib-last-written="Wed Jan 25 13:20:32 2012" have-quorum="1" dc-uuid="039e53da-dce8-4fd7-84bc-7261682529e8" >
> cib: [6848]: info: cib:diff: + <configuration >
> cib: [6848]: info: cib:diff: + <mount2_config >
> cib: [6848]: info: cib:diff: + <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: + <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1327519233" />
> cib: [6848]: info: cib:diff: + </cluster_property_set>
> cib: [6848]: info: cib:diff: + </mount2_config>
> cib: [6848]: info: cib:diff: + </configuration>
> cib: [6848]: info: cib:diff: + </cib>
> cib: [6848]: info: cib_process_request: Operation complete: op cib_modify for section mount2_config (origin=node1/crmd/54, version=1.195.1): ok (rc=0)
> crmd: [6852]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
> crmd: [6852]: info: config_query_callback: Checking for expired actions every 900000ms
> lrmd: [6849]: info: operation monitor[184] on p_exportfs_mount2 for client 6852: pid 5887 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_exportfs_mount2_monitor_0 (call=184, rc=0, cib-update=304, confirmed=true) ok
> crmd: [6852]: WARN: status_from_rc: Action 19 (p_exportfs_mount2_monitor_0) on node2 failed (target: 7 vs. rc: 0): Error
> crmd: [6852]: info: abort_transition_graph: match_graph_event:277 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=p_exportfs_mount2_last_failure_0, magic=0:0;19:34:7:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.195.3) : Event failed
> crmd: [6852]: info: match_graph_event: Action p_exportfs_mount2_monitor_0 (19) confirmed on node2 (rc=4)
> crmd: [6852]: info: te_rsc_command: Initiating action 18: probe_complete probe_complete on node2 (local) - no waiting
> crmd: [6852]: info: run_graph: ====================================================
> crmd: [6852]: notice: run_graph: Transition 34 (Complete=3, Pending=0, Fired=0, Skipped=4, Incomplete=0, Source=/var/lib/pengine/pe-input-1016.bz2): Stopped
> crmd: [6852]: info: te_graph_trigger: Transition 34 is now complete
> crmd: [6852]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
> crmd: [6852]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> crmd: [6852]: info: do_pe_invoke: Query 305: Requesting the current CIB: S_POLICY_ENGINE
> crmd: [6852]: info: do_pe_invoke_callback: Invoking the PE: query=305, ref=pe_calc-dc-1327519247-257, seq=4, quorate=1
> pengine: [29654]: notice: unpack_config: On loss of CCM Quorum: Ignore
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node2
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node2
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_drbd_mount2:1_monitor_30000 on node2: not running (7)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_drbd_mount1:0_last_failure_0 found resource p_drbd_mount1:0 active on node2
> pengine: [29654]: notice: unpack_rsc_op: Operation p_exportfs_mount2_last_failure_0 found resource p_exportfs_mount2 active on node2
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount1 can fail 999999 more times on node2 before being forced off
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount1 can fail 999999 more times on node2 before being forced off
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount2 can fail 999997 more times on node2 before being forced off
> pengine: last message repeated 2 times
> pengine: [29654]: notice: common_apply_stickiness: p_exportfs_mount1 can fail 999999 more times on node2 before being forced off
> pengine: [29654]: notice: can_be_master: Forcing unmanaged master p_drbd_mount2:0 to remain promoted on node2
> pengine: [29654]: notice: can_be_master: Forcing unmanaged master p_drbd_mount1:0 to remain promoted on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_exportfs_mount2 on node2
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:0#011(Master unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:1#011(Stopped unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:0#011(Master unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:1#011(Stopped unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount1#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount2#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_ip_nfs#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mountbind1#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mountbind2#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mountbind3#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_varwww#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_apache#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_lsb_smb#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_lsb_nmb#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_lsb_nfsserver#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount1#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount2#011(Started node2)
> crmd: [6852]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> crmd: [6852]: info: unpack_graph: Unpacked transition 35: 4 actions in 4 synapses
> crmd: [6852]: info: do_te_invoke: Processing graph 35 (ref=pe_calc-dc-1327519247-257) derived from /var/lib/pengine/pe-input-1017.bz2
> crmd: [6852]: info: te_rsc_command: Initiating action 113: monitor p_exportfs_mount2_monitor_10000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=113:35:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_exportfs_mount2_monitor_10000 )
> lrmd: [6849]: info: rsc:p_exportfs_mount2 monitor[185] (pid 6376)
> crmd: [6852]: info: te_rsc_command: Initiating action 18: monitor p_exportfs_mount2_monitor_0 on node1
> exportfs[6376]: [6381]: INFO: Directory /mnt/storage/Mount1 is exported to 192.168.0.0/255.255.255.0 (started).
> pengine: [29654]: notice: process_pe_message: Transition 35: PEngine Input stored in: /var/lib/pengine/pe-input-1017.bz2
> crmd: [6852]: info: match_graph_event: Action p_exportfs_mount2_monitor_0 (18) confirmed on node1 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 17: probe_complete probe_complete on node1 - no waiting
> crmd: [6852]: info: te_pseudo_action: Pseudo action 16 fired and confirmed
> exportfs[6603]: [6608]: INFO: Directory /mnt/storage/Mount2 is exported to 192.168.0.0/255.255.255.0 (started).
> lrmd: [6849]: info: operation monitor[185] on p_exportfs_mount2 for client 6852: pid 6376 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_exportfs_mount2_monitor_10000 (call=185, rc=0, cib-update=306, confirmed=false) ok
> crmd: [6852]: info: match_graph_event: Action p_exportfs_mount2_monitor_10000 (113) confirmed on node2 (rc=0)
> crmd: [6852]: info: run_graph: ====================================================
> crmd: [6852]: notice: run_graph: Transition 35 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-1017.bz2): Complete
> crmd: [6852]: info: te_graph_trigger: Transition 35 is now complete
> crmd: [6852]: info: notify_crmd: Transition 35 status: done - <null>
> crmd: [6852]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> crmd: [6852]: info: do_state_transition: Starting PEngine Recheck Timer
> exportfs[6909]: [6914]: INFO: Directory /mnt/storage/Mount1 is exported to 192.168.0.0/255.255.255.0 (started).
> exportfs[7358]: [7363]: INFO: Directory /mnt/storage/Mount2 is exported to 192.168.0.0/255.255.255.0 (started).
> exportfs[7424]: [7429]: INFO: Directory /mnt/storage/Mount1 is exported to 192.168.0.0/255.255.255.0 (started).
> ntpd[1213]: Listening on interface #10 eth0, 10.52.4.8#123 Enabled
> ntpd[1213]: new interface(s) found: waking up resolver

> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


amartin at xes-inc

Jan 26, 2012, 3:16 PM

Post #3 of 16 (1850 views)
Permalink
Re: Nodes will not promote DRBD resources to master on failover [In reply to]

Hi Dejan,


Thanks for the tip, I have now tried ocf-tester with the DRBD resource on each node. In both cases it reports that there is nothing wrong with the DRBD resource's configuration:

# ocf-tester -n ms_drbd_mount1 -o drbd_resource="mount1" /usr/lib/ocf/resource.d/heartbeat/drbd
Beginning tests for /usr/lib/ocf/resource.d/heartbeat/drbd...
su: Permission denied
(Ignored)
/usr/lib/ocf/resource.d/heartbeat/drbd passed all tests


At this point the DRBD service is in slave mode on both nodes:

Online: [ node1 node2 ]


Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
Slaves: [ node2 node1 ]
Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
Slaves: [ node2 node1 ]


Attached is the output from /var/log/syslog if I attempt to start/promote/cleanup ms_drbd_mount1 or if I attempt to start/restart/cleanup p_drbd_mount1. The configuration of these resources is as follows:
crm(live)configure# show ms_drbd_mount1
ms ms_drbd_mount1 p_drbd_mount1 \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started" is-managed="true"
crm(live)configure# show p_drbd_mount1

primitive p_drbd_tools ocf:linbit:drbd \
params drbd_resource="mount1" \
op start interval="0" timeout="240" \
op stop interval="0" timeout="100" \
op monitor interval="15" role="Master" \
op monitor interval="30" role="Slave"


I tried reverting to an earlier version of my pacemaker config (cibadmin --replace --xml-file=old.xml) and this problem disappears. What else can I try to debug this?


Thanks,


Andrew

----- Original Message -----

From: "Dejan Muhamedagic" <dejanmm [at] fastmail>
To: "The Pacemaker cluster resource manager" <pacemaker [at] oss>
Sent: Thursday, January 26, 2012 7:01:26 AM
Subject: Re: [Pacemaker] Nodes will not promote DRBD resources to master on failover

Hi,

On Wed, Jan 25, 2012 at 01:58:44PM -0600, Andrew Martin wrote:
> Hello,
>
>
> Recently I finished configuring a two-node cluster with pacemaker 1.1.6 and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster includes the following resources:
> - primitives for DRBD storage devices
> - primitives for mounting the filesystem on the DRBD storage
> - primitives for some mount binds
> - primitive for starting apache
> - primitives for starting samba and nfs servers (following instructions here )
> - primitives for exporting nfs shares (ocf:heartbeat:exportfs)
>
>
> Perhaps this is best described through the output of crm_mon:
>
> Online: [ node1 node2 ]
>
>
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
> p_drbd_mount1:0 (ocf::linbit:drbd): Started node2 (unmanaged)
> p_drbd_mount1:1 (ocf::linbit:drbd): Started node1 (unmanaged) FAILED
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> p_drbd_mount2:0 (ocf::linbit:drbd): Master node1 (unmanaged) FAILED
> Slaves: [ node2 ]
> Resource Group: g_core
> p_fs_mount1 (ocf::heartbeat:Filesystem): Started node1
> p_fs_mount2 (ocf::heartbeat:Filesystem): Started node1
> p_ip_nfs (ocf::heartbeat:IPaddr2): Started node1
> Resource Group: g_apache
> p_fs_mountbind1 (ocf::heartbeat:Filesystem): Started node1
> p_fs_mountbind2 (ocf::heartbeat:Filesystem): Started node1
> p_fs_mountbind3 (ocf::heartbeat:Filesystem): Started node1
> p_fs_varwww (ocf::heartbeat:Filesystem): Started node1
> p_apache (ocf::heartbeat:apache): Started node1
> Resource Group: g_fileservers
> p_lsb_smb (lsb:smbd): Started node1
> p_lsb_nmb (lsb:nmbd): Started node1
> p_lsb_nfsserver (lsb:nfs-kernel-server): Started node1
> p_exportfs_mount1 (ocf::heartbeat:exportfs): Started node1
> p_exportfs_mount2 (ocf::heartbeat:exportfs): Started node1
>
>
> I have read through the Pacemaker Explained documentation, however could not find a way to further debug these problems. First, I put node1 into standby mode to attempt failover to the other node (node2). Node2 appeared to start the transition to master, however it failed to promote the DRBD resources to master (the first step). I have attached a copy of this session in commands.log and additional excerpts from /var/log/syslog during important steps. I have attempted everything I can think of to try and start the DRBD resource (e.g. start/stop/promote/manage/cleanup under crm resource , restarting heartbeat) but cannot bring it out of the slave state. However, if I set it to unmanaged and then run drbdadm primary all in the terminal, pacemaker is satisfied and continues starting the rest of the resources. It then failed when attempting to mount the filesystem for mount2, the p_fs_mount2 resource. I attempted to mount the filesystem myself and was successful. I then unmount!
ed it and ran cleanup on p_fs_mount2 and then it mounted. The rest of the resources started as expected until the p_exportfs_mount2 resource, which failed as follows:
> p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2 (unmanaged) FAILED
>
>
> I ran cleanup on this and it started, however when running this test earlier today no command could successfully start this exportfs resource.
>
>
> How can I configure pacemaker to better resolve these problems and be able to bring the node up successfully on its own? What can I check to determine why these failures are occuring? /var/log/syslog did not seem to contain very much useful information regarding why the failures occurred.

All of the issues you encountered seemed to be specific to
resources you want to run. Best to stop the cluster and use
ocf-tester to see where could be a problem. If the problems are
intermittent, then you'll need to dig through the logs---look for
messages coming from lrmd and the RA itself.

Thanks,

Dejan


>
> Thanks,
>
>
> Andrew
>
>
>

> Node node1 (d6fbcffe-ff23-4031-b108-428eabe19b7b): standby
> Online: [ node2 ]
>
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
> Slaves: [ node2 ]
> Stopped: [ p_drbd_mount1:1 ]
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> Slaves: [ node2 ]
> Stopped: [ p_drbd_mount2:1 ]
>
> Failed actions:
> p_fs_mount2_start_0 (node=node2, call=78, rc=1, status=complete): unknown error
> p_drbd_mount2:1_monitor_30000 (node=node2, call=86, rc=7, status=complete): not running
>
>
> # mount2 resource cleanup ms_drbd_mount1
> ---> see cleanup-ms_drbd_mount1.log
>
> # mount2 resource unmanage p_drbd_mount1
> # mount2_mon -1
> Node node1 (d6fbcffe-ff23-4031-b108-428eabe19b7b): standby
> Online: [ node2 ]
>
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
> p_drbd_mount1:0 (ocf::linbit:drbd): Master node2 (unmanaged)
> Stopped: [ p_drbd_mount1:1 ]
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> Slaves: [ node2 ]
> Stopped: [ p_drbd_mount2:1 ]
>
> Failed actions:
> p_fs_mount2_start_0 (node=node2, call=78, rc=1, status=complete): unknown error
> p_drbd_mount2:1_monitor_30000 (node=node2, call=86, rc=7, status=complete): not running
>
>
> # mount2 resource unmanage p_drbd_mount2
> # drbdadm primary all
> # mount2_mon -1
> Node node1 (d6fbcffe-ff23-4031-b108-428eabe19b7b): standby
> Online: [ node2 ]
>
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
> p_drbd_mount1:0 (ocf::linbit:drbd): Master node2 (unmanaged)
> Stopped: [ p_drbd_mount1:1 ]
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2] (unmanaged)
> p_drbd_mount2:0 (ocf::linbit:drbd): Master node2 (unmanaged)
> Stopped: [ p_drbd_mount2:1 ]
> Resource Group: g_core
> p_fs_mount1 (ocf::heartbeat:Filesystem): Started node2
> p_fs_mount2 (ocf::heartbeat:Filesystem): Stopped
> p_ip_nfs (ocf::heartbeat:IPaddr2): Stopped
>
> Failed actions:
> p_fs_mount2_start_0 (node=node2, call=78, rc=1, status=complete): unknown error
> p_drbd_mount2:1_monitor_30000 (node=node2, call=86, rc=7, status=complete): not running
>
> # mount2 resource cleanup p_fs_mount2
> ---> see cleanup-p_fs_mount2.log
> # mount2_mon -1
> Node node1 (d6fbcffe-ff23-4031-b108-428eabe19b7b): standby
> Online: [ node2 ]
>
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
> p_drbd_mount1:0 (ocf::linbit:drbd): Master node2 (unmanaged)
> Stopped: [ p_drbd_mount1:1 ]
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2] (unmanaged)
> p_drbd_mount2:0 (ocf::linbit:drbd): Master node2 (unmanaged)
> Stopped: [ p_drbd_mount2:1 ]
> Resource Group: g_core
> p_fs_mount1 (ocf::heartbeat:Filesystem): Started node2
> p_fs_mount2 (ocf::heartbeat:Filesystem): Started node2
> p_ip_nfs (ocf::heartbeat:IPaddr2): Started node2
> Resource Group: g_apache
> p_fs_varsupport (ocf::heartbeat:Filesystem): Started node2
> p_fs_webdataeco (ocf::heartbeat:Filesystem): Started node2
> p_fs_webdatactl (ocf::heartbeat:Filesystem): Started node2
> p_fs_varwww (ocf::heartbeat:Filesystem): Started node2
> p_apache (ocf::heartbeat:apache): Started node2
> Resource Group: g_fileservers
> p_lsb_smb (lsb:smbd): Started node2
> p_lsb_nmb (lsb:nmbd): Started node2
> p_lsb_nfsserver (lsb:nfs-kernel-server): Started node2
> p_exportfs_mount1 (ocf::heartbeat:exportfs): Started node2
> p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2 (unmanaged) FAILED
>
> # mount2 resource cleanup p_exportfs_mount2
> ---> see cleanup-p_exportfs_mount2.log
>
>

> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node2']//lrm_resource[@id='p_drbd_mount1:0'] (origin=local/crmd/212, version=1.186.4): ok (rc=0)
> crmd: [6852]: info: delete_resource: Removing resource p_drbd_mount1:0 for 29626_mount2_resource (internal) on node2
> crmd: [6852]: info: lrm_remove_deleted_op: Removing op p_drbd_mount1:0_monitor_30000:151 for deleted resource p_drbd_mount1:0
> crmd: [6852]: info: notify_deleted: Notifying 29626_mount2_resource on node2 that p_drbd_mount1:0 was deleted
> crmd: [6852]: WARN: decode_transition_key: Bad UUID (mount2-resource-29626) in sscanf result (3) for 0:0:mount2-resource-29626
> crmd: [6852]: info: send_direct_ack: ACK'ing resource op p_drbd_mount1:0_delete_60000 from 0:0:mount2-resource-29626: lrm_invoke-lrmd-1327518060-185
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node2']//lrm_resource[@id='p_drbd_mount1:0'] (origin=local/crmd/213, version=1.186.5): ok (rc=0)
> crmd: [6852]: notice: do_lrm_invoke: Not creating resource for a delete event: (null)
> crmd: [6852]: info: notify_deleted: Notifying 29626_mount2_resource on node2 that p_drbd_mount1:1 was deleted
> crmd: [6852]: WARN: decode_transition_key: Bad UUID (mount2-resource-29626) in sscanf result (3) for 0:0:mount2-resource-29626
> crmd: [6852]: info: send_direct_ack: ACK'ing resource op p_drbd_mount1:1_delete_60000 from 0:0:mount2-resource-29626: lrm_invoke-lrmd-1327518060-186
> cib: [6848]: info: cib:diff: - <cib admin_epoch="1" epoch="186" num_updates="5" >
> cib: [6848]: info: cib:diff: - <configuration >
> cib: [6848]: info: cib:diff: - <mount2_config >
> cib: [6848]: info: cib:diff: - <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: - <nvpair value="1327518038" id="cib-bootstrap-options-last-lrm-refresh" />
> cib: [6848]: info: cib:diff: - </cluster_property_set>
> cib: [6848]: info: cib:diff: - </mount2_config>
> cib: [6848]: info: cib:diff: - </configuration>
> cib: [6848]: info: cib:diff: - </cib>
> cib: [6848]: info: cib:diff: + <cib epoch="187" num_updates="1" admin_epoch="1" validate-with="pacemaker-1.2" mount2_feature_set="3.0.5" update-origin="node1" update-client="crmd" cib-last-written="Wed Jan 25 13:00:39 2012" have-quorum="1" dc-uuid="039e53da-dce8-4fd7-84bc-7261682529e8" >
> cib: [6848]: info: cib:diff: + <configuration >
> cib: [6848]: info: cib:diff: + <mount2_config >
> cib: [6848]: info: cib:diff: + <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: + <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1327518060" />
> cib: [6848]: info: cib:diff: + </cluster_property_set>
> cib: [6848]: info: cib:diff: + </mount2_config>
> cib: [6848]: info: cib:diff: + </configuration>
> cib: [6848]: info: cib:diff: + </cib>
> cib: [6848]: info: cib_process_request: Operation complete: op cib_modify for section mount2_config (origin=local/crmd/215, version=1.187.1): ok (rc=0)
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node2']//lrm_resource[@id='p_drbd_mount1:1'] (origin=local/crmd/216, version=1.187.2): ok (rc=0)
> crmd: [6852]: WARN: decode_transition_key: Bad UUID (mount2-resource-29626) in sscanf result (3) for 0:0:mount2-resource-29626
> crmd: [6852]: info: send_direct_ack: ACK'ing resource op p_drbd_mount1:1_delete_60000 from 0:0:mount2-resource-29626: lrm_invoke-lrmd-1327518060-187
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=p_drbd_mount1:0_last_failure_0, magic=0:0;6:10:7:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.186.5) : Resource op removal
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=p_drbd_mount1:0_last_failure_0, magic=0:0;6:10:7:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.186.5) : Resource op removal
> crmd: [6852]: info: abort_transition_graph: te_update_diff:124 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=1.187.1) : Non-status change
> crmd: [6852]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> crmd: [6852]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> crmd: [6852]: info: do_pe_invoke: Query 219: Requesting the current CIB: S_POLICY_ENGINE
> crmd: [6852]: info: do_pe_invoke: Query 220: Requesting the current CIB: S_POLICY_ENGINE
> crmd: [6852]: info: do_pe_invoke: Query 221: Requesting the current CIB: S_POLICY_ENGINE
> cib: [6848]: info: cib_process_request: Operation complete: op cib_modify for section mount2_config (origin=local/crmd/218, version=1.187.3): ok (rc=0)
> crmd: [6852]: info: do_pe_invoke_callback: Invoking the PE: query=221, ref=pe_calc-dc-1327518060-188, seq=4, quorate=1
> lrmd: [6849]: info: rsc:p_drbd_mount1:0 probe[152] (pid 29627)
> crmd: [6852]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> pengine: [29654]: notice: unpack_config: On loss of CCM Quorum: Ignore
> crmd: [6852]: info: unpack_graph: Unpacked transition 14: 12 actions in 12 synapses
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node1
> crmd: [6852]: info: do_te_invoke: Processing graph 14 (ref=pe_calc-dc-1327518060-188) derived from /var/lib/pengine/pe-input-996.bz2
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node1
> crmd: [6852]: info: te_rsc_command: Initiating action 6: monitor p_drbd_mount1:0_monitor_0 on node2 (local)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node2
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=6:14:7:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_drbd_mount1:0_monitor_0 )
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_fs_mount2_last_failure_0 on node2: unknown error (1)
> crmd: [6852]: info: te_pseudo_action: Pseudo action 11 fired and confirmed
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node2
> crmd: [6852]: info: te_pseudo_action: Pseudo action 12 fired and confirmed
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_drbd_mount2:1_monitor_30000 on node2: not running (7)
> crmd: [6852]: info: te_pseudo_action: Pseudo action 9 fired and confirmed
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount2 can fail 999998 more times on node2 before being forced off
> pengine: last message repeated 2 times
> pengine: [29654]: WARN: common_apply_stickiness: Forcing p_fs_mount2 away from node2 after 1000000 failures (max=1000000)
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_drbd_mount1:0 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_drbd_mount1:0 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_varsupport on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind2 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind3 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_varwww on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (40s) for p_apache on node2
> pengine: [29654]: notice: LogActions: Start p_drbd_mount1:0#011(node2)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:0#011(Slave node2)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount2#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_ip_nfs#011(Stopped)
> pengine: [29654]: notice: LogActions: Start p_fs_varsupport#011(node2)
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind2#011(node2)
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind3#011(node2)
> pengine: [29654]: notice: LogActions: Start p_fs_varwww#011(node2)
> pengine: [29654]: notice: LogActions: Start p_apache#011(node2)
> pengine: [29654]: notice: LogActions: Leave p_lsb_smb#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_lsb_nmb#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_lsb_nfsserver#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount2#011(Stopped)
> crmd: [6852]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
> crmd: [6852]: info: config_query_callback: Checking for expired actions every 900000ms
> lrmd: [6849]: info: operation monitor[152] on p_drbd_mount1:0 for client 6852: pid 29627 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_drbd_mount1:0_monitor_0 (call=152, rc=0, cib-update=223, confirmed=true) ok
> crmd: [6852]: WARN: status_from_rc: Action 6 (p_drbd_mount1:0_monitor_0) on node2 failed (target: 7 vs. rc: 0): Error
> crmd: [6852]: info: abort_transition_graph: match_graph_event:277 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=p_drbd_mount1:0_last_failure_0, magic=0:0;6:14:7:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.187.4) : Event failed
> crmd: [6852]: info: update_abort_priority: Abort priority upgraded from 0 to 1
> crmd: [6852]: info: update_abort_priority: Abort action done superceeded by restart
> crmd: [6852]: info: match_graph_event: Action p_drbd_mount1:0_monitor_0 (6) confirmed on node2 (rc=4)
> crmd: [6852]: info: te_rsc_command: Initiating action 5: probe_complete probe_complete on node2 (local) - no waiting
> crmd: [6852]: info: run_graph: ====================================================
> crmd: [6852]: notice: run_graph: Transition 14 (Complete=5, Pending=0, Fired=0, Skipped=3, Incomplete=4, Source=/var/lib/pengine/pe-input-996.bz2): Stopped
> crmd: [6852]: info: te_graph_trigger: Transition 14 is now complete
> crmd: [6852]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
> crmd: [6852]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> crmd: [6852]: info: do_pe_invoke: Query 224: Requesting the current CIB: S_POLICY_ENGINE
> crmd: [6852]: info: do_pe_invoke_callback: Invoking the PE: query=224, ref=pe_calc-dc-1327518060-191, seq=4, quorate=1
> pengine: [29654]: notice: process_pe_message: Transition 14: PEngine Input stored in: /var/lib/pengine/pe-input-996.bz2
> pengine: [29654]: notice: unpack_config: On loss of CCM Quorum: Ignore
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node2
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_fs_mount2_last_failure_0 on node2: unknown error (1)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node2
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_drbd_mount2:1_monitor_30000 on node2: not running (7)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_drbd_mount1:0_last_failure_0 found resource p_drbd_mount1:0 active on node2
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount2 can fail 999998 more times on node2 before being forced off
> pengine: last message repeated 2 times
> pengine: [29654]: WARN: common_apply_stickiness: Forcing p_fs_mount2 away from node2 after 1000000 failures (max=1000000)
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_drbd_mount1:0 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_drbd_mount1:0 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_varsupport on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind2 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind3 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_varwww on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (40s) for p_apache on node2
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:0#011(Slave node2)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:0#011(Slave node2)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount2#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_ip_nfs#011(Stopped)
> crmd: [6852]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> pengine: [29654]: notice: LogActions: Start p_fs_varsupport#011(node2)
> crmd: [6852]: info: unpack_graph: Unpacked transition 15: 1 actions in 1 synapses
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind2#011(node2)
> crmd: [6852]: info: do_te_invoke: Processing graph 15 (ref=pe_calc-dc-1327518060-191) derived from /var/lib/pengine/pe-input-997.bz2
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind3#011(node2)
> lrmd: [6849]: info: rsc:p_drbd_mount1:0 monitor[153] (pid 29656)
> crmd: [6852]: info: te_rsc_command: Initiating action 9: monitor p_drbd_mount1:0_monitor_30000 on node2 (local)
> pengine: [29654]: notice: LogActions: Start p_fs_varwww#011(node2)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=9:15:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_drbd_mount1:0_monitor_30000 )
> pengine: [29654]: notice: LogActions: Start p_apache#011(node2)
> pengine: [29654]: notice: LogActions: Leave p_lsb_smb#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_lsb_nmb#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_lsb_nfsserver#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount2#011(Stopped)
> heartbeat: [6835]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status took too long to execute: 60 ms (> 50 ms) (GSource: 0x225f400)
> pengine: [29654]: notice: process_pe_message: Transition 15: PEngine Input stored in: /var/lib/pengine/pe-input-997.bz2
> lrmd: [6849]: info: operation monitor[153] on p_drbd_mount1:0 for client 6852: pid 29656 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_drbd_mount1:0_monitor_30000 (call=153, rc=0, cib-update=225, confirmed=false) ok
> crmd: [6852]: info: match_graph_event: Action p_drbd_mount1:0_monitor_30000 (9) confirmed on node2 (rc=0)
> crmd: [6852]: info: run_graph: ====================================================
> crmd: [6852]: notice: run_graph: Transition 15 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-997.bz2): Complete
> crmd: [6852]: info: te_graph_trigger: Transition 15 is now complete
> crmd: [6852]: info: notify_crmd: Transition 15 status: done - <null>
> crmd: [6852]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> crmd: [6852]: info: do_state_transition: Starting PEngine Recheck Timer
> cib: [6848]: info: cib_process_request: Operation complete: op cib_sync_one for section 'all' (origin=node1/node1/(null), version=1.187.5): ok (rc=0)
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node1']//lrm_resource[@id='p_drbd_mount1:0'] (origin=node1/crmd/34, version=1.187.5): ok (rc=0)
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=p_drbd_mount1:0_last_0, magic=0:7;6:13:7:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.187.6) : Resource op removal
> crmd: [6852]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> crmd: [6852]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> crmd: [6852]: info: do_pe_invoke: Query 226: Requesting the current CIB: S_POLICY_ENGINE
> pengine: [29654]: notice: unpack_config: On loss of CCM Quorum: Ignore
> crmd: [6852]: info: do_pe_invoke_callback: Invoking the PE: query=226, ref=pe_calc-dc-1327518061-193, seq=4, quorate=1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node2
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_fs_mount2_last_failure_0 on node2: unknown error (1)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node2
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_drbd_mount2:1_monitor_30000 on node2: not running (7)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_drbd_mount1:0_last_failure_0 found resource p_drbd_mount1:0 active on node2
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount2 can fail 999998 more times on node2 before being forced off
> pengine: last message repeated 2 times
> pengine: [29654]: WARN: common_apply_stickiness: Forcing p_fs_mount2 away from node2 after 1000000 failures (max=1000000)
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_varsupport on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind2 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind3 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_varwww on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (40s) for p_apache on node2
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:0#011(Slave node2)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:0#011(Slave node2)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount2#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_ip_nfs#011(Stopped)
> crmd: [6852]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> pengine: [29654]: notice: LogActions: Start p_fs_varsupport#011(node2)
> crmd: [6852]: info: unpack_graph: Unpacked transition 16: 0 actions in 0 synapses
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind2#011(node2)
> crmd: [6852]: info: do_te_invoke: Processing graph 16 (ref=pe_calc-dc-1327518061-193) derived from /var/lib/pengine/pe-input-998.bz2
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind3#011(node2)
> crmd: [6852]: info: run_graph: ====================================================
> pengine: [29654]: notice: LogActions: Start p_fs_varwww#011(node2)
> crmd: [6852]: notice: run_graph: Transition 16 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-998.bz2): Complete
> pengine: [29654]: notice: LogActions: Start p_apache#011(node2)
> crmd: [6852]: info: te_graph_trigger: Transition 16 is now complete
> pengine: [29654]: notice: LogActions: Leave p_lsb_smb#011(Stopped)
> crmd: [6852]: info: notify_crmd: Transition 16 status: done - <null>
> pengine: [29654]: notice: LogActions: Leave p_lsb_nmb#011(Stopped)
> crmd: [6852]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> pengine: [29654]: notice: LogActions: Leave p_lsb_nfsserver#011(Stopped)
> crmd: [6852]: info: do_state_transition: Starting PEngine Recheck Timer
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount2#011(Stopped)
> pengine: [29654]: notice: process_pe_message: Transition 16: PEngine Input stored in: /var/lib/pengine/pe-input-998.bz2
> cib: [6848]: info: cib_process_request: Operation complete: op cib_sync_one for section 'all' (origin=node1/node1/(null), version=1.187.5): ok (rc=0)
> cib: [6848]: info: cib_process_request: Operation complete: op cib_sync_one for section 'all' (origin=node1/node1/(null), version=1.187.5): ok (rc=0)
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=p_drbd_mount1:0_last_0, magic=0:7;6:13:7:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.187.7) : Resource op removal
> crmd: [6852]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> crmd: [6852]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> crmd: [6852]: info: do_pe_invoke: Query 227: Requesting the current CIB: S_POLICY_ENGINE
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node1']//lrm_resource[@id='p_drbd_mount1:0'] (origin=node1/crmd/35, version=1.187.7): ok (rc=0)
> crmd: [6852]: info: abort_transition_graph: te_update_diff:124 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=1.188.1) : Non-status change
> cib: [6848]: info: cib:diff: - <cib admin_epoch="1" epoch="187" num_updates="7" >
> crmd: [6852]: info: do_pe_invoke: Query 228: Requesting the current CIB: S_POLICY_ENGINE
> cib: [6848]: info: cib:diff: - <configuration >
> cib: [6848]: info: cib:diff: - <mount2_config >
> cib: [6848]: info: cib:diff: - <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: - <nvpair value="1327518060" id="cib-bootstrap-options-last-lrm-refresh" />
> cib: [6848]: info: cib:diff: - </cluster_property_set>
> cib: [6848]: info: cib:diff: - </mount2_config>
> cib: [6848]: info: cib:diff: - </configuration>
> cib: [6848]: info: cib:diff: - </cib>
> cib: [6848]: info: cib:diff: + <cib epoch="188" num_updates="1" admin_epoch="1" validate-with="pacemaker-1.2" mount2_feature_set="3.0.5" update-origin="node2" update-client="crmd" cib-last-written="Wed Jan 25 13:01:00 2012" have-quorum="1" dc-uuid="039e53da-dce8-4fd7-84bc-7261682529e8" >
> cib: [6848]: info: cib:diff: + <configuration >
> cib: [6848]: info: cib:diff: + <mount2_config >
> cib: [6848]: info: cib:diff: + <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: + <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1327518061" />
> cib: [6848]: info: cib:diff: + </cluster_property_set>
> cib: [6848]: info: cib:diff: + </mount2_config>
> cib: [6848]: info: cib:diff: + </configuration>
> cib: [6848]: info: cib:diff: + </cib>
> cib: [6848]: info: cib_process_request: Operation complete: op cib_modify for section mount2_config (origin=node1/crmd/37, version=1.188.1): ok (rc=0)
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node1']//lrm_resource[@id='p_drbd_mount1:1'] (origin=node1/crmd/38, version=1.188.2): ok (rc=0)
> cib: [6848]: info: cib_process_request: Operation complete: op cib_modify for section mount2_config (origin=node1/crmd/40, version=1.188.3): ok (rc=0)
> crmd: [6852]: info: do_pe_invoke_callback: Invoking the PE: query=228, ref=pe_calc-dc-1327518062-194, seq=4, quorate=1
> crmd: [6852]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
> crmd: [6852]: info: config_query_callback: Checking for expired actions every 900000ms
> pengine: [29654]: notice: unpack_config: On loss of CCM Quorum: Ignore
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node2
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_fs_mount2_last_failure_0 on node2: unknown error (1)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node2
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_drbd_mount2:1_monitor_30000 on node2: not running (7)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_drbd_mount1:0_last_failure_0 found resource p_drbd_mount1:0 active on node2
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount2 can fail 999998 more times on node2 before being forced off
> pengine: last message repeated 2 times
> pengine: [29654]: WARN: common_apply_stickiness: Forcing p_fs_mount2 away from node2 after 1000000 failures (max=1000000)
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_varsupport on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind2 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind3 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_varwww on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (40s) for p_apache on node2
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:0#011(Slave node2)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:0#011(Slave node2)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount2#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_ip_nfs#011(Stopped)
> pengine: [29654]: notice: LogActions: Start p_fs_varsupport#011(node2)
> crmd: [6852]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind2#011(node2)
> crmd: [6852]: info: unpack_graph: Unpacked transition 17: 3 actions in 3 synapses
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind3#011(node2)
> crmd: [6852]: info: do_te_invoke: Processing graph 17 (ref=pe_calc-dc-1327518062-194) derived from /var/lib/pengine/pe-input-999.bz2
> pengine: [29654]: notice: LogActions: Start p_fs_varwww#011(node2)
> crmd: [6852]: info: te_rsc_command: Initiating action 6: monitor p_drbd_mount1:0_monitor_0 on node1
> pengine: [29654]: notice: LogActions: Start p_apache#011(node2)
> pengine: [29654]: notice: LogActions: Leave p_lsb_smb#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_lsb_nmb#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_lsb_nfsserver#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount1#011(Stopped)
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount2#011(Stopped)
> pengine: [29654]: notice: process_pe_message: Transition 17: PEngine Input stored in: /var/lib/pengine/pe-input-999.bz2
> crmd: [6852]: info: match_graph_event: Action p_drbd_mount1:0_monitor_0 (6) confirmed on node1 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 5: probe_complete probe_complete on node1 - no waiting
> crmd: [6852]: info: te_pseudo_action: Pseudo action 4 fired and confirmed
> crmd: [6852]: info: run_graph: ====================================================
> crmd: [6852]: notice: run_graph: Transition 17 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-999.bz2): Complete
> crmd: [6852]: info: te_graph_trigger: Transition 17 is now complete
> crmd: [6852]: info: notify_crmd: Transition 17 status: done - <null>
> crmd: [6852]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> crmd: [6852]: info: do_state_transition: Starting PEngine Recheck Timer

> attrd: [6851]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-p_fs_mount2 (<null>)
> attrd: [6851]: notice: attrd_perform_update: Sent delete 631: node=039e53da-dce8-4fd7-84bc-7261682529e8, attr=fail-count-p_fs_mount2, id=<n/a>, set=(null), section=status
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node2']//lrm_resource[@id='p_fs_mount2'] (origin=local/crmd/249, version=1.191.2): ok (rc=0)
> crmd: [6852]: info: delete_resource: Removing resource p_fs_mount2 for 1045_mount2_resource (internal) on node2
> crmd: [6852]: info: notify_deleted: Notifying 1045_mount2_resource on node2 that p_fs_mount2 was deleted
> crmd: [6852]: WARN: decode_transition_key: Bad UUID (mount2-resource-1045) in sscanf result (3) for 0:0:mount2-resource-1045
> crmd: [6852]: info: send_direct_ack: ACK'ing resource op p_fs_mount2_delete_60000 from 0:0:mount2-resource-1045: lrm_invoke-lrmd-1327519062-213
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node2']//lrm_resource[@id='p_fs_mount2'] (origin=local/crmd/250, version=1.191.3): ok (rc=0)
> crmd: [6852]: info: abort_transition_graph: te_update_diff:178 - Triggered transition abort (complete=1, tag=transient_attributes, id=039e53da-dce8-4fd7-84bc-7261682529e8, magic=NA, cib=1.191.2) : Transient attribute: removal
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=p_fs_mount2_last_0, magic=0:0;4:66:0:7c970883-1991-4dfe-bcef-ea0b59dff8ed, cib=1.191.3) : Resource op removal
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=p_fs_mount2_last_0, magic=0:0;4:66:0:7c970883-1991-4dfe-bcef-ea0b59dff8ed, cib=1.191.3) : Resource op removal
> crmd: [6852]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> crmd: [6852]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> crmd: [6852]: info: do_pe_invoke: Query 253: Requesting the current CIB: S_POLICY_ENGINE
> crmd: [6852]: info: do_pe_invoke: Query 254: Requesting the current CIB: S_POLICY_ENGINE
> crmd: [6852]: info: do_pe_invoke: Query 255: Requesting the current CIB: S_POLICY_ENGINE
> cib: [6848]: info: cib:diff: - <cib admin_epoch="1" epoch="191" num_updates="3" >
> crmd: [6852]: info: abort_transition_graph: te_update_diff:124 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=1.192.1) : Non-status change
> cib: [6848]: info: cib:diff: - <configuration >
> crmd: [6852]: info: do_pe_invoke: Query 256: Requesting the current CIB: S_POLICY_ENGINE
> cib: [6848]: info: cib:diff: - <mount2_config >
> cib: [6848]: info: cib:diff: - <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: - <nvpair value="1327518061" id="cib-bootstrap-options-last-lrm-refresh" />
> cib: [6848]: info: cib:diff: - </cluster_property_set>
> cib: [6848]: info: cib:diff: - </mount2_config>
> cib: [6848]: info: cib:diff: - </configuration>
> cib: [6848]: info: cib:diff: - </cib>
> cib: [6848]: info: cib:diff: + <cib epoch="192" num_updates="1" admin_epoch="1" validate-with="pacemaker-1.2" mount2_feature_set="3.0.5" update-origin="node2" update-client="cibadmin" cib-last-written="Wed Jan 25 13:17:01 2012" have-quorum="1" dc-uuid="039e53da-dce8-4fd7-84bc-7261682529e8" >
> cib: [6848]: info: cib:diff: + <configuration >
> cib: [6848]: info: cib:diff: + <mount2_config >
> cib: [6848]: info: cib:diff: + <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: + <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1327519062" />
> cib: [6848]: info: cib:diff: + </cluster_property_set>
> cib: [6848]: info: cib:diff: + </mount2_config>
> cib: [6848]: info: cib:diff: + </configuration>
> cib: [6848]: info: cib:diff: + </cib>
> cib: [6848]: info: cib_process_request: Operation complete: op cib_modify for section mount2_config (origin=local/crmd/252, version=1.192.1): ok (rc=0)
> crmd: [6852]: info: do_pe_invoke_callback: Invoking the PE: query=256, ref=pe_calc-dc-1327519062-214, seq=4, quorate=1
> crmd: [6852]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
> crmd: [6852]: info: config_query_callback: Checking for expired actions every 900000ms
> pengine: [29654]: notice: unpack_config: On loss of CCM Quorum: Ignore
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node2
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node2
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_drbd_mount2:1_monitor_30000 on node2: not running (7)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_drbd_mount1:0_last_failure_0 found resource p_drbd_mount1:0 active on node2
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount1 can fail 999999 more times on node2 before being forced off
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount1 can fail 999999 more times on node2 before being forced off
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount2 can fail 999997 more times on node2 before being forced off
> pengine: last message repeated 2 times
> pengine: [29654]: notice: can_be_master: Forcing unmanaged master p_drbd_mount2:0 to remain promoted on node2
> pengine: [29654]: notice: can_be_master: Forcing unmanaged master p_drbd_mount1:0 to remain promoted on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mount2 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_ip_nfs on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind1 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind2 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_mountbind3 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_fs_varwww on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (40s) for p_apache on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_lsb_smb on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_lsb_nmb on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_lsb_nfsserver on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_exportfs_mount1 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_exportfs_mount2 on node2
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:0#011(Master unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:1#011(Stopped unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:0#011(Master unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:1#011(Stopped unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount1#011(Started node2)
> pengine: [29654]: notice: LogActions: Start p_fs_mount2#011(node2)
> pengine: [29654]: notice: LogActions: Start p_ip_nfs#011(node2)
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind1#011(node2)
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind2#011(node2)
> pengine: [29654]: notice: LogActions: Start p_fs_mountbind3#011(node2)
> pengine: [29654]: notice: LogActions: Start p_fs_varwww#011(node2)
> pengine: [29654]: notice: LogActions: Start p_apache#011(node2)
> pengine: [29654]: notice: LogActions: Start p_lsb_smb#011(node2)
> pengine: [29654]: notice: LogActions: Start p_lsb_nmb#011(node2)
> pengine: [29654]: notice: LogActions: Start p_lsb_nfsserver#011(node2)
> pengine: [29654]: notice: LogActions: Start p_exportfs_mount1#011(node2)
> pengine: [29654]: notice: LogActions: Start p_exportfs_mount2#011(node2)
> crmd: [6852]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> crmd: [6852]: info: unpack_graph: Unpacked transition 25: 33 actions in 33 synapses
> crmd: [6852]: info: do_te_invoke: Processing graph 25 (ref=pe_calc-dc-1327519062-214) derived from /var/lib/pengine/pe-input-1007.bz2
> crmd: [6852]: info: te_pseudo_action: Pseudo action 74 fired and confirmed
> crmd: [6852]: info: te_rsc_command: Initiating action 8: monitor p_fs_mount2_monitor_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=8:25:7:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_mount2_monitor_0 )
> lrmd: [6849]: info: rsc:p_fs_mount2 probe[158] (pid 1047)
> lrmd: [6849]: info: operation monitor[158] on p_fs_mount2 for client 6852: pid 1047 exited with return code 7
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_mount2_monitor_0 (call=158, rc=7, cib-update=258, confirmed=true) not running
> crmd: [6852]: info: match_graph_event: Action p_fs_mount2_monitor_0 (8) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 7: probe_complete probe_complete on node2 (local) - no waiting
> crmd: [6852]: info: te_pseudo_action: Pseudo action 5 fired and confirmed
> crmd: [6852]: info: te_rsc_command: Initiating action 70: start p_fs_mount2_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=70:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_mount2_start_0 )
> lrmd: [6849]: info: rsc:p_fs_mount2 start[159] (pid 1079)
> pengine: [29654]: notice: process_pe_message: Transition 25: PEngine Input stored in: /var/lib/pengine/pe-input-1007.bz2
> Filesystem[1079]: [1108]: INFO: Running start for /dev/drbd1 on /mnt/storage/Mount1
> lrmd: [6849]: info: RA output: (p_fs_mount2:start:stderr) FATAL: Module scsi_hostadapter not found.
> kernel: [433319.639211] kjournald starting. Commit interval 5 seconds
> kernel: [433319.639229] EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
> kernel: [433319.639705] EXT3 FS on drbd1, internal journal
> kernel: [433319.639716] EXT3-fs: mounted filesystem with ordered data mode.
> lrmd: [6849]: info: operation start[159] on p_fs_mount2 for client 6852: pid 1079 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_mount2_start_0 (call=159, rc=0, cib-update=259, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_fs_mount2_start_0 (70) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 71: monitor p_fs_mount2_monitor_10000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=71:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_mount2_monitor_10000 )
> lrmd: [6849]: info: rsc:p_fs_mount2 monitor[160] (pid 1133)
> crmd: [6852]: info: te_rsc_command: Initiating action 72: start p_ip_nfs_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=72:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_ip_nfs_start_0 )
> lrmd: [6849]: info: rsc:p_ip_nfs start[161] (pid 1134)
> lrmd: [6849]: info: operation monitor[160] on p_fs_mount2 for client 6852: pid 1133 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_mount2_monitor_10000 (call=160, rc=0, cib-update=260, confirmed=false) ok
> IPaddr2[1134]: [1197]: INFO: ip -f inet addr add 10.52.4.8/16 brd 10.52.255.255 dev eth0
> crmd: [6852]: info: match_graph_event: Action p_fs_mount2_monitor_10000 (71) confirmed on node2 (rc=0)
> IPaddr2[1134]: [1200]: INFO: ip link set eth0 up
> IPaddr2[1134]: [1203]: INFO: /usr/lib/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.52.4.8 eth0 10.52.4.8 auto not_used not_used
> lrmd: [6849]: info: operation start[161] on p_ip_nfs for client 6852: pid 1134 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_ip_nfs_start_0 (call=161, rc=0, cib-update=261, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_ip_nfs_start_0 (72) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_pseudo_action: Pseudo action 75 fired and confirmed
> crmd: [6852]: info: te_rsc_command: Initiating action 73: monitor p_ip_nfs_monitor_30000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=73:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_ip_nfs_monitor_30000 )
> lrmd: [6849]: info: rsc:p_ip_nfs monitor[162] (pid 1212)
> crmd: [6852]: info: te_pseudo_action: Pseudo action 88 fired and confirmed
> crmd: [6852]: info: te_rsc_command: Initiating action 78: start p_fs_mountbind1_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=78:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_mountbind1_start_0 )
> lrmd: [6849]: info: rsc:p_fs_mountbind1 start[163] (pid 1214)
> lrmd: [6849]: info: operation monitor[162] on p_ip_nfs for client 6852: pid 1212 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_ip_nfs_monitor_30000 (call=162, rc=0, cib-update=262, confirmed=false) ok
> crmd: [6852]: info: match_graph_event: Action p_ip_nfs_monitor_30000 (73) confirmed on node2 (rc=0)
> Filesystem[1214]: [1270]: INFO: Running start for /mnt/storage/Mount2/Support on /var/support
> lrmd: [6849]: info: RA output: (p_fs_mountbind1:start:stderr) FATAL: Module scsi_hostadapter not found.
> lrmd: [6849]: info: operation start[163] on p_fs_mountbind1 for client 6852: pid 1214 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_mountbind1_start_0 (call=163, rc=0, cib-update=263, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_fs_mountbind1_start_0 (78) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 79: monitor p_fs_mountbind1_monitor_10000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=79:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_mountbind1_monitor_10000 )
> lrmd: [6849]: info: rsc:p_fs_mountbind1 monitor[164] (pid 1286)
> crmd: [6852]: info: te_rsc_command: Initiating action 80: start p_fs_mountbind2_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=80:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_mountbind2_start_0 )
> lrmd: [6849]: info: rsc:p_fs_mountbind2 start[165] (pid 1287)
> lrmd: [6849]: info: operation monitor[164] on p_fs_mountbind1 for client 6852: pid 1286 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_mountbind1_monitor_10000 (call=164, rc=0, cib-update=264, confirmed=false) ok
> crmd: [6852]: info: match_graph_event: Action p_fs_mountbind1_monitor_10000 (79) confirmed on node2 (rc=0)
> Filesystem[1287]: [1347]: INFO: Running start for /mnt/storage/Mount2/Products/eco on /var/support/web/data/eco
> lrmd: [6849]: info: RA output: (p_fs_mountbind2:start:stderr) FATAL: Module scsi_hostadapter not found.
> lrmd: [6849]: info: operation start[165] on p_fs_mountbind2 for client 6852: pid 1287 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_mountbind2_start_0 (call=165, rc=0, cib-update=265, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_fs_mountbind2_start_0 (80) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 81: monitor p_fs_mountbind2_monitor_10000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=81:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_mountbind2_monitor_10000 )
> lrmd: [6849]: info: rsc:p_fs_mountbind2 monitor[166] (pid 1363)
> crmd: [6852]: info: te_rsc_command: Initiating action 82: start p_fs_mountbind3_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=82:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_mountbind3_start_0 )
> lrmd: [6849]: info: rsc:p_fs_mountbind3 start[167] (pid 1364)
> lrmd: [6849]: info: operation monitor[166] on p_fs_mountbind2 for client 6852: pid 1363 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_mountbind2_monitor_10000 (call=166, rc=0, cib-update=266, confirmed=false) ok
> crmd: [6852]: info: match_graph_event: Action p_fs_mountbind2_monitor_10000 (81) confirmed on node2 (rc=0)
> Filesystem[1364]: [1423]: INFO: Running start for /mnt/storage/Mount2/ctl on /var/support/web/data/ctl
> lrmd: [6849]: info: RA output: (p_fs_mountbind3:start:stderr) FATAL: Module scsi_hostadapter not found.
> lrmd: [6849]: info: operation start[167] on p_fs_mountbind3 for client 6852: pid 1364 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_mountbind3_start_0 (call=167, rc=0, cib-update=267, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_fs_mountbind3_start_0 (82) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 83: monitor p_fs_mountbind3_monitor_10000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=83:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_mountbind3_monitor_10000 )
> lrmd: [6849]: info: rsc:p_fs_mountbind3 monitor[168] (pid 1461)
> crmd: [6852]: info: te_rsc_command: Initiating action 84: start p_fs_varwww_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=84:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_varwww_start_0 )
> lrmd: [6849]: info: rsc:p_fs_varwww start[169] (pid 1463)
> lrmd: [6849]: info: operation monitor[168] on p_fs_mountbind3 for client 6852: pid 1461 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_mountbind3_monitor_10000 (call=168, rc=0, cib-update=268, confirmed=false) ok
> crmd: [6852]: info: match_graph_event: Action p_fs_mountbind3_monitor_10000 (83) confirmed on node2 (rc=0)
> Filesystem[1463]: [1534]: INFO: Running start for /mnt/storage/Mount2/var/www on /var/www
> lrmd: [6849]: info: RA output: (p_fs_varwww:start:stderr) FATAL: Module scsi_hostadapter not found.
> lrmd: [6849]: info: operation start[169] on p_fs_varwww for client 6852: pid 1463 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_varwww_start_0 (call=169, rc=0, cib-update=269, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_fs_varwww_start_0 (84) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 85: monitor p_fs_varwww_monitor_10000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=85:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_fs_varwww_monitor_10000 )
> lrmd: [6849]: info: rsc:p_fs_varwww monitor[170] (pid 1562)
> crmd: [6852]: info: te_rsc_command: Initiating action 86: start p_apache_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=86:25:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_apache_start_0 )
> lrmd: [6849]: info: rsc:p_apache start[171] (pid 1566)
> lrmd: [6849]: info: operation monitor[170] on p_fs_varwww for client 6852: pid 1562 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_fs_varwww_monitor_10000 (call=170, rc=0, cib-update=270, confirmed=false) ok
> crmd: [6852]: info: match_graph_event: Action p_fs_varwww_monitor_10000 (85) confirmed on node2 (rc=0)
> apache[1566]: [1643]: INFO: apache not running
> apache[1566]: [1646]: INFO: waiting for apache /etc/apache2/apache2.conf to come up
> cib: [6848]: info: cib_process_request: Operation complete: op cib_sync_one for section 'all' (origin=node1/node1/(null), version=1.192.14): ok (rc=0)
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node1']//lrm_resource[@id='p_fs_mount2'] (origin=node1/crmd/44, version=1.192.14): ok (rc=0)
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=p_fs_mount2_last_0, magic=0:7;9:8:7:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.192.15) : Resource op removal
> crmd: [6852]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
> crmd: [6852]: info: update_abort_priority: Abort action done superceeded by restart
> cib: [6848]: info: cib_process_request: Operation complete: op cib_sync_one for section 'all' (origin=node1/node1/(null), version=1.192.14): ok (rc=0)
> cib: last message repeated 2 times
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node1']//lrm_resource[@id='p_fs_mount2'] (origin=node1/crmd/45, version=1.192.17): ok (rc=0)
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=p_fs_mount2_last_0, magic=0:7;9:8:7:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.192.17) : Resource op removal
> crmd: [6852]: info: abort_transition_graph: te_update_diff:124 - Triggered transition abort (complete=0, tag=diff, id=(null), magic=NA, cib=1.193.1) : Non-status change
> cib: [6848]: info: cib:diff: - <cib admin_epoch="1" epoch="192" num_updates="17" >
> cib: [6848]: info: cib:diff: - <configuration >
> cib: [6848]: info: cib:diff: - <mount2_config >
> cib: [6848]: info: cib:diff: - <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: - <nvpair value="1327519062" id="cib-bootstrap-options-last-lrm-refresh" />
> cib: [6848]: info: cib:diff: - </cluster_property_set>
> cib: [6848]: info: cib:diff: - </mount2_config>
> cib: [6848]: info: cib:diff: - </configuration>
> cib: [6848]: info: cib:diff: - </cib>
> cib: [6848]: info: cib:diff: + <cib epoch="193" num_updates="1" admin_epoch="1" validate-with="pacemaker-1.2" mount2_feature_set="3.0.5" update-origin="node2" update-client="crmd" cib-last-written="Wed Jan 25 13:17:42 2012" have-quorum="1" dc-uuid="039e53da-dce8-4fd7-84bc-7261682529e8" >
> cib: [6848]: info: cib:diff: + <configuration >
> cib: [6848]: info: cib:diff: + <mount2_config >
> cib: [6848]: info: cib:diff: + <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: + <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1327519064" />
> cib: [6848]: info: cib:diff: + </cluster_property_set>
> cib: [6848]: info: cib:diff: + </mount2_config>
> cib: [6848]: info: cib:diff: + </configuration>
> cib: [6848]: info: cib:diff: + </cib>
> cib: [6848]: info: cib_process_request: Operation complete: op cib_modify for section mount2_config (origin=node1/crmd/47, version=1.193.1): ok (rc=0)
> crmd: [6852]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
> crmd: [6852]: info: config_query_callback: Checking for expired actions every 900000ms
> lrmd: [6849]: info: operation start[171] on p_apache for client 6852: pid 1566 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_apache_start_0 (call=171, rc=0, cib-update=272, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_apache_start_0 (86) confirmed on node2 (rc=0)
> crmd: [6852]: info: run_graph: ====================================================
> crmd: [6852]: notice: run_graph: Transition 25 (Complete=19, Pending=0, Fired=0, Skipped=14, Incomplete=0, Source=/var/lib/pengine/pe-input-1007.bz2): Stopped
> crmd: [6852]: info: te_graph_trigger: Transition 25 is now complete
> crmd: [6852]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
> crmd: [6852]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> crmd: [6852]: info: do_pe_invoke: Query 273: Requesting the current CIB: S_POLICY_ENGINE
> crmd: [6852]: info: do_pe_invoke_callback: Invoking the PE: query=273, ref=pe_calc-dc-1327519065-230, seq=4, quorate=1
> pengine: [29654]: notice: unpack_config: On loss of CCM Quorum: Ignore
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node2
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node2
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_drbd_mount2:1_monitor_30000 on node2: not running (7)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_drbd_mount1:0_last_failure_0 found resource p_drbd_mount1:0 active on node2
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount1 can fail 999999 more times on node2 before being forced off
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount1 can fail 999999 more times on node2 before being forced off
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount2 can fail 999997 more times on node2 before being forced off
> pengine: last message repeated 2 times
> pengine: [29654]: notice: can_be_master: Forcing unmanaged master p_drbd_mount2:0 to remain promoted on node2
> pengine: [29654]: notice: can_be_master: Forcing unmanaged master p_drbd_mount1:0 to remain promoted on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (40s) for p_apache on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_lsb_smb on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_lsb_nmb on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (30s) for p_lsb_nfsserver on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_exportfs_mount1 on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_exportfs_mount2 on node2
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:0#011(Master unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:1#011(Stopped unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:0#011(Master unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:1#011(Stopped unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount1#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount2#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_ip_nfs#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mountbind1#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mountbind2#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mountbind3#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_varwww#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_apache#011(Started node2)
> pengine: [29654]: notice: LogActions: Start p_lsb_smb#011(node2)
> pengine: [29654]: notice: LogActions: Start p_lsb_nmb#011(node2)
> pengine: [29654]: notice: LogActions: Start p_lsb_nfsserver#011(node2)
> pengine: [29654]: notice: LogActions: Start p_exportfs_mount1#011(node2)
> pengine: [29654]: notice: LogActions: Start p_exportfs_mount2#011(node2)
> crmd: [6852]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> crmd: [6852]: info: unpack_graph: Unpacked transition 26: 16 actions in 16 synapses
> crmd: [6852]: info: do_te_invoke: Processing graph 26 (ref=pe_calc-dc-1327519065-230) derived from /var/lib/pengine/pe-input-1008.bz2
> crmd: [6852]: info: te_rsc_command: Initiating action 13: monitor p_fs_mount2_monitor_0 on node1
> crmd: [6852]: info: te_rsc_command: Initiating action 94: monitor p_apache_monitor_40000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=94:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_apache_monitor_40000 )
> lrmd: [6849]: info: rsc:p_apache monitor[172] (pid 1669)
> crmd: [6852]: info: te_pseudo_action: Pseudo action 109 fired and confirmed
> crmd: [6852]: info: te_rsc_command: Initiating action 99: start p_lsb_smb_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=99:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_lsb_smb_start_0 )
> lrmd: [6849]: info: rsc:p_lsb_smb start[173] (pid 1670)
> lrmd: [1670]: WARN: For LSB init script, no additional parameters are needed.
> lrmd: [6849]: info: RA output: (p_lsb_smb:start:stdout) Rather than invoking init scripts through /etc/init.d, use the service(8)
> lrmd: [6849]: info: RA output: (p_lsb_smb:start:stdout) utility, e.g. service smbd start
> lrmd: [6849]: info: RA output: (p_lsb_smb:start:stdout)
> lrmd: [6849]: info: RA output: (p_lsb_smb:start:stdout) Since the script you are attempting to invoke has been converted to an
> lrmd: [6849]: info: RA output: (p_lsb_smb:start:stdout) Upstart job, you may also use the start(8) utility, e.g. start smbd
> lrmd: [6849]: info: RA output: (p_lsb_smb:start:stdout) smbd start/running, process 1690
> lrmd: [6849]: info: operation start[173] on p_lsb_smb for client 6852: pid 1670 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_lsb_smb_start_0 (call=173, rc=0, cib-update=274, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_lsb_smb_start_0 (99) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 100: monitor p_lsb_smb_monitor_30000 on node2 (local)
> lrmd: [6849]: info: rsc:p_lsb_smb monitor[174] (pid 1697)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=100:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_lsb_smb_monitor_30000 )
> crmd: [6852]: info: te_rsc_command: Initiating action 101: start p_lsb_nmb_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=101:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_lsb_nmb_start_0 )
> lrmd: [6849]: info: rsc:p_lsb_nmb start[175] (pid 1699)
> lrmd: [1699]: WARN: For LSB init script, no additional parameters are needed.
> lrmd: [6849]: info: operation monitor[174] on p_lsb_smb for client 6852: pid 1697 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_lsb_smb_monitor_30000 (call=174, rc=0, cib-update=275, confirmed=false) ok
> crmd: [6852]: info: match_graph_event: Action p_lsb_smb_monitor_30000 (100) confirmed on node2 (rc=0)
> lrmd: [6849]: info: operation monitor[172] on p_apache for client 6852: pid 1669 exited with return code 0
> lrmd: [6849]: info: RA output: (p_lsb_nmb:start:stdout) Rather than invoking init scripts through /etc/init.d, use the service(8)#012utility, e.g. service nmbd start#012#012Since the script you are attempting to invoke has been converted to an#012Upstart job, you may also use the start(8) utility, e.g. start nmbd
> crmd: [6852]: info: process_lrm_event: LRM operation p_apache_monitor_40000 (call=172, rc=0, cib-update=276, confirmed=false) ok
> crmd: [6852]: info: match_graph_event: Action p_apache_monitor_40000 (94) confirmed on node2 (rc=0)
> pengine: [29654]: notice: process_pe_message: Transition 26: PEngine Input stored in: /var/lib/pengine/pe-input-1008.bz2
> smbd[1690]: [2012/01/25 13:17:45, 0] smbd/server.c:1115(main)
> lrmd: [6849]: info: RA output: (p_lsb_nmb:start:stdout) nmbd start/running, process 1730
> lrmd: [6849]: info: operation start[175] on p_lsb_nmb for client 6852: pid 1699 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_lsb_nmb_start_0 (call=175, rc=0, cib-update=277, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_lsb_nmb_start_0 (101) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 102: monitor p_lsb_nmb_monitor_30000 on node2 (local)
> lrmd: [6849]: info: rsc:p_lsb_nmb monitor[176] (pid 1731)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=102:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_lsb_nmb_monitor_30000 )
> crmd: [6852]: info: te_rsc_command: Initiating action 103: start p_lsb_nfsserver_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=103:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_lsb_nfsserver_start_0 )
> smbd[1690]: standard input is not a socket, assuming -D option
> lrmd: [6849]: info: rsc:p_lsb_nfsserver start[177] (pid 1732)
> lrmd: [1732]: WARN: For LSB init script, no additional parameters are needed.
> lrmd: [6849]: info: operation monitor[176] on p_lsb_nmb for client 6852: pid 1731 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_lsb_nmb_monitor_30000 (call=176, rc=0, cib-update=278, confirmed=false) ok
> crmd: [6852]: info: match_graph_event: Action p_lsb_nmb_monitor_30000 (102) confirmed on node2 (rc=0)
> lrmd: [6849]: info: RA output: (p_lsb_nfsserver:start:stdout) * Exporting directories for NFS kernel daemon...
> lrmd: [6849]: info: RA output: (p_lsb_nfsserver:start:stdout) ...done.#012 * Starting NFS kernel daemon
> kernel: [433321.967109] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
> kernel: [433321.967129] NFSD: unable to find recovery directory /var/lib/nfs/v4recovery
> kernel: [433321.967142] NFSD: starting 10-second grace period
> lrmd: [6849]: info: RA output: (p_lsb_nfsserver:start:stdout) ...done.
> lrmd: [6849]: info: operation start[177] on p_lsb_nfsserver for client 6852: pid 1732 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_lsb_nfsserver_start_0 (call=177, rc=0, cib-update=279, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_lsb_nfsserver_start_0 (103) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 104: monitor p_lsb_nfsserver_monitor_30000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=104:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_lsb_nfsserver_monitor_30000 )
> lrmd: [6849]: info: rsc:p_lsb_nfsserver monitor[178] (pid 1755)
> crmd: [6852]: info: te_rsc_command: Initiating action 105: start p_exportfs_mount1_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=105:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_exportfs_mount1_start_0 )
> lrmd: [6849]: info: rsc:p_exportfs_mount1 start[179] (pid 1757)
> exportfs[1757]: [1762]: INFO: Directory /mnt/storage/Mount2 is not exported to 192.168.0.0/255.255.255.0 (stopped).
> lrmd: [6849]: info: operation monitor[178] on p_lsb_nfsserver for client 6852: pid 1755 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_lsb_nfsserver_monitor_30000 (call=178, rc=0, cib-update=280, confirmed=false) ok
> exportfs[1757]: [1764]: INFO: Exporting file system ...
> crmd: [6852]: info: match_graph_event: Action p_lsb_nfsserver_monitor_30000 (104) confirmed on node2 (rc=0)
> exportfs[1757]: [1771]: INFO: exporting 192.168.0.0/255.255.255.0:/mnt/storage/Mount2
> exportfs[1757]: [1775]: INFO: File system exported
> lrmd: [6849]: info: operation start[179] on p_exportfs_mount1 for client 6852: pid 1757 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_exportfs_mount1_start_0 (call=179, rc=0, cib-update=281, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_exportfs_mount1_start_0 (105) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 106: monitor p_exportfs_mount1_monitor_10000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=106:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_exportfs_mount1_monitor_10000 )
> lrmd: [6849]: info: rsc:p_exportfs_mount1 monitor[180] (pid 1780)
> crmd: [6852]: info: te_rsc_command: Initiating action 107: start p_exportfs_mount2_start_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=107:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_exportfs_mount2_start_0 )
> lrmd: [6849]: info: rsc:p_exportfs_mount2 start[181] (pid 1781)
> exportfs[1781]: [1791]: INFO: Directory /mnt/storage/Mount1 is not exported to 192.168.0.0/255.255.255.0 (stopped).
> exportfs[1780]: [1790]: INFO: Directory /mnt/storage/Mount2 is exported to 192.168.0.0/255.255.255.0 (started).
> exportfs[1781]: [1793]: INFO: Exporting file system ...
> exportfs[1781]: [1801]: INFO: exporting 192.168.0.0/255.255.255.0:/mnt/storage/Mount1
> exportfs[1781]: [1805]: INFO: File system exported
> lrmd: [6849]: info: operation start[181] on p_exportfs_mount2 for client 6852: pid 1781 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_exportfs_mount2_start_0 (call=181, rc=0, cib-update=282, confirmed=true) ok
> crmd: [6852]: info: match_graph_event: Action p_exportfs_mount2_start_0 (107) confirmed on node2 (rc=0)
> crmd: [6852]: info: te_pseudo_action: Pseudo action 110 fired and confirmed
> crmd: [6852]: info: te_rsc_command: Initiating action 108: monitor p_exportfs_mount2_monitor_10000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=108:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_exportfs_mount2_monitor_10000 )
> lrmd: [6849]: info: rsc:p_exportfs_mount2 monitor[182] (pid 1806)
> exportfs[1806]: [1811]: INFO: Directory /mnt/storage/Mount1 is exported to 192.168.0.0/255.255.255.0 (started).
> crmd: [6852]: info: match_graph_event: Action p_fs_mount2_monitor_0 (13) confirmed on node1 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 12: probe_complete probe_complete on node1 - no waiting
> crmd: [6852]: info: te_pseudo_action: Pseudo action 11 fired and confirmed

> attrd: [6851]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-p_exportfs_mount2 (<null>)
> attrd: [6851]: notice: attrd_perform_update: Sent delete 648: node=039e53da-dce8-4fd7-84bc-7261682529e8, attr=fail-count-p_exportfs_mount2, id=<n/a>, set=(null), section=status
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node2']//lrm_resource[@id='p_exportfs_mount2'] (origin=local/crmd/294, version=1.193.21): ok (rc=0)
> crmd: [6852]: info: delete_resource: Removing resource p_exportfs_mount2 for 5885_mount2_resource (internal) on node2
> crmd: [6852]: info: notify_deleted: Notifying 5885_mount2_resource on node2 that p_exportfs_mount2 was deleted
> crmd: [6852]: WARN: decode_transition_key: Bad UUID (mount2-resource-5885) in sscanf result (3) for 0:0:mount2-resource-5885
> crmd: [6852]: info: send_direct_ack: ACK'ing resource op p_exportfs_mount2_delete_60000 from 0:0:mount2-resource-5885: lrm_invoke-lrmd-1327519232-253
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node2']//lrm_resource[@id='p_exportfs_mount2'] (origin=local/crmd/295, version=1.193.22): ok (rc=0)
> crmd: [6852]: info: abort_transition_graph: te_update_diff:178 - Triggered transition abort (complete=1, tag=transient_attributes, id=039e53da-dce8-4fd7-84bc-7261682529e8, magic=NA, cib=1.193.21) : Transient attribute: removal
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=p_exportfs_mount2_last_0, magic=0:0;107:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.193.22) : Resource op removal
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=p_exportfs_mount2_last_0, magic=0:0;107:26:0:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.193.22) : Resource op removal
> crmd: [6852]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> crmd: [6852]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> crmd: [6852]: info: do_pe_invoke: Query 298: Requesting the current CIB: S_POLICY_ENGINE
> crmd: [6852]: info: do_pe_invoke: Query 299: Requesting the current CIB: S_POLICY_ENGINE
> crmd: [6852]: info: do_pe_invoke: Query 300: Requesting the current CIB: S_POLICY_ENGINE
> cib: [6848]: info: cib:diff: - <cib admin_epoch="1" epoch="193" num_updates="22" >
> cib: [6848]: info: cib:diff: - <configuration >
> crmd: [6852]: info: abort_transition_graph: te_update_diff:124 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=1.194.1) : Non-status change
> cib: [6848]: info: cib:diff: - <mount2_config >
> crmd: [6852]: info: do_pe_invoke: Query 301: Requesting the current CIB: S_POLICY_ENGINE
> cib: [6848]: info: cib:diff: - <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: - <nvpair value="1327519064" id="cib-bootstrap-options-last-lrm-refresh" />
> cib: [6848]: info: cib:diff: - </cluster_property_set>
> cib: [6848]: info: cib:diff: - </mount2_config>
> cib: [6848]: info: cib:diff: - </configuration>
> cib: [6848]: info: cib:diff: - </cib>
> cib: [6848]: info: cib:diff: + <cib epoch="194" num_updates="1" admin_epoch="1" validate-with="pacemaker-1.2" mount2_feature_set="3.0.5" update-origin="node1" update-client="crmd" cib-last-written="Wed Jan 25 13:17:45 2012" have-quorum="1" dc-uuid="039e53da-dce8-4fd7-84bc-7261682529e8" >
> cib: [6848]: info: cib:diff: + <configuration >
> cib: [6848]: info: cib:diff: + <mount2_config >
> cib: [6848]: info: cib:diff: + <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: + <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1327519232" />
> cib: [6848]: info: cib:diff: + </cluster_property_set>
> cib: [6848]: info: cib:diff: + </mount2_config>
> cib: [6848]: info: cib:diff: + </configuration>
> cib: [6848]: info: cib:diff: + </cib>
> cib: [6848]: info: cib_process_request: Operation complete: op cib_modify for section mount2_config (origin=local/crmd/297, version=1.194.1): ok (rc=0)
> pengine: [29654]: notice: unpack_config: On loss of CCM Quorum: Ignore
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node2
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node2
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_drbd_mount2:1_monitor_30000 on node2: not running (7)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_drbd_mount1:0_last_failure_0 found resource p_drbd_mount1:0 active on node2
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount1 can fail 999999 more times on node2 before being forced off
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount1 can fail 999999 more times on node2 before being forced off
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount2 can fail 999997 more times on node2 before being forced off
> pengine: last message repeated 2 times
> pengine: [29654]: notice: common_apply_stickiness: p_exportfs_mount1 can fail 999999 more times on node2 before being forced off
> pengine: [29654]: notice: can_be_master: Forcing unmanaged master p_drbd_mount2:0 to remain promoted on node2
> crmd: [6852]: info: do_pe_invoke_callback: Invoking the PE: query=301, ref=pe_calc-dc-1327519232-254, seq=4, quorate=1
> crmd: [6852]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
> pengine: [29654]: notice: can_be_master: Forcing unmanaged master p_drbd_mount1:0 to remain promoted on node2
> crmd: [6852]: info: config_query_callback: Checking for expired actions every 900000ms
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_exportfs_mount2 on node2
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:0#011(Master unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:1#011(Stopped unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:0#011(Master unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:1#011(Stopped unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount1#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount2#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_ip_nfs#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mountbind1#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mountbind2#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mountbind3#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_varwww#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_apache#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_lsb_smb#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_lsb_nmb#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_lsb_nfsserver#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount1#011(Started node2)
> pengine: [29654]: notice: LogActions: Start p_exportfs_mount2#011(node2)
> crmd: [6852]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> crmd: [6852]: info: unpack_graph: Unpacked transition 34: 7 actions in 7 synapses
> crmd: [6852]: info: do_te_invoke: Processing graph 34 (ref=pe_calc-dc-1327519232-254) derived from /var/lib/pengine/pe-input-1016.bz2
> crmd: [6852]: info: te_pseudo_action: Pseudo action 113 fired and confirmed
> crmd: [6852]: info: te_rsc_command: Initiating action 19: monitor p_exportfs_mount2_monitor_0 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=19:34:7:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_exportfs_mount2_monitor_0 )
> lrmd: [6849]: info: rsc:p_exportfs_mount2 probe[184] (pid 5887)
> exportfs[5887]: [5892]: INFO: Directory /mnt/storage/Mount1 is exported to 192.168.0.0/255.255.255.0s (started).
> pengine: [29654]: notice: process_pe_message: Transition 34: PEngine Input stored in: /var/lib/pengine/pe-input-1016.bz2
> cib: [6848]: info: cib_process_request: Operation complete: op cib_sync_one for section 'all' (origin=node1/node1/(null), version=1.194.1): ok (rc=0)
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=p_exportfs_mount2_last_0, magic=0:7;20:8:7:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.194.2) : Resource op removal
> crmd: [6852]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
> crmd: [6852]: info: update_abort_priority: Abort action done superceeded by restart
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node1']//lrm_resource[@id='p_exportfs_mount2'] (origin=node1/crmd/51, version=1.194.1): ok (rc=0)
> exportfs[5894]: [5899]: INFO: Directory /mnt/storage/Mount2 is exported to 192.168.0.0/255.255.255.0 (started).
> cib: [6848]: info: cib_process_request: Operation complete: op cib_sync_one for section 'all' (origin=node1/node1/(null), version=1.194.1): ok (rc=0)
> crmd: [6852]: info: abort_transition_graph: te_update_diff:291 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=p_exportfs_mount2_last_0, magic=0:7;20:8:7:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.194.2) : Resource op removal
> cib: [6848]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='node1']//lrm_resource[@id='p_exportfs_mount2'] (origin=node1/crmd/52, version=1.194.2): ok (rc=0)
> crmd: [6852]: info: abort_transition_graph: te_update_diff:124 - Triggered transition abort (complete=0, tag=diff, id=(null), magic=NA, cib=1.195.1) : Non-status change
> cib: [6848]: info: cib:diff: - <cib admin_epoch="1" epoch="194" num_updates="3" >
> cib: [6848]: info: cib:diff: - <configuration >
> cib: [6848]: info: cib:diff: - <mount2_config >
> cib: [6848]: info: cib:diff: - <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: - <nvpair value="1327519232" id="cib-bootstrap-options-last-lrm-refresh" />
> cib: [6848]: info: cib:diff: - </cluster_property_set>
> cib: [6848]: info: cib:diff: - </mount2_config>
> cib: [6848]: info: cib:diff: - </configuration>
> cib: [6848]: info: cib:diff: - </cib>
> cib: [6848]: info: cib:diff: + <cib epoch="195" num_updates="1" admin_epoch="1" validate-with="pacemaker-1.2" mount2_feature_set="3.0.5" update-origin="node2" update-client="crmd" cib-last-written="Wed Jan 25 13:20:32 2012" have-quorum="1" dc-uuid="039e53da-dce8-4fd7-84bc-7261682529e8" >
> cib: [6848]: info: cib:diff: + <configuration >
> cib: [6848]: info: cib:diff: + <mount2_config >
> cib: [6848]: info: cib:diff: + <cluster_property_set id="cib-bootstrap-options" >
> cib: [6848]: info: cib:diff: + <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1327519233" />
> cib: [6848]: info: cib:diff: + </cluster_property_set>
> cib: [6848]: info: cib:diff: + </mount2_config>
> cib: [6848]: info: cib:diff: + </configuration>
> cib: [6848]: info: cib:diff: + </cib>
> cib: [6848]: info: cib_process_request: Operation complete: op cib_modify for section mount2_config (origin=node1/crmd/54, version=1.195.1): ok (rc=0)
> crmd: [6852]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
> crmd: [6852]: info: config_query_callback: Checking for expired actions every 900000ms
> lrmd: [6849]: info: operation monitor[184] on p_exportfs_mount2 for client 6852: pid 5887 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_exportfs_mount2_monitor_0 (call=184, rc=0, cib-update=304, confirmed=true) ok
> crmd: [6852]: WARN: status_from_rc: Action 19 (p_exportfs_mount2_monitor_0) on node2 failed (target: 7 vs. rc: 0): Error
> crmd: [6852]: info: abort_transition_graph: match_graph_event:277 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=p_exportfs_mount2_last_failure_0, magic=0:0;19:34:7:94503998-9c03-42e2-adff-57e4cfbc79a2, cib=1.195.3) : Event failed
> crmd: [6852]: info: match_graph_event: Action p_exportfs_mount2_monitor_0 (19) confirmed on node2 (rc=4)
> crmd: [6852]: info: te_rsc_command: Initiating action 18: probe_complete probe_complete on node2 (local) - no waiting
> crmd: [6852]: info: run_graph: ====================================================
> crmd: [6852]: notice: run_graph: Transition 34 (Complete=3, Pending=0, Fired=0, Skipped=4, Incomplete=0, Source=/var/lib/pengine/pe-input-1016.bz2): Stopped
> crmd: [6852]: info: te_graph_trigger: Transition 34 is now complete
> crmd: [6852]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
> crmd: [6852]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> crmd: [6852]: info: do_pe_invoke: Query 305: Requesting the current CIB: S_POLICY_ENGINE
> crmd: [6852]: info: do_pe_invoke_callback: Invoking the PE: query=305, ref=pe_calc-dc-1327519247-257, seq=4, quorate=1
> pengine: [29654]: notice: unpack_config: On loss of CCM Quorum: Ignore
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node1
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_nmb_last_failure_0 found resource p_lsb_nmb active on node2
> pengine: [29654]: notice: unpack_rsc_op: Operation p_lsb_smb_last_failure_0 found resource p_lsb_smb active on node2
> pengine: [29654]: WARN: unpack_rsc_op: Processing failed op p_drbd_mount2:1_monitor_30000 on node2: not running (7)
> pengine: [29654]: notice: unpack_rsc_op: Operation p_drbd_mount1:0_last_failure_0 found resource p_drbd_mount1:0 active on node2
> pengine: [29654]: notice: unpack_rsc_op: Operation p_exportfs_mount2_last_failure_0 found resource p_exportfs_mount2 active on node2
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount1 can fail 999999 more times on node2 before being forced off
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount1 can fail 999999 more times on node2 before being forced off
> pengine: [29654]: notice: common_apply_stickiness: ms_drbd_mount2 can fail 999997 more times on node2 before being forced off
> pengine: last message repeated 2 times
> pengine: [29654]: notice: common_apply_stickiness: p_exportfs_mount1 can fail 999999 more times on node2 before being forced off
> pengine: [29654]: notice: can_be_master: Forcing unmanaged master p_drbd_mount2:0 to remain promoted on node2
> pengine: [29654]: notice: can_be_master: Forcing unmanaged master p_drbd_mount1:0 to remain promoted on node2
> pengine: [29654]: notice: RecurringOp: Start recurring monitor (10s) for p_exportfs_mount2 on node2
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:0#011(Master unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount1:1#011(Stopped unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:0#011(Master unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_drbd_mount2:1#011(Stopped unmanaged)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount1#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mount2#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_ip_nfs#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mountbind1#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mountbind2#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_mountbind3#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_fs_varwww#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_apache#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_lsb_smb#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_lsb_nmb#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_lsb_nfsserver#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount1#011(Started node2)
> pengine: [29654]: notice: LogActions: Leave p_exportfs_mount2#011(Started node2)
> crmd: [6852]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> crmd: [6852]: info: unpack_graph: Unpacked transition 35: 4 actions in 4 synapses
> crmd: [6852]: info: do_te_invoke: Processing graph 35 (ref=pe_calc-dc-1327519247-257) derived from /var/lib/pengine/pe-input-1017.bz2
> crmd: [6852]: info: te_rsc_command: Initiating action 113: monitor p_exportfs_mount2_monitor_10000 on node2 (local)
> crmd: [6852]: info: do_lrm_rsc_op: Performing key=113:35:0:94503998-9c03-42e2-adff-57e4cfbc79a2 op=p_exportfs_mount2_monitor_10000 )
> lrmd: [6849]: info: rsc:p_exportfs_mount2 monitor[185] (pid 6376)
> crmd: [6852]: info: te_rsc_command: Initiating action 18: monitor p_exportfs_mount2_monitor_0 on node1
> exportfs[6376]: [6381]: INFO: Directory /mnt/storage/Mount1 is exported to 192.168.0.0/255.255.255.0 (started).
> pengine: [29654]: notice: process_pe_message: Transition 35: PEngine Input stored in: /var/lib/pengine/pe-input-1017.bz2
> crmd: [6852]: info: match_graph_event: Action p_exportfs_mount2_monitor_0 (18) confirmed on node1 (rc=0)
> crmd: [6852]: info: te_rsc_command: Initiating action 17: probe_complete probe_complete on node1 - no waiting
> crmd: [6852]: info: te_pseudo_action: Pseudo action 16 fired and confirmed
> exportfs[6603]: [6608]: INFO: Directory /mnt/storage/Mount2 is exported to 192.168.0.0/255.255.255.0 (started).
> lrmd: [6849]: info: operation monitor[185] on p_exportfs_mount2 for client 6852: pid 6376 exited with return code 0
> crmd: [6852]: info: process_lrm_event: LRM operation p_exportfs_mount2_monitor_10000 (call=185, rc=0, cib-update=306, confirmed=false) ok
> crmd: [6852]: info: match_graph_event: Action p_exportfs_mount2_monitor_10000 (113) confirmed on node2 (rc=0)
> crmd: [6852]: info: run_graph: ====================================================
> crmd: [6852]: notice: run_graph: Transition 35 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-1017.bz2): Complete
> crmd: [6852]: info: te_graph_trigger: Transition 35 is now complete
> crmd: [6852]: info: notify_crmd: Transition 35 status: done - <null>
> crmd: [6852]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> crmd: [6852]: info: do_state_transition: Starting PEngine Recheck Timer
> exportfs[6909]: [6914]: INFO: Directory /mnt/storage/Mount1 is exported to 192.168.0.0/255.255.255.0 (started).
> exportfs[7358]: [7363]: INFO: Directory /mnt/storage/Mount2 is exported to 192.168.0.0/255.255.255.0 (started).
> exportfs[7424]: [7429]: INFO: Directory /mnt/storage/Mount1 is exported to 192.168.0.0/255.255.255.0 (started).
> ntpd[1213]: Listening on interface #10 eth0, 10.52.4.8#123 Enabled
> ntpd[1213]: new interface(s) found: waking up resolver

> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Attachments: ms_drbd_mount1.log (7.13 KB)


andreas at hastexo

Feb 1, 2012, 2:19 PM

Post #4 of 16 (1812 views)
Permalink
Re: Nodes will not promote DRBD resources to master on failover [In reply to]

On 01/25/2012 08:58 PM, Andrew Martin wrote:
> Hello,
>
> Recently I finished configuring a two-node cluster with pacemaker 1.1.6
> and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster includes
> the following resources:
> - primitives for DRBD storage devices
> - primitives for mounting the filesystem on the DRBD storage
> - primitives for some mount binds
> - primitive for starting apache
> - primitives for starting samba and nfs servers (following instructions
> here <http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf>)
> - primitives for exporting nfs shares (ocf:heartbeat:exportfs)

not enough information ... please share at least your complete cluster
configuration

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

>
> Perhaps this is best described through the output of crm_mon:
> Online: [ node1 node2 ]
>
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
> p_drbd_mount1:0 (ocf::linbit:drbd): Started node2 (unmanaged)
> p_drbd_mount1:1 (ocf::linbit:drbd): Started node1
> (unmanaged) FAILED
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> p_drbd_mount2:0 (ocf::linbit:drbd): Master node1
> (unmanaged) FAILED
> Slaves: [ node2 ]
> Resource Group: g_core
> p_fs_mount1 (ocf::heartbeat:Filesystem): Started node1
> p_fs_mount2 (ocf::heartbeat:Filesystem): Started node1
> p_ip_nfs (ocf::heartbeat:IPaddr2): Started node1
> Resource Group: g_apache
> p_fs_mountbind1 (ocf::heartbeat:Filesystem): Started node1
> p_fs_mountbind2 (ocf::heartbeat:Filesystem): Started node1
> p_fs_mountbind3 (ocf::heartbeat:Filesystem): Started node1
> p_fs_varwww (ocf::heartbeat:Filesystem): Started node1
> p_apache (ocf::heartbeat:apache): Started node1
> Resource Group: g_fileservers
> p_lsb_smb (lsb:smbd): Started node1
> p_lsb_nmb (lsb:nmbd): Started node1
> p_lsb_nfsserver (lsb:nfs-kernel-server): Started node1
> p_exportfs_mount1 (ocf::heartbeat:exportfs): Started node1
> p_exportfs_mount2 (ocf::heartbeat:exportfs): Started node1
>
> I have read through the Pacemaker Explained
> <http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained>
> documentation, however could not find a way to further debug these
> problems. First, I put node1 into standby mode to attempt failover to
> the other node (node2). Node2 appeared to start the transition to
> master, however it failed to promote the DRBD resources to master (the
> first step). I have attached a copy of this session in commands.log and
> additional excerpts from /var/log/syslog during important steps. I have
> attempted everything I can think of to try and start the DRBD resource
> (e.g. start/stop/promote/manage/cleanup under crm resource, restarting
> heartbeat) but cannot bring it out of the slave state. However, if I set
> it to unmanaged and then run drbdadm primary all in the terminal,
> pacemaker is satisfied and continues starting the rest of the resources.
> It then failed when attempting to mount the filesystem for mount2, the
> p_fs_mount2 resource. I attempted to mount the filesystem myself and was
> successful. I then unmounted it and ran cleanup on p_fs_mount2 and then
> it mounted. The rest of the resources started as expected until the
> p_exportfs_mount2 resource, which failed as follows:
> p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2
> (unmanaged) FAILED
>
> I ran cleanup on this and it started, however when running this test
> earlier today no command could successfully start this exportfs resource.
>
> How can I configure pacemaker to better resolve these problems and be
> able to bring the node up successfully on its own? What can I check to
> determine why these failures are occuring? /var/log/syslog did not seem
> to contain very much useful information regarding why the failures occurred.
>
> Thanks,
>
> Andrew
>
>
>
>
> This body part will be downloaded on demand.
Attachments: signature.asc (0.22 KB)


amartin at xes-inc

Mar 24, 2012, 12:15 PM

Post #5 of 16 (1692 views)
Permalink
Re: Nodes will not promote DRBD resources to master on failover [In reply to]

Hi Andreas,


My complete cluster configuration is as follows:

============
Last updated: Sat Mar 24 13:51:55 2012
Last change: Sat Mar 24 13:41:55 2012
Stack: Heartbeat
Current DC: node2 (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
3 Nodes configured, unknown expected votes
19 Resources configured.
============


Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
Online: [ node2 node1 ]


Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
Masters: [ node2 ]
Slaves: [ node1 ]
Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
Masters: [ node2 ]
Slaves: [ node1 ]
Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
Masters: [ node2 ]
Slaves: [ node1 ]
Resource Group: g_vm
p_fs_vmstore (ocf::heartbeat:Filesystem): Started node2
p_vm (ocf::heartbeat:VirtualDomain): Started node2
Clone Set: cl_daemons [g_daemons]
Started: [ node2 node1 ]
Stopped: [ g_daemons:2 ]
Clone Set: cl_sysadmin_notify [p_sysadmin_notify]
Started: [ node2 node1 ]
Stopped: [ p_sysadmin_notify:2 ]
stonith-node1 (stonith:external/tripplitepdu): Started node2
stonith-node2 (stonith:external/tripplitepdu): Started node1
Clone Set: cl_ping [p_ping]
Started: [ node2 node1 ]
Stopped: [ p_ping:2 ]



node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \
attributes standby="off"
node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \
attributes standby="off"
node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \
attributes standby="on"
primitive p_drbd_mount2 ocf:linbit:drbd \
params drbd_resource="mount2" \
op monitor interval="15" role="Master" \
op monitor interval="30" role="Slave"
primitive p_drbd_mount1 ocf:linbit:drbd \
params drbd_resource="mount1" \
op monitor interval="15" role="Master" \
op monitor interval="30" role="Slave"
primitive p_drbd_vmstore ocf:linbit:drbd \
params drbd_resource="vmstore" \
op monitor interval="15" role="Master" \
op monitor interval="30" role="Slave"
primitive p_fs_vmstore ocf:heartbeat:Filesystem \
params device="/dev/drbd0" directory="/vmstore" fstype="ext4" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s" \
op monitor interval="20s" timeout="40s"
primitive p_libvirt-bin upstart:libvirt-bin \
op monitor interval="30"
primitive p_ping ocf:pacemaker:ping \
params name="p_ping" host_list="192.168.1.10 192.168.1.11" multiplier="1000" \
op monitor interval="20s"
primitive p_sysadmin_notify ocf:heartbeat:MailTo \
params email="me [at] example" \
params subject="Pacemaker Change" \
op start interval="0" timeout="10" \
op stop interval="0" timeout="10" \
op monitor interval="10" timeout="10"
primitive p_vm ocf:heartbeat:VirtualDomain \
params config="/vmstore/config/vm.xml" \
meta allow-migrate="false" \
op start interval="0" timeout="120s" \
op stop interval="0" timeout="120s" \
op monitor interval="10" timeout="30"
primitive stonith-node1 stonith:external/tripplitepdu \
params pdu_ipaddr="192.168.1.12" pdu_port="1" pdu_username="xxx" pdu_password="xxx" hostname_to_stonith="node1"
primitive stonith-node2 stonith:external/tripplitepdu \
params pdu_ipaddr="192.168.1.12" pdu_port="2" pdu_username="xxx" pdu_password="xxx" hostname_to_stonith="node2"
group g_daemons p_libvirt-bin
group g_vm p_fs_vmstore p_vm
ms ms_drbd_mount2 p_drbd_mount2 \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
ms ms_drbd_mount1 p_drbd_mount1 \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
ms ms_drbd_vmstore p_drbd_vmstore \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
clone cl_daemons g_daemons
clone cl_ping p_ping \
meta interleave="true"
clone cl_sysadmin_notify p_sysadmin_notify
location l-st-node1 stonith-node1 -inf: node1
location l-st-node2 stonith-node2 -inf: node2
location l_run_on_most_connected p_vm \
rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping
colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote ms_drbd_mount2:promote cl_daemons:start g_vm:start
property $id="cib-bootstrap-options" \
dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
cluster-infrastructure="Heartbeat" \
stonith-enabled="false" \
no-quorum-policy="stop" \
last-lrm-refresh="1332539900" \
cluster-recheck-interval="5m" \
crmd-integration-timeout="3m" \
shutdown-escalation="5m"


The STONITH plugin is a custom plugin I wrote for the Tripp-Lite PDUMH20ATNET that I'm using as the STONITH device:
http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf


As you can see, I left the DRBD service to be started by the operating system (as an lsb script at boot time) however Pacemaker controls actually bringing up/taking down the individual DRBD devices. The behavior I observe is as follows: I issue "crm resource migrate p_vm" on node1 and failover successfully to node2. During this time, node2 fences node1's DRBD devices (using dopd) and marks them as Outdated. Meanwhile node2's DRBD devices are UpToDate. I then shutdown both nodes and then bring them back up. They reconnect to the cluster (with quorum), and node1's DRBD devices are still Outdated as expected and node2's DRBD devices are still UpToDate, as expected. At this point, DRBD starts on both nodes, however node2 will not set DRBD as master:

Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
Online: [ node2 node1 ]


Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
Slaves: [ node1 node2 ]
Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
Slaves: [ node1 node 2 ]
Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
Slaves: [ node1 node2 ]


I am having trouble sorting through the logging information because there is so much of it in /var/log/daemon.log, but I can't find an error message printed about why it will not promote node2. At this point the DRBD devices are as follows:
node2: cstate = WFConnection dstate=UpToDate
node1: cstate = StandAlone dstate=Outdated


I don't see any reason why node2 can't become DRBD master, or am I missing something? If I do "drbdadm connect all" on node1, then the cstate on both nodes changes to "Connected" and node2 immediately promotes the DRBD resources to master. Any ideas on why I'm observing this incorrect behavior?


Any tips on how I can better filter through the pacemaker/heartbeat logs or how to get additional useful debug information?


Thanks,


Andrew
----- Original Message -----

From: "Andreas Kurz" <andreas [at] hastexo>
To: pacemaker [at] oss
Sent: Wednesday, 1 February, 2012 4:19:25 PM
Subject: Re: [Pacemaker] Nodes will not promote DRBD resources to master on failover

On 01/25/2012 08:58 PM, Andrew Martin wrote:
> Hello,
>
> Recently I finished configuring a two-node cluster with pacemaker 1.1.6
> and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster includes
> the following resources:
> - primitives for DRBD storage devices
> - primitives for mounting the filesystem on the DRBD storage
> - primitives for some mount binds
> - primitive for starting apache
> - primitives for starting samba and nfs servers (following instructions
> here <http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf>)
> - primitives for exporting nfs shares (ocf:heartbeat:exportfs)

not enough information ... please share at least your complete cluster
configuration

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

>
> Perhaps this is best described through the output of crm_mon:
> Online: [ node1 node2 ]
>
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
> p_drbd_mount1:0 (ocf::linbit:drbd): Started node2 (unmanaged)
> p_drbd_mount1:1 (ocf::linbit:drbd): Started node1
> (unmanaged) FAILED
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> p_drbd_mount2:0 (ocf::linbit:drbd): Master node1
> (unmanaged) FAILED
> Slaves: [ node2 ]
> Resource Group: g_core
> p_fs_mount1 (ocf::heartbeat:Filesystem): Started node1
> p_fs_mount2 (ocf::heartbeat:Filesystem): Started node1
> p_ip_nfs (ocf::heartbeat:IPaddr2): Started node1
> Resource Group: g_apache
> p_fs_mountbind1 (ocf::heartbeat:Filesystem): Started node1
> p_fs_mountbind2 (ocf::heartbeat:Filesystem): Started node1
> p_fs_mountbind3 (ocf::heartbeat:Filesystem): Started node1
> p_fs_varwww (ocf::heartbeat:Filesystem): Started node1
> p_apache (ocf::heartbeat:apache): Started node1
> Resource Group: g_fileservers
> p_lsb_smb (lsb:smbd): Started node1
> p_lsb_nmb (lsb:nmbd): Started node1
> p_lsb_nfsserver (lsb:nfs-kernel-server): Started node1
> p_exportfs_mount1 (ocf::heartbeat:exportfs): Started node1
> p_exportfs_mount2 (ocf::heartbeat:exportfs): Started node1
>
> I have read through the Pacemaker Explained
> <http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained>
> documentation, however could not find a way to further debug these
> problems. First, I put node1 into standby mode to attempt failover to
> the other node (node2). Node2 appeared to start the transition to
> master, however it failed to promote the DRBD resources to master (the
> first step). I have attached a copy of this session in commands.log and
> additional excerpts from /var/log/syslog during important steps. I have
> attempted everything I can think of to try and start the DRBD resource
> (e.g. start/stop/promote/manage/cleanup under crm resource, restarting
> heartbeat) but cannot bring it out of the slave state. However, if I set
> it to unmanaged and then run drbdadm primary all in the terminal,
> pacemaker is satisfied and continues starting the rest of the resources.
> It then failed when attempting to mount the filesystem for mount2, the
> p_fs_mount2 resource. I attempted to mount the filesystem myself and was
> successful. I then unmounted it and ran cleanup on p_fs_mount2 and then
> it mounted. The rest of the resources started as expected until the
> p_exportfs_mount2 resource, which failed as follows:
> p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2
> (unmanaged) FAILED
>
> I ran cleanup on this and it started, however when running this test
> earlier today no command could successfully start this exportfs resource.
>
> How can I configure pacemaker to better resolve these problems and be
> able to bring the node up successfully on its own? What can I check to
> determine why these failures are occuring? /var/log/syslog did not seem
> to contain very much useful information regarding why the failures occurred.
>
> Thanks,
>
> Andrew
>
>
>
>
> This body part will be downloaded on demand.





_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


david at davidcoulson

Mar 24, 2012, 12:35 PM

Post #6 of 16 (1691 views)
Permalink
Re: Nodes will not promote DRBD resources to master on failover [In reply to]

Shutdown pacemaker and fix your drbd disk first. Get them both
uptodate/uptodate and make sure you can manually switch them to primary
on each node.

Node2 can't become primary when it's not connected to something with an
uptodate disk.

On 3/24/12 3:15 PM, Andrew Martin wrote:
> Hi Andreas,
>
> My complete cluster configuration is as follows:
> ============
> Last updated: Sat Mar 24 13:51:55 2012
> Last change: Sat Mar 24 13:41:55 2012
> Stack: Heartbeat
> Current DC: node2 (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition
> with quorum
> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
> 3 Nodes configured, unknown expected votes
> 19 Resources configured.
> ============
>
> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
> Online: [ node2 node1 ]
>
> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
> Masters: [ node2 ]
> Slaves: [ node1 ]
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
> Masters: [ node2 ]
> Slaves: [ node1 ]
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> Masters: [ node2 ]
> Slaves: [ node1 ]
> Resource Group: g_vm
> p_fs_vmstore(ocf::heartbeat:Filesystem):Started node2
> p_vm(ocf::heartbeat:VirtualDomain):Started node2
> Clone Set: cl_daemons [g_daemons]
> Started: [ node2 node1 ]
> Stopped: [ g_daemons:2 ]
> Clone Set: cl_sysadmin_notify [p_sysadmin_notify]
> Started: [ node2 node1 ]
> Stopped: [ p_sysadmin_notify:2 ]
> stonith-node1(stonith:external/tripplitepdu):Started node2
> stonith-node2(stonith:external/tripplitepdu):Started node1
> Clone Set: cl_ping [p_ping]
> Started: [ node2 node1 ]
> Stopped: [ p_ping:2 ]
>
> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \
> attributes standby="off"
> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \
> attributes standby="off"
> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \
> attributes standby="on"
> primitive p_drbd_mount2 ocf:linbit:drbd \
> params drbd_resource="mount2" \
> op monitor interval="15" role="Master" \
> op monitor interval="30" role="Slave"
> primitive p_drbd_mount1 ocf:linbit:drbd \
> params drbd_resource="mount1" \
> op monitor interval="15" role="Master" \
> op monitor interval="30" role="Slave"
> primitive p_drbd_vmstore ocf:linbit:drbd \
> params drbd_resource="vmstore" \
> op monitor interval="15" role="Master" \
> op monitor interval="30" role="Slave"
> primitive p_fs_vmstore ocf:heartbeat:Filesystem \
> params device="/dev/drbd0" directory="/vmstore" fstype="ext4" \
> op start interval="0" timeout="60s" \
> op stop interval="0" timeout="60s" \
> op monitor interval="20s" timeout="40s"
> primitive p_libvirt-bin upstart:libvirt-bin \
> op monitor interval="30"
> primitive p_ping ocf:pacemaker:ping \
> params name="p_ping" host_list="192.168.1.10 192.168.1.11"
> multiplier="1000" \
> op monitor interval="20s"
> primitive p_sysadmin_notify ocf:heartbeat:MailTo \
> params email="me [at] example" \
> params subject="Pacemaker Change" \
> op start interval="0" timeout="10" \
> op stop interval="0" timeout="10" \
> op monitor interval="10" timeout="10"
> primitive p_vm ocf:heartbeat:VirtualDomain \
> params config="/vmstore/config/vm.xml" \
> meta allow-migrate="false" \
> op start interval="0" timeout="120s" \
> op stop interval="0" timeout="120s" \
> op monitor interval="10" timeout="30"
> primitive stonith-node1 stonith:external/tripplitepdu \
> params pdu_ipaddr="192.168.1.12" pdu_port="1"
> pdu_username="xxx" pdu_password="xxx" hostname_to_stonith="node1"
> primitive stonith-node2 stonith:external/tripplitepdu \
> params pdu_ipaddr="192.168.1.12" pdu_port="2"
> pdu_username="xxx" pdu_password="xxx" hostname_to_stonith="node2"
> group g_daemons p_libvirt-bin
> group g_vm p_fs_vmstore p_vm
> ms ms_drbd_mount2 p_drbd_mount2 \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> ms ms_drbd_mount1 p_drbd_mount1 \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> ms ms_drbd_vmstore p_drbd_vmstore \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> clone cl_daemons g_daemons
> clone cl_ping p_ping \
> meta interleave="true"
> clone cl_sysadmin_notify p_sysadmin_notify
> location l-st-node1 stonith-node1 -inf: node1
> location l-st-node2 stonith-node2 -inf: node2
> location l_run_on_most_connected p_vm \
> rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping
> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master
> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote
> ms_drbd_mount2:promote cl_daemons:start g_vm:start
> property $id="cib-bootstrap-options" \
> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
> cluster-infrastructure="Heartbeat" \
> stonith-enabled="false" \
> no-quorum-policy="stop" \
> last-lrm-refresh="1332539900" \
> cluster-recheck-interval="5m" \
> crmd-integration-timeout="3m" \
> shutdown-escalation="5m"
>
> The STONITH plugin is a custom plugin I wrote for the Tripp-Lite
> PDUMH20ATNET that I'm using as the STONITH device:
> http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf
>
> As you can see, I left the DRBD service to be started by the operating
> system (as an lsb script at boot time) however Pacemaker controls
> actually bringing up/taking down the individual DRBD devices. The
> behavior I observe is as follows: I issue "crm resource migrate p_vm"
> on node1 and failover successfully to node2. During this time, node2
> fences node1's DRBD devices (using dopd) and marks them as Outdated.
> Meanwhile node2's DRBD devices are UpToDate. I then shutdown both
> nodes and then bring them back up. They reconnect to the cluster (with
> quorum), and node1's DRBD devices are still Outdated as expected and
> node2's DRBD devices are still UpToDate, as expected. At this point,
> DRBD starts on both nodes, however node2 will not set DRBD as master:
> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
> Online: [ node2 node1 ]
>
> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
> Slaves: [ node1 node2 ]
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
> Slaves: [ node1 node 2 ]
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> Slaves: [ node1 node2 ]
>
> I am having trouble sorting through the logging information because
> there is so much of it in /var/log/daemon.log, but I can't find an
> error message printed about why it will not promote node2. At this
> point the DRBD devices are as follows:
> node2: cstate = WFConnection dstate=UpToDate
> node1: cstate = StandAlone dstate=Outdated
>
> I don't see any reason why node2 can't become DRBD master, or am I
> missing something? If I do "drbdadm connect all" on node1, then the
> cstate on both nodes changes to "Connected" and node2 immediately
> promotes the DRBD resources to master. Any ideas on why I'm observing
> this incorrect behavior?
>
> Any tips on how I can better filter through the pacemaker/heartbeat
> logs or how to get additional useful debug information?
>
> Thanks,
>
> Andrew
>
> ------------------------------------------------------------------------
> *From: *"Andreas Kurz" <andreas [at] hastexo>
> *To: *pacemaker [at] oss
> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM
> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
> master on failover
>
> On 01/25/2012 08:58 PM, Andrew Martin wrote:
> > Hello,
> >
> > Recently I finished configuring a two-node cluster with pacemaker 1.1.6
> > and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster includes
> > the following resources:
> > - primitives for DRBD storage devices
> > - primitives for mounting the filesystem on the DRBD storage
> > - primitives for some mount binds
> > - primitive for starting apache
> > - primitives for starting samba and nfs servers (following instructions
> > here <http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf>)
> > - primitives for exporting nfs shares (ocf:heartbeat:exportfs)
>
> not enough information ... please share at least your complete cluster
> configuration
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
> >
> > Perhaps this is best described through the output of crm_mon:
> > Online: [ node1 node2 ]
> >
> > Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
> > p_drbd_mount1:0 (ocf::linbit:drbd): Started node2
> (unmanaged)
> > p_drbd_mount1:1 (ocf::linbit:drbd): Started node1
> > (unmanaged) FAILED
> > Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> > p_drbd_mount2:0 (ocf::linbit:drbd): Master node1
> > (unmanaged) FAILED
> > Slaves: [ node2 ]
> > Resource Group: g_core
> > p_fs_mount1 (ocf::heartbeat:Filesystem): Started node1
> > p_fs_mount2 (ocf::heartbeat:Filesystem): Started node1
> > p_ip_nfs (ocf::heartbeat:IPaddr2): Started node1
> > Resource Group: g_apache
> > p_fs_mountbind1 (ocf::heartbeat:Filesystem): Started node1
> > p_fs_mountbind2 (ocf::heartbeat:Filesystem): Started node1
> > p_fs_mountbind3 (ocf::heartbeat:Filesystem): Started node1
> > p_fs_varwww (ocf::heartbeat:Filesystem): Started node1
> > p_apache (ocf::heartbeat:apache): Started node1
> > Resource Group: g_fileservers
> > p_lsb_smb (lsb:smbd): Started node1
> > p_lsb_nmb (lsb:nmbd): Started node1
> > p_lsb_nfsserver (lsb:nfs-kernel-server): Started node1
> > p_exportfs_mount1 (ocf::heartbeat:exportfs): Started node1
> > p_exportfs_mount2 (ocf::heartbeat:exportfs): Started node1
> >
> > I have read through the Pacemaker Explained
> >
> <http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained>
> > documentation, however could not find a way to further debug these
> > problems. First, I put node1 into standby mode to attempt failover to
> > the other node (node2). Node2 appeared to start the transition to
> > master, however it failed to promote the DRBD resources to master (the
> > first step). I have attached a copy of this session in commands.log and
> > additional excerpts from /var/log/syslog during important steps. I have
> > attempted everything I can think of to try and start the DRBD resource
> > (e.g. start/stop/promote/manage/cleanup under crm resource, restarting
> > heartbeat) but cannot bring it out of the slave state. However, if I set
> > it to unmanaged and then run drbdadm primary all in the terminal,
> > pacemaker is satisfied and continues starting the rest of the resources.
> > It then failed when attempting to mount the filesystem for mount2, the
> > p_fs_mount2 resource. I attempted to mount the filesystem myself and was
> > successful. I then unmounted it and ran cleanup on p_fs_mount2 and then
> > it mounted. The rest of the resources started as expected until the
> > p_exportfs_mount2 resource, which failed as follows:
> > p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2
> > (unmanaged) FAILED
> >
> > I ran cleanup on this and it started, however when running this test
> > earlier today no command could successfully start this exportfs
> resource.
> >
> > How can I configure pacemaker to better resolve these problems and be
> > able to bring the node up successfully on its own? What can I check to
> > determine why these failures are occuring? /var/log/syslog did not seem
> > to contain very much useful information regarding why the failures
> occurred.
> >
> > Thanks,
> >
> > Andrew
> >
> >
> >
> >
> > This body part will be downloaded on demand.
>
>
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


emi2fast at gmail

Mar 24, 2012, 2:31 PM

Post #7 of 16 (1691 views)
Permalink
Re: Nodes will not promote DRBD resources to master on failover [In reply to]

Hello

I think this constrains it's wrong

======================================
colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master
ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
======================================

it must be like this
======================================
colocation c_drbd_libvirt_vm inf: g_vm ms_drbd_vmstore:Master
ms_drbd_mount1:Master ms_drbd_mount2:Master

Il giorno 24 marzo 2012 20:15, Andrew Martin <amartin [at] xes-inc> ha
scritto:

> Hi Andreas,
>
> My complete cluster configuration is as follows:
> ============
> Last updated: Sat Mar 24 13:51:55 2012
> Last change: Sat Mar 24 13:41:55 2012
> Stack: Heartbeat
> Current DC: node2 (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition with
> quorum
> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
> 3 Nodes configured, unknown expected votes
> 19 Resources configured.
> ============
>
> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
> Online: [ node2 node1 ]
>
> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
> Masters: [ node2 ]
> Slaves: [ node1 ]
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
> Masters: [ node2 ]
> Slaves: [ node1 ]
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> Masters: [ node2 ]
> Slaves: [ node1 ]
> Resource Group: g_vm
> p_fs_vmstore (ocf::heartbeat:Filesystem): Started node2
> p_vm (ocf::heartbeat:VirtualDomain): Started node2
> Clone Set: cl_daemons [g_daemons]
> Started: [ node2 node1 ]
> Stopped: [ g_daemons:2 ]
> Clone Set: cl_sysadmin_notify [p_sysadmin_notify]
> Started: [ node2 node1 ]
> Stopped: [ p_sysadmin_notify:2 ]
> stonith-node1 (stonith:external/tripplitepdu): Started node2
> stonith-node2 (stonith:external/tripplitepdu): Started node1
> Clone Set: cl_ping [p_ping]
> Started: [ node2 node1 ]
> Stopped: [ p_ping:2 ]
>
> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \
> attributes standby="off"
> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \
> attributes standby="off"
> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \
> attributes standby="on"
> primitive p_drbd_mount2 ocf:linbit:drbd \
> params drbd_resource="mount2" \
> op monitor interval="15" role="Master" \
> op monitor interval="30" role="Slave"
> primitive p_drbd_mount1 ocf:linbit:drbd \
> params drbd_resource="mount1" \
> op monitor interval="15" role="Master" \
> op monitor interval="30" role="Slave"
> primitive p_drbd_vmstore ocf:linbit:drbd \
> params drbd_resource="vmstore" \
> op monitor interval="15" role="Master" \
> op monitor interval="30" role="Slave"
> primitive p_fs_vmstore ocf:heartbeat:Filesystem \
> params device="/dev/drbd0" directory="/vmstore" fstype="ext4" \
> op start interval="0" timeout="60s" \
> op stop interval="0" timeout="60s" \
> op monitor interval="20s" timeout="40s"
> primitive p_libvirt-bin upstart:libvirt-bin \
> op monitor interval="30"
> primitive p_ping ocf:pacemaker:ping \
> params name="p_ping" host_list="192.168.1.10 192.168.1.11"
> multiplier="1000" \
> op monitor interval="20s"
> primitive p_sysadmin_notify ocf:heartbeat:MailTo \
> params email="me [at] example" \
> params subject="Pacemaker Change" \
> op start interval="0" timeout="10" \
> op stop interval="0" timeout="10" \
> op monitor interval="10" timeout="10"
> primitive p_vm ocf:heartbeat:VirtualDomain \
> params config="/vmstore/config/vm.xml" \
> meta allow-migrate="false" \
> op start interval="0" timeout="120s" \
> op stop interval="0" timeout="120s" \
> op monitor interval="10" timeout="30"
> primitive stonith-node1 stonith:external/tripplitepdu \
> params pdu_ipaddr="192.168.1.12" pdu_port="1" pdu_username="xxx"
> pdu_password="xxx" hostname_to_stonith="node1"
> primitive stonith-node2 stonith:external/tripplitepdu \
> params pdu_ipaddr="192.168.1.12" pdu_port="2" pdu_username="xxx"
> pdu_password="xxx" hostname_to_stonith="node2"
> group g_daemons p_libvirt-bin
> group g_vm p_fs_vmstore p_vm
> ms ms_drbd_mount2 p_drbd_mount2 \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> ms ms_drbd_mount1 p_drbd_mount1 \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> ms ms_drbd_vmstore p_drbd_vmstore \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> clone cl_daemons g_daemons
> clone cl_ping p_ping \
> meta interleave="true"
> clone cl_sysadmin_notify p_sysadmin_notify
> location l-st-node1 stonith-node1 -inf: node1
> location l-st-node2 stonith-node2 -inf: node2
> location l_run_on_most_connected p_vm \
> rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping
> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master
> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote
> ms_drbd_mount2:promote cl_daemons:start g_vm:start
> property $id="cib-bootstrap-options" \
> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
> cluster-infrastructure="Heartbeat" \
> stonith-enabled="false" \
> no-quorum-policy="stop" \
> last-lrm-refresh="1332539900" \
> cluster-recheck-interval="5m" \
> crmd-integration-timeout="3m" \
> shutdown-escalation="5m"
>
> The STONITH plugin is a custom plugin I wrote for the Tripp-Lite
> PDUMH20ATNET that I'm using as the STONITH device:
> http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf
>
> As you can see, I left the DRBD service to be started by the operating
> system (as an lsb script at boot time) however Pacemaker controls actually
> bringing up/taking down the individual DRBD devices. The behavior I observe
> is as follows: I issue "crm resource migrate p_vm" on node1 and failover
> successfully to node2. During this time, node2 fences node1's DRBD devices
> (using dopd) and marks them as Outdated. Meanwhile node2's DRBD devices are
> UpToDate. I then shutdown both nodes and then bring them back up. They
> reconnect to the cluster (with quorum), and node1's DRBD devices are still
> Outdated as expected and node2's DRBD devices are still UpToDate, as
> expected. At this point, DRBD starts on both nodes, however node2 will not
> set DRBD as master:
> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
> Online: [ node2 node1 ]
>
> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
> Slaves: [ node1 node2 ]
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
> Slaves: [ node1 node 2 ]
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> Slaves: [ node1 node2 ]
>
> I am having trouble sorting through the logging information because there
> is so much of it in /var/log/daemon.log, but I can't find an error message
> printed about why it will not promote node2. At this point the DRBD devices
> are as follows:
> node2: cstate = WFConnection dstate=UpToDate
> node1: cstate = StandAlone dstate=Outdated
>
> I don't see any reason why node2 can't become DRBD master, or am I missing
> something? If I do "drbdadm connect all" on node1, then the cstate on both
> nodes changes to "Connected" and node2 immediately promotes the DRBD
> resources to master. Any ideas on why I'm observing this incorrect behavior?
>
> Any tips on how I can better filter through the pacemaker/heartbeat logs
> or how to get additional useful debug information?
>
> Thanks,
>
> Andrew
>
> ------------------------------
> *From: *"Andreas Kurz" <andreas [at] hastexo>
> *To: *pacemaker [at] oss
> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM
> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
> master on failover
>
> On 01/25/2012 08:58 PM, Andrew Martin wrote:
> > Hello,
> >
> > Recently I finished configuring a two-node cluster with pacemaker 1.1.6
> > and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster includes
> > the following resources:
> > - primitives for DRBD storage devices
> > - primitives for mounting the filesystem on the DRBD storage
> > - primitives for some mount binds
> > - primitive for starting apache
> > - primitives for starting samba and nfs servers (following instructions
> > here <http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf>)
> > - primitives for exporting nfs shares (ocf:heartbeat:exportfs)
>
> not enough information ... please share at least your complete cluster
> configuration
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
> >
> > Perhaps this is best described through the output of crm_mon:
> > Online: [ node1 node2 ]
> >
> > Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
> > p_drbd_mount1:0 (ocf::linbit:drbd): Started node2
> (unmanaged)
> > p_drbd_mount1:1 (ocf::linbit:drbd): Started node1
> > (unmanaged) FAILED
> > Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> > p_drbd_mount2:0 (ocf::linbit:drbd): Master node1
> > (unmanaged) FAILED
> > Slaves: [ node2 ]
> > Resource Group: g_core
> > p_fs_mount1 (ocf::heartbeat:Filesystem): Started node1
> > p_fs_mount2 (ocf::heartbeat:Filesystem): Started node1
> > p_ip_nfs (ocf::heartbeat:IPaddr2): Started node1
> > Resource Group: g_apache
> > p_fs_mountbind1 (ocf::heartbeat:Filesystem): Started node1
> > p_fs_mountbind2 (ocf::heartbeat:Filesystem): Started node1
> > p_fs_mountbind3 (ocf::heartbeat:Filesystem): Started node1
> > p_fs_varwww (ocf::heartbeat:Filesystem): Started node1
> > p_apache (ocf::heartbeat:apache): Started node1
> > Resource Group: g_fileservers
> > p_lsb_smb (lsb:smbd): Started node1
> > p_lsb_nmb (lsb:nmbd): Started node1
> > p_lsb_nfsserver (lsb:nfs-kernel-server): Started node1
> > p_exportfs_mount1 (ocf::heartbeat:exportfs): Started node1
> > p_exportfs_mount2 (ocf::heartbeat:exportfs): Started node1
> >
> > I have read through the Pacemaker Explained
> > <
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained
> >
> > documentation, however could not find a way to further debug these
> > problems. First, I put node1 into standby mode to attempt failover to
> > the other node (node2). Node2 appeared to start the transition to
> > master, however it failed to promote the DRBD resources to master (the
> > first step). I have attached a copy of this session in commands.log and
> > additional excerpts from /var/log/syslog during important steps. I have
> > attempted everything I can think of to try and start the DRBD resource
> > (e.g. start/stop/promote/manage/cleanup under crm resource, restarting
> > heartbeat) but cannot bring it out of the slave state. However, if I set
> > it to unmanaged and then run drbdadm primary all in the terminal,
> > pacemaker is satisfied and continues starting the rest of the resources.
> > It then failed when attempting to mount the filesystem for mount2, the
> > p_fs_mount2 resource. I attempted to mount the filesystem myself and was
> > successful. I then unmounted it and ran cleanup on p_fs_mount2 and then
> > it mounted. The rest of the resources started as expected until the
> > p_exportfs_mount2 resource, which failed as follows:
> > p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2
> > (unmanaged) FAILED
> >
> > I ran cleanup on this and it started, however when running this test
> > earlier today no command could successfully start this exportfs
> resource.
> >
> > How can I configure pacemaker to better resolve these problems and be
> > able to bring the node up successfully on its own? What can I check to
> > determine why these failures are occuring? /var/log/syslog did not seem
> > to contain very much useful information regarding why the failures
> occurred.
> >
> > Thanks,
> >
> > Andrew
> >
> >
> >
> >
> > This body part will be downloaded on demand.
>
>
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>


--
esta es mi vida e me la vivo hasta que dios quiera


andreas at hastexo

Mar 26, 2012, 3:56 PM

Post #8 of 16 (1680 views)
Permalink
Re: Nodes will not promote DRBD resources to master on failover [In reply to]

On 03/24/2012 08:15 PM, Andrew Martin wrote:
> Hi Andreas,
>
> My complete cluster configuration is as follows:
> ============
> Last updated: Sat Mar 24 13:51:55 2012
> Last change: Sat Mar 24 13:41:55 2012
> Stack: Heartbeat
> Current DC: node2 (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition
> with quorum
> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
> 3 Nodes configured, unknown expected votes
> 19 Resources configured.
> ============
>
> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
> Online: [ node2 node1 ]
>
> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
> Masters: [ node2 ]
> Slaves: [ node1 ]
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
> Masters: [ node2 ]
> Slaves: [ node1 ]
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> Masters: [ node2 ]
> Slaves: [ node1 ]
> Resource Group: g_vm
> p_fs_vmstore(ocf::heartbeat:Filesystem):Started node2
> p_vm(ocf::heartbeat:VirtualDomain):Started node2
> Clone Set: cl_daemons [g_daemons]
> Started: [ node2 node1 ]
> Stopped: [ g_daemons:2 ]
> Clone Set: cl_sysadmin_notify [p_sysadmin_notify]
> Started: [ node2 node1 ]
> Stopped: [ p_sysadmin_notify:2 ]
> stonith-node1(stonith:external/tripplitepdu):Started node2
> stonith-node2(stonith:external/tripplitepdu):Started node1
> Clone Set: cl_ping [p_ping]
> Started: [ node2 node1 ]
> Stopped: [ p_ping:2 ]
>
> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \
> attributes standby="off"
> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \
> attributes standby="off"
> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \
> attributes standby="on"
> primitive p_drbd_mount2 ocf:linbit:drbd \
> params drbd_resource="mount2" \
> op monitor interval="15" role="Master" \
> op monitor interval="30" role="Slave"
> primitive p_drbd_mount1 ocf:linbit:drbd \
> params drbd_resource="mount1" \
> op monitor interval="15" role="Master" \
> op monitor interval="30" role="Slave"
> primitive p_drbd_vmstore ocf:linbit:drbd \
> params drbd_resource="vmstore" \
> op monitor interval="15" role="Master" \
> op monitor interval="30" role="Slave"
> primitive p_fs_vmstore ocf:heartbeat:Filesystem \
> params device="/dev/drbd0" directory="/vmstore" fstype="ext4" \
> op start interval="0" timeout="60s" \
> op stop interval="0" timeout="60s" \
> op monitor interval="20s" timeout="40s"
> primitive p_libvirt-bin upstart:libvirt-bin \
> op monitor interval="30"
> primitive p_ping ocf:pacemaker:ping \
> params name="p_ping" host_list="192.168.1.10 192.168.1.11"
> multiplier="1000" \
> op monitor interval="20s"
> primitive p_sysadmin_notify ocf:heartbeat:MailTo \
> params email="me [at] example" \
> params subject="Pacemaker Change" \
> op start interval="0" timeout="10" \
> op stop interval="0" timeout="10" \
> op monitor interval="10" timeout="10"
> primitive p_vm ocf:heartbeat:VirtualDomain \
> params config="/vmstore/config/vm.xml" \
> meta allow-migrate="false" \
> op start interval="0" timeout="120s" \
> op stop interval="0" timeout="120s" \
> op monitor interval="10" timeout="30"
> primitive stonith-node1 stonith:external/tripplitepdu \
> params pdu_ipaddr="192.168.1.12" pdu_port="1" pdu_username="xxx"
> pdu_password="xxx" hostname_to_stonith="node1"
> primitive stonith-node2 stonith:external/tripplitepdu \
> params pdu_ipaddr="192.168.1.12" pdu_port="2" pdu_username="xxx"
> pdu_password="xxx" hostname_to_stonith="node2"
> group g_daemons p_libvirt-bin
> group g_vm p_fs_vmstore p_vm
> ms ms_drbd_mount2 p_drbd_mount2 \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> ms ms_drbd_mount1 p_drbd_mount1 \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> ms ms_drbd_vmstore p_drbd_vmstore \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> clone cl_daemons g_daemons
> clone cl_ping p_ping \
> meta interleave="true"
> clone cl_sysadmin_notify p_sysadmin_notify
> location l-st-node1 stonith-node1 -inf: node1
> location l-st-node2 stonith-node2 -inf: node2
> location l_run_on_most_connected p_vm \
> rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping
> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master
> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm

As Emmanuel already said, g_vm has to be in the first place in this
collocation constraint .... g_vm must be colocated with the drbd masters.

> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote
> ms_drbd_mount2:promote cl_daemons:start g_vm:start
> property $id="cib-bootstrap-options" \
> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
> cluster-infrastructure="Heartbeat" \
> stonith-enabled="false" \
> no-quorum-policy="stop" \
> last-lrm-refresh="1332539900" \
> cluster-recheck-interval="5m" \
> crmd-integration-timeout="3m" \
> shutdown-escalation="5m"
>
> The STONITH plugin is a custom plugin I wrote for the Tripp-Lite
> PDUMH20ATNET that I'm using as the STONITH device:
> http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf

And why don't using it? .... stonith-enabled="false"

>
> As you can see, I left the DRBD service to be started by the operating
> system (as an lsb script at boot time) however Pacemaker controls
> actually bringing up/taking down the individual DRBD devices.

Don't start drbd on system boot, give Pacemaker the full control.

The
> behavior I observe is as follows: I issue "crm resource migrate p_vm" on
> node1 and failover successfully to node2. During this time, node2 fences
> node1's DRBD devices (using dopd) and marks them as Outdated. Meanwhile
> node2's DRBD devices are UpToDate. I then shutdown both nodes and then
> bring them back up. They reconnect to the cluster (with quorum), and
> node1's DRBD devices are still Outdated as expected and node2's DRBD
> devices are still UpToDate, as expected. At this point, DRBD starts on
> both nodes, however node2 will not set DRBD as master:
> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
> Online: [ node2 node1 ]
>
> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
> Slaves: [ node1 node2 ]
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
> Slaves: [ node1 node 2 ]
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> Slaves: [ node1 node2 ]

There should really be no interruption of the drbd replication on vm
migration that activates the dopd ... drbd has its own direct network
connection?

Please share your ha.cf file and your drbd configuration. Watch out for
drbd messages in your kernel log file, that should give you additional
information when/why the drbd connection was lost.

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

>
> I am having trouble sorting through the logging information because
> there is so much of it in /var/log/daemon.log, but I can't find an
> error message printed about why it will not promote node2. At this point
> the DRBD devices are as follows:
> node2: cstate = WFConnection dstate=UpToDate
> node1: cstate = StandAlone dstate=Outdated
>
> I don't see any reason why node2 can't become DRBD master, or am I
> missing something? If I do "drbdadm connect all" on node1, then the
> cstate on both nodes changes to "Connected" and node2 immediately
> promotes the DRBD resources to master. Any ideas on why I'm observing
> this incorrect behavior?
>
> Any tips on how I can better filter through the pacemaker/heartbeat logs
> or how to get additional useful debug information?
>
> Thanks,
>
> Andrew
>
> ------------------------------------------------------------------------
> *From: *"Andreas Kurz" <andreas [at] hastexo>
> *To: *pacemaker [at] oss
> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM
> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
> master on failover
>
> On 01/25/2012 08:58 PM, Andrew Martin wrote:
>> Hello,
>>
>> Recently I finished configuring a two-node cluster with pacemaker 1.1.6
>> and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster includes
>> the following resources:
>> - primitives for DRBD storage devices
>> - primitives for mounting the filesystem on the DRBD storage
>> - primitives for some mount binds
>> - primitive for starting apache
>> - primitives for starting samba and nfs servers (following instructions
>> here <http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf>)
>> - primitives for exporting nfs shares (ocf:heartbeat:exportfs)
>
> not enough information ... please share at least your complete cluster
> configuration
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
>>
>> Perhaps this is best described through the output of crm_mon:
>> Online: [ node1 node2 ]
>>
>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
>> p_drbd_mount1:0 (ocf::linbit:drbd): Started node2 (unmanaged)
>> p_drbd_mount1:1 (ocf::linbit:drbd): Started node1
>> (unmanaged) FAILED
>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>> p_drbd_mount2:0 (ocf::linbit:drbd): Master node1
>> (unmanaged) FAILED
>> Slaves: [ node2 ]
>> Resource Group: g_core
>> p_fs_mount1 (ocf::heartbeat:Filesystem): Started node1
>> p_fs_mount2 (ocf::heartbeat:Filesystem): Started node1
>> p_ip_nfs (ocf::heartbeat:IPaddr2): Started node1
>> Resource Group: g_apache
>> p_fs_mountbind1 (ocf::heartbeat:Filesystem): Started node1
>> p_fs_mountbind2 (ocf::heartbeat:Filesystem): Started node1
>> p_fs_mountbind3 (ocf::heartbeat:Filesystem): Started node1
>> p_fs_varwww (ocf::heartbeat:Filesystem): Started node1
>> p_apache (ocf::heartbeat:apache): Started node1
>> Resource Group: g_fileservers
>> p_lsb_smb (lsb:smbd): Started node1
>> p_lsb_nmb (lsb:nmbd): Started node1
>> p_lsb_nfsserver (lsb:nfs-kernel-server): Started node1
>> p_exportfs_mount1 (ocf::heartbeat:exportfs): Started node1
>> p_exportfs_mount2 (ocf::heartbeat:exportfs): Started node1
>>
>> I have read through the Pacemaker Explained
>>
> <http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained>
>> documentation, however could not find a way to further debug these
>> problems. First, I put node1 into standby mode to attempt failover to
>> the other node (node2). Node2 appeared to start the transition to
>> master, however it failed to promote the DRBD resources to master (the
>> first step). I have attached a copy of this session in commands.log and
>> additional excerpts from /var/log/syslog during important steps. I have
>> attempted everything I can think of to try and start the DRBD resource
>> (e.g. start/stop/promote/manage/cleanup under crm resource, restarting
>> heartbeat) but cannot bring it out of the slave state. However, if I set
>> it to unmanaged and then run drbdadm primary all in the terminal,
>> pacemaker is satisfied and continues starting the rest of the resources.
>> It then failed when attempting to mount the filesystem for mount2, the
>> p_fs_mount2 resource. I attempted to mount the filesystem myself and was
>> successful. I then unmounted it and ran cleanup on p_fs_mount2 and then
>> it mounted. The rest of the resources started as expected until the
>> p_exportfs_mount2 resource, which failed as follows:
>> p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2
>> (unmanaged) FAILED
>>
>> I ran cleanup on this and it started, however when running this test
>> earlier today no command could successfully start this exportfs resource.
>>
>> How can I configure pacemaker to better resolve these problems and be
>> able to bring the node up successfully on its own? What can I check to
>> determine why these failures are occuring? /var/log/syslog did not seem
>> to contain very much useful information regarding why the failures
> occurred.
>>
>> Thanks,
>>
>> Andrew
>>
>>
>>
>>
>> This body part will be downloaded on demand.
>
>
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
Attachments: signature.asc (0.22 KB)


amartin at xes-inc

Mar 27, 2012, 3:13 PM

Post #9 of 16 (1677 views)
Permalink
Re: Nodes will not promote DRBD resources to master on failover [In reply to]

Hi Andreas,


Thanks, I've updated the colocation rule to be in the correct order. I also enabled the STONITH resource (this was temporarily disabled before for some additional testing). DRBD has its own network connection over the br1 interface (192.168.5.0/24 network), a direct crossover cable between node1 and node2:

global { usage-count no; }
common {
syncer { rate 110M; }
}
resource vmstore {
protocol C;
startup {
wfc-timeout 15;
degr-wfc-timeout 60;
}
handlers {
#fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
fence-peer "/usr/local/bin/fence-peer";
split-brain "/usr/lib/drbd/notify-split-brain.sh me [at] example";
}
net {
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
cram-hmac-alg md5;
shared-secret "xxxxx";
}
disk {
fencing resource-only;
}
on node1 {
device /dev/drbd0;
disk /dev/sdb1;
address 192.168.5.10:7787;
meta-disk internal;
}
on node2 {
device /dev/drbd0;
disk /dev/sdf1;
address 192.168.5.11:7787;
meta-disk internal;
}
}
# and similar for mount1 and mount2


Also, here is my ha.cf. It uses both the direct link between the nodes (br1) and the shared LAN network on br0 for communicating:

autojoin none
mcast br0 239.0.0.43 694 1 0
bcast br1
warntime 5
deadtime 15
initdead 60
keepalive 2
node node1
node node2
node quorumnode
crm respawn
respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster


I am thinking of making the following changes to the CIB (as per the official DRBD guide http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html ) in order to add the DRBD lsb service and require that it start before the ocf:linbit:drbd resources. Does this look correct?
primitive p_drbd-init lsb:drbd op monitor interval="30"
colocation c_drbd_together inf: p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master ms_drbd_mount2:Master
order drbd_init_first inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start


This doesn't seem to require that drbd be also running on the node where the ocf:linbit:drbd resources are slave (which it would need to do to be a DRBD SyncTarget) - how can I ensure that drbd is running everywhere? (clone cl_drbd p_drbd-init ?)


Thanks,


Andrew ----- Original Message -----

From: "Andreas Kurz" <andreas [at] hastexo>
To: pacemaker [at] oss
Sent: Monday, March 26, 2012 5:56:22 PM
Subject: Re: [Pacemaker] Nodes will not promote DRBD resources to master on failover

On 03/24/2012 08:15 PM, Andrew Martin wrote:
> Hi Andreas,
>
> My complete cluster configuration is as follows:
> ============
> Last updated: Sat Mar 24 13:51:55 2012
> Last change: Sat Mar 24 13:41:55 2012
> Stack: Heartbeat
> Current DC: node2 (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition
> with quorum
> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
> 3 Nodes configured, unknown expected votes
> 19 Resources configured.
> ============
>
> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
> Online: [ node2 node1 ]
>
> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
> Masters: [ node2 ]
> Slaves: [ node1 ]
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
> Masters: [ node2 ]
> Slaves: [ node1 ]
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> Masters: [ node2 ]
> Slaves: [ node1 ]
> Resource Group: g_vm
> p_fs_vmstore(ocf::heartbeat:Filesystem):Started node2
> p_vm(ocf::heartbeat:VirtualDomain):Started node2
> Clone Set: cl_daemons [g_daemons]
> Started: [ node2 node1 ]
> Stopped: [ g_daemons:2 ]
> Clone Set: cl_sysadmin_notify [p_sysadmin_notify]
> Started: [ node2 node1 ]
> Stopped: [ p_sysadmin_notify:2 ]
> stonith-node1(stonith:external/tripplitepdu):Started node2
> stonith-node2(stonith:external/tripplitepdu):Started node1
> Clone Set: cl_ping [p_ping]
> Started: [ node2 node1 ]
> Stopped: [ p_ping:2 ]
>
> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \
> attributes standby="off"
> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \
> attributes standby="off"
> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \
> attributes standby="on"
> primitive p_drbd_mount2 ocf:linbit:drbd \
> params drbd_resource="mount2" \
> op monitor interval="15" role="Master" \
> op monitor interval="30" role="Slave"
> primitive p_drbd_mount1 ocf:linbit:drbd \
> params drbd_resource="mount1" \
> op monitor interval="15" role="Master" \
> op monitor interval="30" role="Slave"
> primitive p_drbd_vmstore ocf:linbit:drbd \
> params drbd_resource="vmstore" \
> op monitor interval="15" role="Master" \
> op monitor interval="30" role="Slave"
> primitive p_fs_vmstore ocf:heartbeat:Filesystem \
> params device="/dev/drbd0" directory="/vmstore" fstype="ext4" \
> op start interval="0" timeout="60s" \
> op stop interval="0" timeout="60s" \
> op monitor interval="20s" timeout="40s"
> primitive p_libvirt-bin upstart:libvirt-bin \
> op monitor interval="30"
> primitive p_ping ocf:pacemaker:ping \
> params name="p_ping" host_list="192.168.1.10 192.168.1.11"
> multiplier="1000" \
> op monitor interval="20s"
> primitive p_sysadmin_notify ocf:heartbeat:MailTo \
> params email="me [at] example" \
> params subject="Pacemaker Change" \
> op start interval="0" timeout="10" \
> op stop interval="0" timeout="10" \
> op monitor interval="10" timeout="10"
> primitive p_vm ocf:heartbeat:VirtualDomain \
> params config="/vmstore/config/vm.xml" \
> meta allow-migrate="false" \
> op start interval="0" timeout="120s" \
> op stop interval="0" timeout="120s" \
> op monitor interval="10" timeout="30"
> primitive stonith-node1 stonith:external/tripplitepdu \
> params pdu_ipaddr="192.168.1.12" pdu_port="1" pdu_username="xxx"
> pdu_password="xxx" hostname_to_stonith="node1"
> primitive stonith-node2 stonith:external/tripplitepdu \
> params pdu_ipaddr="192.168.1.12" pdu_port="2" pdu_username="xxx"
> pdu_password="xxx" hostname_to_stonith="node2"
> group g_daemons p_libvirt-bin
> group g_vm p_fs_vmstore p_vm
> ms ms_drbd_mount2 p_drbd_mount2 \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> ms ms_drbd_mount1 p_drbd_mount1 \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> ms ms_drbd_vmstore p_drbd_vmstore \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> clone cl_daemons g_daemons
> clone cl_ping p_ping \
> meta interleave="true"
> clone cl_sysadmin_notify p_sysadmin_notify
> location l-st-node1 stonith-node1 -inf: node1
> location l-st-node2 stonith-node2 -inf: node2
> location l_run_on_most_connected p_vm \
> rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping
> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master
> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm

As Emmanuel already said, g_vm has to be in the first place in this
collocation constraint .... g_vm must be colocated with the drbd masters.

> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote
> ms_drbd_mount2:promote cl_daemons:start g_vm:start
> property $id="cib-bootstrap-options" \
> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
> cluster-infrastructure="Heartbeat" \
> stonith-enabled="false" \
> no-quorum-policy="stop" \
> last-lrm-refresh="1332539900" \
> cluster-recheck-interval="5m" \
> crmd-integration-timeout="3m" \
> shutdown-escalation="5m"
>
> The STONITH plugin is a custom plugin I wrote for the Tripp-Lite
> PDUMH20ATNET that I'm using as the STONITH device:
> http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf

And why don't using it? .... stonith-enabled="false"

>
> As you can see, I left the DRBD service to be started by the operating
> system (as an lsb script at boot time) however Pacemaker controls
> actually bringing up/taking down the individual DRBD devices.

Don't start drbd on system boot, give Pacemaker the full control.

The
> behavior I observe is as follows: I issue "crm resource migrate p_vm" on
> node1 and failover successfully to node2. During this time, node2 fences
> node1's DRBD devices (using dopd) and marks them as Outdated. Meanwhile
> node2's DRBD devices are UpToDate. I then shutdown both nodes and then
> bring them back up. They reconnect to the cluster (with quorum), and
> node1's DRBD devices are still Outdated as expected and node2's DRBD
> devices are still UpToDate, as expected. At this point, DRBD starts on
> both nodes, however node2 will not set DRBD as master:
> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
> Online: [ node2 node1 ]
>
> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
> Slaves: [ node1 node2 ]
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
> Slaves: [ node1 node 2 ]
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> Slaves: [ node1 node2 ]

There should really be no interruption of the drbd replication on vm
migration that activates the dopd ... drbd has its own direct network
connection?

Please share your ha.cf file and your drbd configuration. Watch out for
drbd messages in your kernel log file, that should give you additional
information when/why the drbd connection was lost.

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

>
> I am having trouble sorting through the logging information because
> there is so much of it in /var/log/daemon.log, but I can't find an
> error message printed about why it will not promote node2. At this point
> the DRBD devices are as follows:
> node2: cstate = WFConnection dstate=UpToDate
> node1: cstate = StandAlone dstate=Outdated
>
> I don't see any reason why node2 can't become DRBD master, or am I
> missing something? If I do "drbdadm connect all" on node1, then the
> cstate on both nodes changes to "Connected" and node2 immediately
> promotes the DRBD resources to master. Any ideas on why I'm observing
> this incorrect behavior?
>
> Any tips on how I can better filter through the pacemaker/heartbeat logs
> or how to get additional useful debug information?
>
> Thanks,
>
> Andrew
>
> ------------------------------------------------------------------------
> *From: *"Andreas Kurz" <andreas [at] hastexo>
> *To: *pacemaker [at] oss
> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM
> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
> master on failover
>
> On 01/25/2012 08:58 PM, Andrew Martin wrote:
>> Hello,
>>
>> Recently I finished configuring a two-node cluster with pacemaker 1.1.6
>> and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster includes
>> the following resources:
>> - primitives for DRBD storage devices
>> - primitives for mounting the filesystem on the DRBD storage
>> - primitives for some mount binds
>> - primitive for starting apache
>> - primitives for starting samba and nfs servers (following instructions
>> here <http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf>)
>> - primitives for exporting nfs shares (ocf:heartbeat:exportfs)
>
> not enough information ... please share at least your complete cluster
> configuration
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
>>
>> Perhaps this is best described through the output of crm_mon:
>> Online: [ node1 node2 ]
>>
>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
>> p_drbd_mount1:0 (ocf::linbit:drbd): Started node2 (unmanaged)
>> p_drbd_mount1:1 (ocf::linbit:drbd): Started node1
>> (unmanaged) FAILED
>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>> p_drbd_mount2:0 (ocf::linbit:drbd): Master node1
>> (unmanaged) FAILED
>> Slaves: [ node2 ]
>> Resource Group: g_core
>> p_fs_mount1 (ocf::heartbeat:Filesystem): Started node1
>> p_fs_mount2 (ocf::heartbeat:Filesystem): Started node1
>> p_ip_nfs (ocf::heartbeat:IPaddr2): Started node1
>> Resource Group: g_apache
>> p_fs_mountbind1 (ocf::heartbeat:Filesystem): Started node1
>> p_fs_mountbind2 (ocf::heartbeat:Filesystem): Started node1
>> p_fs_mountbind3 (ocf::heartbeat:Filesystem): Started node1
>> p_fs_varwww (ocf::heartbeat:Filesystem): Started node1
>> p_apache (ocf::heartbeat:apache): Started node1
>> Resource Group: g_fileservers
>> p_lsb_smb (lsb:smbd): Started node1
>> p_lsb_nmb (lsb:nmbd): Started node1
>> p_lsb_nfsserver (lsb:nfs-kernel-server): Started node1
>> p_exportfs_mount1 (ocf::heartbeat:exportfs): Started node1
>> p_exportfs_mount2 (ocf::heartbeat:exportfs): Started node1
>>
>> I have read through the Pacemaker Explained
>>
> <http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained>
>> documentation, however could not find a way to further debug these
>> problems. First, I put node1 into standby mode to attempt failover to
>> the other node (node2). Node2 appeared to start the transition to
>> master, however it failed to promote the DRBD resources to master (the
>> first step). I have attached a copy of this session in commands.log and
>> additional excerpts from /var/log/syslog during important steps. I have
>> attempted everything I can think of to try and start the DRBD resource
>> (e.g. start/stop/promote/manage/cleanup under crm resource, restarting
>> heartbeat) but cannot bring it out of the slave state. However, if I set
>> it to unmanaged and then run drbdadm primary all in the terminal,
>> pacemaker is satisfied and continues starting the rest of the resources.
>> It then failed when attempting to mount the filesystem for mount2, the
>> p_fs_mount2 resource. I attempted to mount the filesystem myself and was
>> successful. I then unmounted it and ran cleanup on p_fs_mount2 and then
>> it mounted. The rest of the resources started as expected until the
>> p_exportfs_mount2 resource, which failed as follows:
>> p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2
>> (unmanaged) FAILED
>>
>> I ran cleanup on this and it started, however when running this test
>> earlier today no command could successfully start this exportfs resource.
>>
>> How can I configure pacemaker to better resolve these problems and be
>> able to bring the node up successfully on its own? What can I check to
>> determine why these failures are occuring? /var/log/syslog did not seem
>> to contain very much useful information regarding why the failures
> occurred.
>>
>> Thanks,
>>
>> Andrew
>>
>>
>>
>>
>> This body part will be downloaded on demand.
>
>
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


andreas at hastexo

Mar 28, 2012, 5:27 AM

Post #10 of 16 (1670 views)
Permalink
Re: Nodes will not promote DRBD resources to master on failover [In reply to]

On 03/28/2012 12:13 AM, Andrew Martin wrote:
> Hi Andreas,
>
> Thanks, I've updated the colocation rule to be in the correct order. I
> also enabled the STONITH resource (this was temporarily disabled before
> for some additional testing). DRBD has its own network connection over
> the br1 interface (192.168.5.0/24 network), a direct crossover cable
> between node1 and node2:
> global { usage-count no; }
> common {
> syncer { rate 110M; }
> }
> resource vmstore {
> protocol C;
> startup {
> wfc-timeout 15;
> degr-wfc-timeout 60;
> }
> handlers {
> #fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
> fence-peer "/usr/local/bin/fence-peer";

hmm ... what is that fence-peer script doing? If you want to use
resource-level fencing with the help of dopd, activate the
drbd-peer-outdater script in the line above ... and double check if the
path is correct

> split-brain "/usr/lib/drbd/notify-split-brain.sh
> me [at] example";
> }
> net {
> after-sb-0pri discard-zero-changes;
> after-sb-1pri discard-secondary;
> after-sb-2pri disconnect;
> cram-hmac-alg md5;
> shared-secret "xxxxx";
> }
> disk {
> fencing resource-only;
> }
> on node1 {
> device /dev/drbd0;
> disk /dev/sdb1;
> address 192.168.5.10:7787;
> meta-disk internal;
> }
> on node2 {
> device /dev/drbd0;
> disk /dev/sdf1;
> address 192.168.5.11:7787;
> meta-disk internal;
> }
> }
> # and similar for mount1 and mount2
>
> Also, here is my ha.cf. It uses both the direct link between the nodes
> (br1) and the shared LAN network on br0 for communicating:
> autojoin none
> mcast br0 239.0.0.43 694 1 0
> bcast br1
> warntime 5
> deadtime 15
> initdead 60
> keepalive 2
> node node1
> node node2
> node quorumnode
> crm respawn
> respawn hacluster /usr/lib/heartbeat/dopd
> apiauth dopd gid=haclient uid=hacluster
>
> I am thinking of making the following changes to the CIB (as per the
> official DRBD
> guide http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html) in
> order to add the DRBD lsb service and require that it start before the
> ocf:linbit:drbd resources. Does this look correct?

Where did you read that? No, deactivate the startup of DRBD on system
boot and let Pacemaker manage it completely.

> primitive p_drbd-init lsb:drbd op monitor interval="30"
> colocation c_drbd_together inf:
> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master
> ms_drbd_mount2:Master
> order drbd_init_first inf: ms_drbd_vmstore:promote
> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start
>
> This doesn't seem to require that drbd be also running on the node where
> the ocf:linbit:drbd resources are slave (which it would need to do to be
> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere?
> (clone cl_drbd p_drbd-init ?)

This is really not needed.

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

>
> Thanks,
>
> Andrew
> ------------------------------------------------------------------------
> *From: *"Andreas Kurz" <andreas [at] hastexo>
> *To: *pacemaker [at] oss
> *Sent: *Monday, March 26, 2012 5:56:22 PM
> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
> master on failover
>
> On 03/24/2012 08:15 PM, Andrew Martin wrote:
>> Hi Andreas,
>>
>> My complete cluster configuration is as follows:
>> ============
>> Last updated: Sat Mar 24 13:51:55 2012
>> Last change: Sat Mar 24 13:41:55 2012
>> Stack: Heartbeat
>> Current DC: node2 (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition
>> with quorum
>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
>> 3 Nodes configured, unknown expected votes
>> 19 Resources configured.
>> ============
>>
>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
>> Online: [ node2 node1 ]
>>
>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>> Masters: [ node2 ]
>> Slaves: [ node1 ]
>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>> Masters: [ node2 ]
>> Slaves: [ node1 ]
>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>> Masters: [ node2 ]
>> Slaves: [ node1 ]
>> Resource Group: g_vm
>> p_fs_vmstore(ocf::heartbeat:Filesystem):Started node2
>> p_vm(ocf::heartbeat:VirtualDomain):Started node2
>> Clone Set: cl_daemons [g_daemons]
>> Started: [ node2 node1 ]
>> Stopped: [ g_daemons:2 ]
>> Clone Set: cl_sysadmin_notify [p_sysadmin_notify]
>> Started: [ node2 node1 ]
>> Stopped: [ p_sysadmin_notify:2 ]
>> stonith-node1(stonith:external/tripplitepdu):Started node2
>> stonith-node2(stonith:external/tripplitepdu):Started node1
>> Clone Set: cl_ping [p_ping]
>> Started: [ node2 node1 ]
>> Stopped: [ p_ping:2 ]
>>
>> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \
>> attributes standby="off"
>> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \
>> attributes standby="off"
>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \
>> attributes standby="on"
>> primitive p_drbd_mount2 ocf:linbit:drbd \
>> params drbd_resource="mount2" \
>> op monitor interval="15" role="Master" \
>> op monitor interval="30" role="Slave"
>> primitive p_drbd_mount1 ocf:linbit:drbd \
>> params drbd_resource="mount1" \
>> op monitor interval="15" role="Master" \
>> op monitor interval="30" role="Slave"
>> primitive p_drbd_vmstore ocf:linbit:drbd \
>> params drbd_resource="vmstore" \
>> op monitor interval="15" role="Master" \
>> op monitor interval="30" role="Slave"
>> primitive p_fs_vmstore ocf:heartbeat:Filesystem \
>> params device="/dev/drbd0" directory="/vmstore" fstype="ext4" \
>> op start interval="0" timeout="60s" \
>> op stop interval="0" timeout="60s" \
>> op monitor interval="20s" timeout="40s"
>> primitive p_libvirt-bin upstart:libvirt-bin \
>> op monitor interval="30"
>> primitive p_ping ocf:pacemaker:ping \
>> params name="p_ping" host_list="192.168.1.10 192.168.1.11"
>> multiplier="1000" \
>> op monitor interval="20s"
>> primitive p_sysadmin_notify ocf:heartbeat:MailTo \
>> params email="me [at] example" \
>> params subject="Pacemaker Change" \
>> op start interval="0" timeout="10" \
>> op stop interval="0" timeout="10" \
>> op monitor interval="10" timeout="10"
>> primitive p_vm ocf:heartbeat:VirtualDomain \
>> params config="/vmstore/config/vm.xml" \
>> meta allow-migrate="false" \
>> op start interval="0" timeout="120s" \
>> op stop interval="0" timeout="120s" \
>> op monitor interval="10" timeout="30"
>> primitive stonith-node1 stonith:external/tripplitepdu \
>> params pdu_ipaddr="192.168.1.12" pdu_port="1" pdu_username="xxx"
>> pdu_password="xxx" hostname_to_stonith="node1"
>> primitive stonith-node2 stonith:external/tripplitepdu \
>> params pdu_ipaddr="192.168.1.12" pdu_port="2" pdu_username="xxx"
>> pdu_password="xxx" hostname_to_stonith="node2"
>> group g_daemons p_libvirt-bin
>> group g_vm p_fs_vmstore p_vm
>> ms ms_drbd_mount2 p_drbd_mount2 \
>> meta master-max="1" master-node-max="1" clone-max="2"
>> clone-node-max="1" notify="true"
>> ms ms_drbd_mount1 p_drbd_mount1 \
>> meta master-max="1" master-node-max="1" clone-max="2"
>> clone-node-max="1" notify="true"
>> ms ms_drbd_vmstore p_drbd_vmstore \
>> meta master-max="1" master-node-max="1" clone-max="2"
>> clone-node-max="1" notify="true"
>> clone cl_daemons g_daemons
>> clone cl_ping p_ping \
>> meta interleave="true"
>> clone cl_sysadmin_notify p_sysadmin_notify
>> location l-st-node1 stonith-node1 -inf: node1
>> location l-st-node2 stonith-node2 -inf: node2
>> location l_run_on_most_connected p_vm \
>> rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping
>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master
>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
>
> As Emmanuel already said, g_vm has to be in the first place in this
> collocation constraint .... g_vm must be colocated with the drbd masters.
>
>> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote
>> ms_drbd_mount2:promote cl_daemons:start g_vm:start
>> property $id="cib-bootstrap-options" \
>> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
>> cluster-infrastructure="Heartbeat" \
>> stonith-enabled="false" \
>> no-quorum-policy="stop" \
>> last-lrm-refresh="1332539900" \
>> cluster-recheck-interval="5m" \
>> crmd-integration-timeout="3m" \
>> shutdown-escalation="5m"
>>
>> The STONITH plugin is a custom plugin I wrote for the Tripp-Lite
>> PDUMH20ATNET that I'm using as the STONITH device:
>> http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf
>
> And why don't using it? .... stonith-enabled="false"
>
>>
>> As you can see, I left the DRBD service to be started by the operating
>> system (as an lsb script at boot time) however Pacemaker controls
>> actually bringing up/taking down the individual DRBD devices.
>
> Don't start drbd on system boot, give Pacemaker the full control.
>
> The
>> behavior I observe is as follows: I issue "crm resource migrate p_vm" on
>> node1 and failover successfully to node2. During this time, node2 fences
>> node1's DRBD devices (using dopd) and marks them as Outdated. Meanwhile
>> node2's DRBD devices are UpToDate. I then shutdown both nodes and then
>> bring them back up. They reconnect to the cluster (with quorum), and
>> node1's DRBD devices are still Outdated as expected and node2's DRBD
>> devices are still UpToDate, as expected. At this point, DRBD starts on
>> both nodes, however node2 will not set DRBD as master:
>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
>> Online: [ node2 node1 ]
>>
>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>> Slaves: [ node1 node2 ]
>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>> Slaves: [ node1 node 2 ]
>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>> Slaves: [ node1 node2 ]
>
> There should really be no interruption of the drbd replication on vm
> migration that activates the dopd ... drbd has its own direct network
> connection?
>
> Please share your ha.cf file and your drbd configuration. Watch out for
> drbd messages in your kernel log file, that should give you additional
> information when/why the drbd connection was lost.
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
>>
>> I am having trouble sorting through the logging information because
>> there is so much of it in /var/log/daemon.log, but I can't find an
>> error message printed about why it will not promote node2. At this point
>> the DRBD devices are as follows:
>> node2: cstate = WFConnection dstate=UpToDate
>> node1: cstate = StandAlone dstate=Outdated
>>
>> I don't see any reason why node2 can't become DRBD master, or am I
>> missing something? If I do "drbdadm connect all" on node1, then the
>> cstate on both nodes changes to "Connected" and node2 immediately
>> promotes the DRBD resources to master. Any ideas on why I'm observing
>> this incorrect behavior?
>>
>> Any tips on how I can better filter through the pacemaker/heartbeat logs
>> or how to get additional useful debug information?
>>
>> Thanks,
>>
>> Andrew
>>
>> ------------------------------------------------------------------------
>> *From: *"Andreas Kurz" <andreas [at] hastexo>
>> *To: *pacemaker [at] oss
>> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM
>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
>> master on failover
>>
>> On 01/25/2012 08:58 PM, Andrew Martin wrote:
>>> Hello,
>>>
>>> Recently I finished configuring a two-node cluster with pacemaker 1.1.6
>>> and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster includes
>>> the following resources:
>>> - primitives for DRBD storage devices
>>> - primitives for mounting the filesystem on the DRBD storage
>>> - primitives for some mount binds
>>> - primitive for starting apache
>>> - primitives for starting samba and nfs servers (following instructions
>>> here <http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf>)
>>> - primitives for exporting nfs shares (ocf:heartbeat:exportfs)
>>
>> not enough information ... please share at least your complete cluster
>> configuration
>>
>> Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>>
>>> Perhaps this is best described through the output of crm_mon:
>>> Online: [ node1 node2 ]
>>>
>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
>>> p_drbd_mount1:0 (ocf::linbit:drbd): Started node2
> (unmanaged)
>>> p_drbd_mount1:1 (ocf::linbit:drbd): Started node1
>>> (unmanaged) FAILED
>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>> p_drbd_mount2:0 (ocf::linbit:drbd): Master node1
>>> (unmanaged) FAILED
>>> Slaves: [ node2 ]
>>> Resource Group: g_core
>>> p_fs_mount1 (ocf::heartbeat:Filesystem): Started node1
>>> p_fs_mount2 (ocf::heartbeat:Filesystem): Started node1
>>> p_ip_nfs (ocf::heartbeat:IPaddr2): Started node1
>>> Resource Group: g_apache
>>> p_fs_mountbind1 (ocf::heartbeat:Filesystem): Started node1
>>> p_fs_mountbind2 (ocf::heartbeat:Filesystem): Started node1
>>> p_fs_mountbind3 (ocf::heartbeat:Filesystem): Started node1
>>> p_fs_varwww (ocf::heartbeat:Filesystem): Started node1
>>> p_apache (ocf::heartbeat:apache): Started node1
>>> Resource Group: g_fileservers
>>> p_lsb_smb (lsb:smbd): Started node1
>>> p_lsb_nmb (lsb:nmbd): Started node1
>>> p_lsb_nfsserver (lsb:nfs-kernel-server): Started node1
>>> p_exportfs_mount1 (ocf::heartbeat:exportfs): Started node1
>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): Started node1
>>>
>>> I have read through the Pacemaker Explained
>>>
>>
> <http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained>
>>> documentation, however could not find a way to further debug these
>>> problems. First, I put node1 into standby mode to attempt failover to
>>> the other node (node2). Node2 appeared to start the transition to
>>> master, however it failed to promote the DRBD resources to master (the
>>> first step). I have attached a copy of this session in commands.log and
>>> additional excerpts from /var/log/syslog during important steps. I have
>>> attempted everything I can think of to try and start the DRBD resource
>>> (e.g. start/stop/promote/manage/cleanup under crm resource, restarting
>>> heartbeat) but cannot bring it out of the slave state. However, if I set
>>> it to unmanaged and then run drbdadm primary all in the terminal,
>>> pacemaker is satisfied and continues starting the rest of the resources.
>>> It then failed when attempting to mount the filesystem for mount2, the
>>> p_fs_mount2 resource. I attempted to mount the filesystem myself and was
>>> successful. I then unmounted it and ran cleanup on p_fs_mount2 and then
>>> it mounted. The rest of the resources started as expected until the
>>> p_exportfs_mount2 resource, which failed as follows:
>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2
>>> (unmanaged) FAILED
>>>
>>> I ran cleanup on this and it started, however when running this test
>>> earlier today no command could successfully start this exportfs resource.
>>>
>>> How can I configure pacemaker to better resolve these problems and be
>>> able to bring the node up successfully on its own? What can I check to
>>> determine why these failures are occuring? /var/log/syslog did not seem
>>> to contain very much useful information regarding why the failures
>> occurred.
>>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>>
>>>
>>>
>>> This body part will be downloaded on demand.
>>
>>
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker [at] oss
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker [at] oss
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
Attachments: signature.asc (0.22 KB)


amartin at xes-inc

Mar 28, 2012, 6:47 AM

Post #11 of 16 (1674 views)
Permalink
Re: Nodes will not promote DRBD resources to master on failover [In reply to]

Hi Andreas,


> hmm ... what is that fence-peer script doing? If you want to use
> resource-level fencing with the help of dopd, activate the
> drbd-peer-outdater script in the line above ... and double check if the
> path is correct
fence-peer is just a wrapper for drbd-peer-outdater that does some additional logging. In my testing dopd has been working well.



>> I am thinking of making the following changes to the CIB (as per the
>> official DRBD
>> guide http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html) in
>> order to add the DRBD lsb service and require that it start before the
>> ocf:linbit:drbd resources. Does this look correct?
>
> Where did you read that? No, deactivate the startup of DRBD on system
> boot and let Pacemaker manage it completely.
>
>> primitive p_drbd-init lsb:drbd op monitor interval="30"
>> colocation c_drbd_together inf:
>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master
>> ms_drbd_mount2:Master
>> order drbd_init_first inf: ms_drbd_vmstore:promote
>> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start
>>
>> This doesn't seem to require that drbd be also running on the node where
>> the ocf:linbit:drbd resources are slave (which it would need to do to be
>> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere?
>> (clone cl_drbd p_drbd-init ?)
>
> This is really not needed.
I was following the official DRBD Users Guide:
http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html


If I am understanding your previous message correctly, I do not need to add a lsb primitive for the drbd daemon? It will be started/stopped/managed automatically by my ocf:linbit:drbd resources (and I can remove the /etc/rc* symlinks)?


Thanks,


Andrew
----- Original Message -----

From: "Andreas Kurz" < andreas [at] hastexo >
To: pacemaker [at] oss
Sent: Wednesday, March 28, 2012 7:27:34 AM
Subject: Re: [Pacemaker] Nodes will not promote DRBD resources to master on failover

On 03/28/2012 12:13 AM, Andrew Martin wrote:
> Hi Andreas,
>
> Thanks, I've updated the colocation rule to be in the correct order. I
> also enabled the STONITH resource (this was temporarily disabled before
> for some additional testing). DRBD has its own network connection over
> the br1 interface (192.168.5.0/24 network), a direct crossover cable
> between node1 and node2:
> global { usage-count no; }
> common {
> syncer { rate 110M; }
> }
> resource vmstore {
> protocol C;
> startup {
> wfc-timeout 15;
> degr-wfc-timeout 60;
> }
> handlers {
> #fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
> fence-peer "/usr/local/bin/fence-peer";

hmm ... what is that fence-peer script doing? If you want to use
resource-level fencing with the help of dopd, activate the
drbd-peer-outdater script in the line above ... and double check if the
path is correct

> split-brain "/usr/lib/drbd/notify-split-brain.sh
> me [at] example ";
> }
> net {
> after-sb-0pri discard-zero-changes;
> after-sb-1pri discard-secondary;
> after-sb-2pri disconnect;
> cram-hmac-alg md5;
> shared-secret "xxxxx";
> }
> disk {
> fencing resource-only;
> }
> on node1 {
> device /dev/drbd0;
> disk /dev/sdb1;
> address 192.168.5.10:7787;
> meta-disk internal;
> }
> on node2 {
> device /dev/drbd0;
> disk /dev/sdf1;
> address 192.168.5.11:7787;
> meta-disk internal;
> }
> }
> # and similar for mount1 and mount2
>
> Also, here is my ha.cf. It uses both the direct link between the nodes
> (br1) and the shared LAN network on br0 for communicating:
> autojoin none
> mcast br0 239.0.0.43 694 1 0
> bcast br1
> warntime 5
> deadtime 15
> initdead 60
> keepalive 2
> node node1
> node node2
> node quorumnode
> crm respawn
> respawn hacluster /usr/lib/heartbeat/dopd
> apiauth dopd gid=haclient uid=hacluster
>
> I am thinking of making the following changes to the CIB (as per the
> official DRBD
> guide http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html ) in
> order to add the DRBD lsb service and require that it start before the
> ocf:linbit:drbd resources. Does this look correct?

Where did you read that? No, deactivate the startup of DRBD on system
boot and let Pacemaker manage it completely.

> primitive p_drbd-init lsb:drbd op monitor interval="30"
> colocation c_drbd_together inf:
> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master
> ms_drbd_mount2:Master
> order drbd_init_first inf: ms_drbd_vmstore:promote
> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start
>
> This doesn't seem to require that drbd be also running on the node where
> the ocf:linbit:drbd resources are slave (which it would need to do to be
> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere?
> (clone cl_drbd p_drbd-init ?)

This is really not needed.

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

>
> Thanks,
>
> Andrew
> ------------------------------------------------------------------------
> *From: *"Andreas Kurz" < andreas [at] hastexo >
> *To: *pacemaker [at] oss
> *Sent: *Monday, March 26, 2012 5:56:22 PM
> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
> master on failover
>
> On 03/24/2012 08:15 PM, Andrew Martin wrote:
>> Hi Andreas,
>>
>> My complete cluster configuration is as follows:
>> ============
>> Last updated: Sat Mar 24 13:51:55 2012
>> Last change: Sat Mar 24 13:41:55 2012
>> Stack: Heartbeat
>> Current DC: node2 (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition
>> with quorum
>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
>> 3 Nodes configured, unknown expected votes
>> 19 Resources configured.
>> ============
>>
>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
>> Online: [ node2 node1 ]
>>
>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>> Masters: [ node2 ]
>> Slaves: [ node1 ]
>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>> Masters: [ node2 ]
>> Slaves: [ node1 ]
>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>> Masters: [ node2 ]
>> Slaves: [ node1 ]
>> Resource Group: g_vm
>> p_fs_vmstore(ocf::heartbeat:Filesystem):Started node2
>> p_vm(ocf::heartbeat:VirtualDomain):Started node2
>> Clone Set: cl_daemons [g_daemons]
>> Started: [ node2 node1 ]
>> Stopped: [ g_daemons:2 ]
>> Clone Set: cl_sysadmin_notify [p_sysadmin_notify]
>> Started: [ node2 node1 ]
>> Stopped: [ p_sysadmin_notify:2 ]
>> stonith-node1(stonith:external/tripplitepdu):Started node2
>> stonith-node2(stonith:external/tripplitepdu):Started node1
>> Clone Set: cl_ping [p_ping]
>> Started: [ node2 node1 ]
>> Stopped: [ p_ping:2 ]
>>
>> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \
>> attributes standby="off"
>> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \
>> attributes standby="off"
>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \
>> attributes standby="on"
>> primitive p_drbd_mount2 ocf:linbit:drbd \
>> params drbd_resource="mount2" \
>> op monitor interval="15" role="Master" \
>> op monitor interval="30" role="Slave"
>> primitive p_drbd_mount1 ocf:linbit:drbd \
>> params drbd_resource="mount1" \
>> op monitor interval="15" role="Master" \
>> op monitor interval="30" role="Slave"
>> primitive p_drbd_vmstore ocf:linbit:drbd \
>> params drbd_resource="vmstore" \
>> op monitor interval="15" role="Master" \
>> op monitor interval="30" role="Slave"
>> primitive p_fs_vmstore ocf:heartbeat:Filesystem \
>> params device="/dev/drbd0" directory="/vmstore" fstype="ext4" \
>> op start interval="0" timeout="60s" \
>> op stop interval="0" timeout="60s" \
>> op monitor interval="20s" timeout="40s"
>> primitive p_libvirt-bin upstart:libvirt-bin \
>> op monitor interval="30"
>> primitive p_ping ocf:pacemaker:ping \
>> params name="p_ping" host_list="192.168.1.10 192.168.1.11"
>> multiplier="1000" \
>> op monitor interval="20s"
>> primitive p_sysadmin_notify ocf:heartbeat:MailTo \
>> params email=" me [at] example " \
>> params subject="Pacemaker Change" \
>> op start interval="0" timeout="10" \
>> op stop interval="0" timeout="10" \
>> op monitor interval="10" timeout="10"
>> primitive p_vm ocf:heartbeat:VirtualDomain \
>> params config="/vmstore/config/vm.xml" \
>> meta allow-migrate="false" \
>> op start interval="0" timeout="120s" \
>> op stop interval="0" timeout="120s" \
>> op monitor interval="10" timeout="30"
>> primitive stonith-node1 stonith:external/tripplitepdu \
>> params pdu_ipaddr="192.168.1.12" pdu_port="1" pdu_username="xxx"
>> pdu_password="xxx" hostname_to_stonith="node1"
>> primitive stonith-node2 stonith:external/tripplitepdu \
>> params pdu_ipaddr="192.168.1.12" pdu_port="2" pdu_username="xxx"
>> pdu_password="xxx" hostname_to_stonith="node2"
>> group g_daemons p_libvirt-bin
>> group g_vm p_fs_vmstore p_vm
>> ms ms_drbd_mount2 p_drbd_mount2 \
>> meta master-max="1" master-node-max="1" clone-max="2"
>> clone-node-max="1" notify="true"
>> ms ms_drbd_mount1 p_drbd_mount1 \
>> meta master-max="1" master-node-max="1" clone-max="2"
>> clone-node-max="1" notify="true"
>> ms ms_drbd_vmstore p_drbd_vmstore \
>> meta master-max="1" master-node-max="1" clone-max="2"
>> clone-node-max="1" notify="true"
>> clone cl_daemons g_daemons
>> clone cl_ping p_ping \
>> meta interleave="true"
>> clone cl_sysadmin_notify p_sysadmin_notify
>> location l-st-node1 stonith-node1 -inf: node1
>> location l-st-node2 stonith-node2 -inf: node2
>> location l_run_on_most_connected p_vm \
>> rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping
>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master
>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
>
> As Emmanuel already said, g_vm has to be in the first place in this
> collocation constraint .... g_vm must be colocated with the drbd masters.
>
>> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote
>> ms_drbd_mount2:promote cl_daemons:start g_vm:start
>> property $id="cib-bootstrap-options" \
>> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
>> cluster-infrastructure="Heartbeat" \
>> stonith-enabled="false" \
>> no-quorum-policy="stop" \
>> last-lrm-refresh="1332539900" \
>> cluster-recheck-interval="5m" \
>> crmd-integration-timeout="3m" \
>> shutdown-escalation="5m"
>>
>> The STONITH plugin is a custom plugin I wrote for the Tripp-Lite
>> PDUMH20ATNET that I'm using as the STONITH device:
>> http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf
>
> And why don't using it? .... stonith-enabled="false"
>
>>
>> As you can see, I left the DRBD service to be started by the operating
>> system (as an lsb script at boot time) however Pacemaker controls
>> actually bringing up/taking down the individual DRBD devices.
>
> Don't start drbd on system boot, give Pacemaker the full control.
>
> The
>> behavior I observe is as follows: I issue "crm resource migrate p_vm" on
>> node1 and failover successfully to node2. During this time, node2 fences
>> node1's DRBD devices (using dopd) and marks them as Outdated. Meanwhile
>> node2's DRBD devices are UpToDate. I then shutdown both nodes and then
>> bring them back up. They reconnect to the cluster (with quorum), and
>> node1's DRBD devices are still Outdated as expected and node2's DRBD
>> devices are still UpToDate, as expected. At this point, DRBD starts on
>> both nodes, however node2 will not set DRBD as master:
>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
>> Online: [ node2 node1 ]
>>
>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>> Slaves: [ node1 node2 ]
>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>> Slaves: [ node1 node 2 ]
>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>> Slaves: [ node1 node2 ]
>
> There should really be no interruption of the drbd replication on vm
> migration that activates the dopd ... drbd has its own direct network
> connection?
>
> Please share your ha.cf file and your drbd configuration. Watch out for
> drbd messages in your kernel log file, that should give you additional
> information when/why the drbd connection was lost.
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
>>
>> I am having trouble sorting through the logging information because
>> there is so much of it in /var/log/daemon.log, but I can't find an
>> error message printed about why it will not promote node2. At this point
>> the DRBD devices are as follows:
>> node2: cstate = WFConnection dstate=UpToDate
>> node1: cstate = StandAlone dstate=Outdated
>>
>> I don't see any reason why node2 can't become DRBD master, or am I
>> missing something? If I do "drbdadm connect all" on node1, then the
>> cstate on both nodes changes to "Connected" and node2 immediately
>> promotes the DRBD resources to master. Any ideas on why I'm observing
>> this incorrect behavior?
>>
>> Any tips on how I can better filter through the pacemaker/heartbeat logs
>> or how to get additional useful debug information?
>>
>> Thanks,
>>
>> Andrew
>>
>> ------------------------------------------------------------------------
>> *From: *"Andreas Kurz" < andreas [at] hastexo >
>> *To: *pacemaker [at] oss
>> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM
>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
>> master on failover
>>
>> On 01/25/2012 08:58 PM, Andrew Martin wrote:
>>> Hello,
>>>
>>> Recently I finished configuring a two-node cluster with pacemaker 1.1.6
>>> and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster includes
>>> the following resources:
>>> - primitives for DRBD storage devices
>>> - primitives for mounting the filesystem on the DRBD storage
>>> - primitives for some mount binds
>>> - primitive for starting apache
>>> - primitives for starting samba and nfs servers (following instructions
>>> here < http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf >)
>>> - primitives for exporting nfs shares (ocf:heartbeat:exportfs)
>>
>> not enough information ... please share at least your complete cluster
>> configuration
>>
>> Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>>
>>> Perhaps this is best described through the output of crm_mon:
>>> Online: [ node1 node2 ]
>>>
>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
>>> p_drbd_mount1:0 (ocf::linbit:drbd): Started node2
> (unmanaged)
>>> p_drbd_mount1:1 (ocf::linbit:drbd): Started node1
>>> (unmanaged) FAILED
>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>> p_drbd_mount2:0 (ocf::linbit:drbd): Master node1
>>> (unmanaged) FAILED
>>> Slaves: [ node2 ]
>>> Resource Group: g_core
>>> p_fs_mount1 (ocf::heartbeat:Filesystem): Started node1
>>> p_fs_mount2 (ocf::heartbeat:Filesystem): Started node1
>>> p_ip_nfs (ocf::heartbeat:IPaddr2): Started node1
>>> Resource Group: g_apache
>>> p_fs_mountbind1 (ocf::heartbeat:Filesystem): Started node1
>>> p_fs_mountbind2 (ocf::heartbeat:Filesystem): Started node1
>>> p_fs_mountbind3 (ocf::heartbeat:Filesystem): Started node1
>>> p_fs_varwww (ocf::heartbeat:Filesystem): Started node1
>>> p_apache (ocf::heartbeat:apache): Started node1
>>> Resource Group: g_fileservers
>>> p_lsb_smb (lsb:smbd): Started node1
>>> p_lsb_nmb (lsb:nmbd): Started node1
>>> p_lsb_nfsserver (lsb:nfs-kernel-server): Started node1
>>> p_exportfs_mount1 (ocf::heartbeat:exportfs): Started node1
>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): Started node1
>>>
>>> I have read through the Pacemaker Explained
>>>
>>
> < http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained >
>>> documentation, however could not find a way to further debug these
>>> problems. First, I put node1 into standby mode to attempt failover to
>>> the other node (node2). Node2 appeared to start the transition to
>>> master, however it failed to promote the DRBD resources to master (the
>>> first step). I have attached a copy of this session in commands.log and
>>> additional excerpts from /var/log/syslog during important steps. I have
>>> attempted everything I can think of to try and start the DRBD resource
>>> (e.g. start/stop/promote/manage/cleanup under crm resource, restarting
>>> heartbeat) but cannot bring it out of the slave state. However, if I set
>>> it to unmanaged and then run drbdadm primary all in the terminal,
>>> pacemaker is satisfied and continues starting the rest of the resources.
>>> It then failed when attempting to mount the filesystem for mount2, the
>>> p_fs_mount2 resource. I attempted to mount the filesystem myself and was
>>> successful. I then unmounted it and ran cleanup on p_fs_mount2 and then
>>> it mounted. The rest of the resources started as expected until the
>>> p_exportfs_mount2 resource, which failed as follows:
>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2
>>> (unmanaged) FAILED
>>>
>>> I ran cleanup on this and it started, however when running this test
>>> earlier today no command could successfully start this exportfs resource.
>>>
>>> How can I configure pacemaker to better resolve these problems and be
>>> able to bring the node up successfully on its own? What can I check to
>>> determine why these failures are occuring? /var/log/syslog did not seem
>>> to contain very much useful information regarding why the failures
>> occurred.
>>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>>
>>>
>>>
>>> This body part will be downloaded on demand.
>>
>>
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker [at] oss
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker [at] oss
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


andreas at hastexo

Mar 28, 2012, 7:03 AM

Post #12 of 16 (1677 views)
Permalink
Re: Nodes will not promote DRBD resources to master on failover [In reply to]

On 03/28/2012 03:47 PM, Andrew Martin wrote:
> Hi Andreas,
>
>> hmm ... what is that fence-peer script doing? If you want to use
>> resource-level fencing with the help of dopd, activate the
>> drbd-peer-outdater script in the line above ... and double check if the
>> path is correct
> fence-peer is just a wrapper for drbd-peer-outdater that does some
> additional logging. In my testing dopd has been working well.

I see

>
>>> I am thinking of making the following changes to the CIB (as per the
>>> official DRBD
>>> guide
> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html) in
>>> order to add the DRBD lsb service and require that it start before the
>>> ocf:linbit:drbd resources. Does this look correct?
>>
>> Where did you read that? No, deactivate the startup of DRBD on system
>> boot and let Pacemaker manage it completely.
>>
>>> primitive p_drbd-init lsb:drbd op monitor interval="30"
>>> colocation c_drbd_together inf:
>>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master
>>> ms_drbd_mount2:Master
>>> order drbd_init_first inf: ms_drbd_vmstore:promote
>>> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start
>>>
>>> This doesn't seem to require that drbd be also running on the node where
>>> the ocf:linbit:drbd resources are slave (which it would need to do to be
>>> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere?
>>> (clone cl_drbd p_drbd-init ?)
>>
>> This is really not needed.
> I was following the official DRBD Users Guide:
> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html
>
> If I am understanding your previous message correctly, I do not need to
> add a lsb primitive for the drbd daemon? It will be
> started/stopped/managed automatically by my ocf:linbit:drbd resources
> (and I can remove the /etc/rc* symlinks)?

Yes, you don't need that LSB script when using Pacemaker and should not
let init start it.

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

>
> Thanks,
>
> Andrew
>
> ------------------------------------------------------------------------
> *From: *"Andreas Kurz" <andreas [at] hastexo <mailto:andreas [at] hastexo>>
> *To: *pacemaker [at] oss <mailto:pacemaker [at] oss>
> *Sent: *Wednesday, March 28, 2012 7:27:34 AM
> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
> master on failover
>
> On 03/28/2012 12:13 AM, Andrew Martin wrote:
>> Hi Andreas,
>>
>> Thanks, I've updated the colocation rule to be in the correct order. I
>> also enabled the STONITH resource (this was temporarily disabled before
>> for some additional testing). DRBD has its own network connection over
>> the br1 interface (192.168.5.0/24 network), a direct crossover cable
>> between node1 and node2:
>> global { usage-count no; }
>> common {
>> syncer { rate 110M; }
>> }
>> resource vmstore {
>> protocol C;
>> startup {
>> wfc-timeout 15;
>> degr-wfc-timeout 60;
>> }
>> handlers {
>> #fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
>> fence-peer "/usr/local/bin/fence-peer";
>
> hmm ... what is that fence-peer script doing? If you want to use
> resource-level fencing with the help of dopd, activate the
> drbd-peer-outdater script in the line above ... and double check if the
> path is correct
>
>> split-brain "/usr/lib/drbd/notify-split-brain.sh
>> me [at] example <mailto:me [at] example>";
>> }
>> net {
>> after-sb-0pri discard-zero-changes;
>> after-sb-1pri discard-secondary;
>> after-sb-2pri disconnect;
>> cram-hmac-alg md5;
>> shared-secret "xxxxx";
>> }
>> disk {
>> fencing resource-only;
>> }
>> on node1 {
>> device /dev/drbd0;
>> disk /dev/sdb1;
>> address 192.168.5.10:7787;
>> meta-disk internal;
>> }
>> on node2 {
>> device /dev/drbd0;
>> disk /dev/sdf1;
>> address 192.168.5.11:7787;
>> meta-disk internal;
>> }
>> }
>> # and similar for mount1 and mount2
>>
>> Also, here is my ha.cf. It uses both the direct link between the nodes
>> (br1) and the shared LAN network on br0 for communicating:
>> autojoin none
>> mcast br0 239.0.0.43 694 1 0
>> bcast br1
>> warntime 5
>> deadtime 15
>> initdead 60
>> keepalive 2
>> node node1
>> node node2
>> node quorumnode
>> crm respawn
>> respawn hacluster /usr/lib/heartbeat/dopd
>> apiauth dopd gid=haclient uid=hacluster
>>
>> I am thinking of making the following changes to the CIB (as per the
>> official DRBD
>> guide
> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html) in
>> order to add the DRBD lsb service and require that it start before the
>> ocf:linbit:drbd resources. Does this look correct?
>
> Where did you read that? No, deactivate the startup of DRBD on system
> boot and let Pacemaker manage it completely.
>
>> primitive p_drbd-init lsb:drbd op monitor interval="30"
>> colocation c_drbd_together inf:
>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master
>> ms_drbd_mount2:Master
>> order drbd_init_first inf: ms_drbd_vmstore:promote
>> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start
>>
>> This doesn't seem to require that drbd be also running on the node where
>> the ocf:linbit:drbd resources are slave (which it would need to do to be
>> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere?
>> (clone cl_drbd p_drbd-init ?)
>
> This is really not needed.
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
>>
>> Thanks,
>>
>> Andrew
>> ------------------------------------------------------------------------
>> *From: *"Andreas Kurz" <andreas [at] hastexo <mailto:andreas [at] hastexo>>
>> *To: *pacemaker [at] oss <mailto:*pacemaker [at] oss>
>> *Sent: *Monday, March 26, 2012 5:56:22 PM
>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
>> master on failover
>>
>> On 03/24/2012 08:15 PM, Andrew Martin wrote:
>>> Hi Andreas,
>>>
>>> My complete cluster configuration is as follows:
>>> ============
>>> Last updated: Sat Mar 24 13:51:55 2012
>>> Last change: Sat Mar 24 13:41:55 2012
>>> Stack: Heartbeat
>>> Current DC: node2 (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition
>>> with quorum
>>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
>>> 3 Nodes configured, unknown expected votes
>>> 19 Resources configured.
>>> ============
>>>
>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
>>> Online: [ node2 node1 ]
>>>
>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>>> Masters: [ node2 ]
>>> Slaves: [ node1 ]
>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>>> Masters: [ node2 ]
>>> Slaves: [ node1 ]
>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>> Masters: [ node2 ]
>>> Slaves: [ node1 ]
>>> Resource Group: g_vm
>>> p_fs_vmstore(ocf::heartbeat:Filesystem):Started node2
>>> p_vm(ocf::heartbeat:VirtualDomain):Started node2
>>> Clone Set: cl_daemons [g_daemons]
>>> Started: [ node2 node1 ]
>>> Stopped: [ g_daemons:2 ]
>>> Clone Set: cl_sysadmin_notify [p_sysadmin_notify]
>>> Started: [ node2 node1 ]
>>> Stopped: [ p_sysadmin_notify:2 ]
>>> stonith-node1(stonith:external/tripplitepdu):Started node2
>>> stonith-node2(stonith:external/tripplitepdu):Started node1
>>> Clone Set: cl_ping [p_ping]
>>> Started: [ node2 node1 ]
>>> Stopped: [ p_ping:2 ]
>>>
>>> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \
>>> attributes standby="off"
>>> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \
>>> attributes standby="off"
>>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \
>>> attributes standby="on"
>>> primitive p_drbd_mount2 ocf:linbit:drbd \
>>> params drbd_resource="mount2" \
>>> op monitor interval="15" role="Master" \
>>> op monitor interval="30" role="Slave"
>>> primitive p_drbd_mount1 ocf:linbit:drbd \
>>> params drbd_resource="mount1" \
>>> op monitor interval="15" role="Master" \
>>> op monitor interval="30" role="Slave"
>>> primitive p_drbd_vmstore ocf:linbit:drbd \
>>> params drbd_resource="vmstore" \
>>> op monitor interval="15" role="Master" \
>>> op monitor interval="30" role="Slave"
>>> primitive p_fs_vmstore ocf:heartbeat:Filesystem \
>>> params device="/dev/drbd0" directory="/vmstore" fstype="ext4" \
>>> op start interval="0" timeout="60s" \
>>> op stop interval="0" timeout="60s" \
>>> op monitor interval="20s" timeout="40s"
>>> primitive p_libvirt-bin upstart:libvirt-bin \
>>> op monitor interval="30"
>>> primitive p_ping ocf:pacemaker:ping \
>>> params name="p_ping" host_list="192.168.1.10 192.168.1.11"
>>> multiplier="1000" \
>>> op monitor interval="20s"
>>> primitive p_sysadmin_notify ocf:heartbeat:MailTo \
>>> params email="me [at] example <mailto:me [at] example>" \
>>> params subject="Pacemaker Change" \
>>> op start interval="0" timeout="10" \
>>> op stop interval="0" timeout="10" \
>>> op monitor interval="10" timeout="10"
>>> primitive p_vm ocf:heartbeat:VirtualDomain \
>>> params config="/vmstore/config/vm.xml" \
>>> meta allow-migrate="false" \
>>> op start interval="0" timeout="120s" \
>>> op stop interval="0" timeout="120s" \
>>> op monitor interval="10" timeout="30"
>>> primitive stonith-node1 stonith:external/tripplitepdu \
>>> params pdu_ipaddr="192.168.1.12" pdu_port="1" pdu_username="xxx"
>>> pdu_password="xxx" hostname_to_stonith="node1"
>>> primitive stonith-node2 stonith:external/tripplitepdu \
>>> params pdu_ipaddr="192.168.1.12" pdu_port="2" pdu_username="xxx"
>>> pdu_password="xxx" hostname_to_stonith="node2"
>>> group g_daemons p_libvirt-bin
>>> group g_vm p_fs_vmstore p_vm
>>> ms ms_drbd_mount2 p_drbd_mount2 \
>>> meta master-max="1" master-node-max="1" clone-max="2"
>>> clone-node-max="1" notify="true"
>>> ms ms_drbd_mount1 p_drbd_mount1 \
>>> meta master-max="1" master-node-max="1" clone-max="2"
>>> clone-node-max="1" notify="true"
>>> ms ms_drbd_vmstore p_drbd_vmstore \
>>> meta master-max="1" master-node-max="1" clone-max="2"
>>> clone-node-max="1" notify="true"
>>> clone cl_daemons g_daemons
>>> clone cl_ping p_ping \
>>> meta interleave="true"
>>> clone cl_sysadmin_notify p_sysadmin_notify
>>> location l-st-node1 stonith-node1 -inf: node1
>>> location l-st-node2 stonith-node2 -inf: node2
>>> location l_run_on_most_connected p_vm \
>>> rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping
>>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master
>>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
>>
>> As Emmanuel already said, g_vm has to be in the first place in this
>> collocation constraint .... g_vm must be colocated with the drbd masters.
>>
>>> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote
>>> ms_drbd_mount2:promote cl_daemons:start g_vm:start
>>> property $id="cib-bootstrap-options" \
>>> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
>>> cluster-infrastructure="Heartbeat" \
>>> stonith-enabled="false" \
>>> no-quorum-policy="stop" \
>>> last-lrm-refresh="1332539900" \
>>> cluster-recheck-interval="5m" \
>>> crmd-integration-timeout="3m" \
>>> shutdown-escalation="5m"
>>>
>>> The STONITH plugin is a custom plugin I wrote for the Tripp-Lite
>>> PDUMH20ATNET that I'm using as the STONITH device:
>>> http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf
>>
>> And why don't using it? .... stonith-enabled="false"
>>
>>>
>>> As you can see, I left the DRBD service to be started by the operating
>>> system (as an lsb script at boot time) however Pacemaker controls
>>> actually bringing up/taking down the individual DRBD devices.
>>
>> Don't start drbd on system boot, give Pacemaker the full control.
>>
>> The
>>> behavior I observe is as follows: I issue "crm resource migrate p_vm" on
>>> node1 and failover successfully to node2. During this time, node2 fences
>>> node1's DRBD devices (using dopd) and marks them as Outdated. Meanwhile
>>> node2's DRBD devices are UpToDate. I then shutdown both nodes and then
>>> bring them back up. They reconnect to the cluster (with quorum), and
>>> node1's DRBD devices are still Outdated as expected and node2's DRBD
>>> devices are still UpToDate, as expected. At this point, DRBD starts on
>>> both nodes, however node2 will not set DRBD as master:
>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
>>> Online: [ node2 node1 ]
>>>
>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>>> Slaves: [ node1 node2 ]
>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>>> Slaves: [ node1 node 2 ]
>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>> Slaves: [ node1 node2 ]
>>
>> There should really be no interruption of the drbd replication on vm
>> migration that activates the dopd ... drbd has its own direct network
>> connection?
>>
>> Please share your ha.cf file and your drbd configuration. Watch out for
>> drbd messages in your kernel log file, that should give you additional
>> information when/why the drbd connection was lost.
>>
>> Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>>
>>> I am having trouble sorting through the logging information because
>>> there is so much of it in /var/log/daemon.log, but I can't find an
>>> error message printed about why it will not promote node2. At this point
>>> the DRBD devices are as follows:
>>> node2: cstate = WFConnection dstate=UpToDate
>>> node1: cstate = StandAlone dstate=Outdated
>>>
>>> I don't see any reason why node2 can't become DRBD master, or am I
>>> missing something? If I do "drbdadm connect all" on node1, then the
>>> cstate on both nodes changes to "Connected" and node2 immediately
>>> promotes the DRBD resources to master. Any ideas on why I'm observing
>>> this incorrect behavior?
>>>
>>> Any tips on how I can better filter through the pacemaker/heartbeat logs
>>> or how to get additional useful debug information?
>>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>> ------------------------------------------------------------------------
>>> *From: *"Andreas Kurz" <andreas [at] hastexo <mailto:andreas [at] hastexo>>
>>> *To: *pacemaker [at] oss
> <mailto:*pacemaker [at] oss>
>>> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM
>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
>>> master on failover
>>>
>>> On 01/25/2012 08:58 PM, Andrew Martin wrote:
>>>> Hello,
>>>>
>>>> Recently I finished configuring a two-node cluster with pacemaker 1.1.6
>>>> and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster includes
>>>> the following resources:
>>>> - primitives for DRBD storage devices
>>>> - primitives for mounting the filesystem on the DRBD storage
>>>> - primitives for some mount binds
>>>> - primitive for starting apache
>>>> - primitives for starting samba and nfs servers (following instructions
>>>> here <http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf>)
>>>> - primitives for exporting nfs shares (ocf:heartbeat:exportfs)
>>>
>>> not enough information ... please share at least your complete cluster
>>> configuration
>>>
>>> Regards,
>>> Andreas
>>>
>>> --
>>> Need help with Pacemaker?
>>> http://www.hastexo.com/now
>>>
>>>>
>>>> Perhaps this is best described through the output of crm_mon:
>>>> Online: [ node1 node2 ]
>>>>
>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
>>>> p_drbd_mount1:0 (ocf::linbit:drbd): Started node2
>> (unmanaged)
>>>> p_drbd_mount1:1 (ocf::linbit:drbd): Started node1
>>>> (unmanaged) FAILED
>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>>> p_drbd_mount2:0 (ocf::linbit:drbd): Master node1
>>>> (unmanaged) FAILED
>>>> Slaves: [ node2 ]
>>>> Resource Group: g_core
>>>> p_fs_mount1 (ocf::heartbeat:Filesystem): Started node1
>>>> p_fs_mount2 (ocf::heartbeat:Filesystem): Started node1
>>>> p_ip_nfs (ocf::heartbeat:IPaddr2): Started node1
>>>> Resource Group: g_apache
>>>> p_fs_mountbind1 (ocf::heartbeat:Filesystem): Started node1
>>>> p_fs_mountbind2 (ocf::heartbeat:Filesystem): Started node1
>>>> p_fs_mountbind3 (ocf::heartbeat:Filesystem): Started node1
>>>> p_fs_varwww (ocf::heartbeat:Filesystem): Started node1
>>>> p_apache (ocf::heartbeat:apache): Started node1
>>>> Resource Group: g_fileservers
>>>> p_lsb_smb (lsb:smbd): Started node1
>>>> p_lsb_nmb (lsb:nmbd): Started node1
>>>> p_lsb_nfsserver (lsb:nfs-kernel-server): Started node1
>>>> p_exportfs_mount1 (ocf::heartbeat:exportfs): Started node1
>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): Started node1
>>>>
>>>> I have read through the Pacemaker Explained
>>>>
>>>
>> <http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained>
>>>> documentation, however could not find a way to further debug these
>>>> problems. First, I put node1 into standby mode to attempt failover to
>>>> the other node (node2). Node2 appeared to start the transition to
>>>> master, however it failed to promote the DRBD resources to master (the
>>>> first step). I have attached a copy of this session in commands.log and
>>>> additional excerpts from /var/log/syslog during important steps. I have
>>>> attempted everything I can think of to try and start the DRBD resource
>>>> (e.g. start/stop/promote/manage/cleanup under crm resource, restarting
>>>> heartbeat) but cannot bring it out of the slave state. However, if I set
>>>> it to unmanaged and then run drbdadm primary all in the terminal,
>>>> pacemaker is satisfied and continues starting the rest of the resources.
>>>> It then failed when attempting to mount the filesystem for mount2, the
>>>> p_fs_mount2 resource. I attempted to mount the filesystem myself and was
>>>> successful. I then unmounted it and ran cleanup on p_fs_mount2 and then
>>>> it mounted. The rest of the resources started as expected until the
>>>> p_exportfs_mount2 resource, which failed as follows:
>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2
>>>> (unmanaged) FAILED
>>>>
>>>> I ran cleanup on this and it started, however when running this test
>>>> earlier today no command could successfully start this exportfs
> resource.
>>>>
>>>> How can I configure pacemaker to better resolve these problems and be
>>>> able to bring the node up successfully on its own? What can I check to
>>>> determine why these failures are occuring? /var/log/syslog did not seem
>>>> to contain very much useful information regarding why the failures
>>> occurred.
>>>>
>>>> Thanks,
>>>>
>>>> Andrew
>>>>
>>>>
>>>>
>>>>
>>>> This body part will be downloaded on demand.
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker [at] oss
> <mailto:Pacemaker [at] oss>
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>>>
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker [at] oss
> <mailto:Pacemaker [at] oss>
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker [at] oss
> <mailto:Pacemaker [at] oss>
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker [at] oss
> <mailto:Pacemaker [at] oss>
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> <mailto:Pacemaker [at] oss>
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
Attachments: signature.asc (0.22 KB)


amartin at xes-inc

Mar 28, 2012, 7:56 AM

Post #13 of 16 (1672 views)
Permalink
Re: Nodes will not promote DRBD resources to master on failover [In reply to]

Hi Andreas,


I disabled the DRBD init script and then restarted the slave node (node2). After it came back up, DRBD did not start:

Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): pending
Online: [ node2 node1 ]


Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
Masters: [ node1 ]
Stopped: [ p_drbd_vmstore:1 ]
Master/Slave Set: ms_drbd_mount1 [p_drbd_tools]
Masters: [ node1 ]
Stopped: [ p_drbd_mount1:1 ]
Master/Slave Set: ms_drbd_mount2 [p_drbdmount2]
Masters: [ node1 ]
Stopped: [ p_drbd_mount2:1 ]
...



root [at] node:~# service drbd status
drbd not loaded


Is there something else I need to change in the CIB to ensure that DRBD is started? All of my DRBD devices are configured like this:

primitive p_drbd_mount2 ocf:linbit:drbd \
params drbd_resource="mount2" \
op monitor interval="15" role="Master" \
op monitor interval="30" role="Slave"

ms ms_drbd_mount2 p_drbd_mount2 \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"


Here is the output from the syslog (grep -i drbd /var/log/syslog):

Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing key=12:315:7:24416169-73ba-469b-a2e3-56a22b437cbc op=p_drbd_vmstore:1_monitor_0 )
Mar 28 09:24:47 node2 lrmd: [3210]: info: rsc:p_drbd_vmstore:1 probe[2] (pid 3455)
Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing key=13:315:7:24416169-73ba-469b-a2e3-56a22b437cbc op=p_drbd_mount1:1_monitor_0 )
Mar 28 09:24:48 node2 lrmd: [3210]: info: rsc:p_drbd_mount1:1 probe[3] (pid 3456)
Mar 28 09:24:48 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing key=14:315:7:24416169-73ba-469b-a2e3-56a22b437cbc op=p_drbd_mount2:1_monitor_0 )
Mar 28 09:24:48 node2 lrmd: [3210]: info: rsc:p_drbd_mount2:1 probe[4] (pid 3457)
Mar 28 09:24:48 node2 Filesystem[3458]: [3517]: WARNING: Couldn't find device [/dev/drbd0]. Expected /dev/??? to exist
Mar 28 09:24:48 node2 crm_attribute: [3563]: info: Invoked: crm_attribute -N node2 -n master-p_drbd_mount2:1 -l reboot -D
Mar 28 09:24:48 node2 crm_attribute: [3557]: info: Invoked: crm_attribute -N node2 -n master-p_drbd_vmstore:1 -l reboot -D
Mar 28 09:24:48 node2 crm_attribute: [3562]: info: Invoked: crm_attribute -N node2 -n master-p_drbd_mount1:1 -l reboot -D
Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[4] on p_drbd_mount2:1 for client 3213: pid 3457 exited with return code 7
Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[2] on p_drbd_vmstore:1 for client 3213: pid 3455 exited with return code 7
Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM operation p_drbd_mount2:1_monitor_0 (call=4, rc=7, cib-update=10, confirmed=true) not running
Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[3] on p_drbd_mount1:1 for client 3213: pid 3456 exited with return code 7
Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM operation p_drbd_vmstore:1_monitor_0 (call=2, rc=7, cib-update=11, confirmed=true) not running
Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM operation p_drbd_mount1:1_monitor_0 (call=3, rc=7, cib-update=12, confirmed=true) not running


Thanks,


Andrew
----- Original Message -----

From: "Andreas Kurz" <andreas [at] hastexo>
To: pacemaker [at] oss
Sent: Wednesday, March 28, 2012 9:03:06 AM
Subject: Re: [Pacemaker] Nodes will not promote DRBD resources to master on failover

On 03/28/2012 03:47 PM, Andrew Martin wrote:
> Hi Andreas,
>
>> hmm ... what is that fence-peer script doing? If you want to use
>> resource-level fencing with the help of dopd, activate the
>> drbd-peer-outdater script in the line above ... and double check if the
>> path is correct
> fence-peer is just a wrapper for drbd-peer-outdater that does some
> additional logging. In my testing dopd has been working well.

I see

>
>>> I am thinking of making the following changes to the CIB (as per the
>>> official DRBD
>>> guide
> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html) in
>>> order to add the DRBD lsb service and require that it start before the
>>> ocf:linbit:drbd resources. Does this look correct?
>>
>> Where did you read that? No, deactivate the startup of DRBD on system
>> boot and let Pacemaker manage it completely.
>>
>>> primitive p_drbd-init lsb:drbd op monitor interval="30"
>>> colocation c_drbd_together inf:
>>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master
>>> ms_drbd_mount2:Master
>>> order drbd_init_first inf: ms_drbd_vmstore:promote
>>> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start
>>>
>>> This doesn't seem to require that drbd be also running on the node where
>>> the ocf:linbit:drbd resources are slave (which it would need to do to be
>>> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere?
>>> (clone cl_drbd p_drbd-init ?)
>>
>> This is really not needed.
> I was following the official DRBD Users Guide:
> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html
>
> If I am understanding your previous message correctly, I do not need to
> add a lsb primitive for the drbd daemon? It will be
> started/stopped/managed automatically by my ocf:linbit:drbd resources
> (and I can remove the /etc/rc* symlinks)?

Yes, you don't need that LSB script when using Pacemaker and should not
let init start it.

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

>
> Thanks,
>
> Andrew
>
> ------------------------------------------------------------------------
> *From: *"Andreas Kurz" <andreas [at] hastexo <mailto:andreas [at] hastexo>>
> *To: *pacemaker [at] oss <mailto:pacemaker [at] oss>
> *Sent: *Wednesday, March 28, 2012 7:27:34 AM
> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
> master on failover
>
> On 03/28/2012 12:13 AM, Andrew Martin wrote:
>> Hi Andreas,
>>
>> Thanks, I've updated the colocation rule to be in the correct order. I
>> also enabled the STONITH resource (this was temporarily disabled before
>> for some additional testing). DRBD has its own network connection over
>> the br1 interface (192.168.5.0/24 network), a direct crossover cable
>> between node1 and node2:
>> global { usage-count no; }
>> common {
>> syncer { rate 110M; }
>> }
>> resource vmstore {
>> protocol C;
>> startup {
>> wfc-timeout 15;
>> degr-wfc-timeout 60;
>> }
>> handlers {
>> #fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
>> fence-peer "/usr/local/bin/fence-peer";
>
> hmm ... what is that fence-peer script doing? If you want to use
> resource-level fencing with the help of dopd, activate the
> drbd-peer-outdater script in the line above ... and double check if the
> path is correct
>
>> split-brain "/usr/lib/drbd/notify-split-brain.sh
>> me [at] example <mailto:me [at] example>";
>> }
>> net {
>> after-sb-0pri discard-zero-changes;
>> after-sb-1pri discard-secondary;
>> after-sb-2pri disconnect;
>> cram-hmac-alg md5;
>> shared-secret "xxxxx";
>> }
>> disk {
>> fencing resource-only;
>> }
>> on node1 {
>> device /dev/drbd0;
>> disk /dev/sdb1;
>> address 192.168.5.10:7787;
>> meta-disk internal;
>> }
>> on node2 {
>> device /dev/drbd0;
>> disk /dev/sdf1;
>> address 192.168.5.11:7787;
>> meta-disk internal;
>> }
>> }
>> # and similar for mount1 and mount2
>>
>> Also, here is my ha.cf. It uses both the direct link between the nodes
>> (br1) and the shared LAN network on br0 for communicating:
>> autojoin none
>> mcast br0 239.0.0.43 694 1 0
>> bcast br1
>> warntime 5
>> deadtime 15
>> initdead 60
>> keepalive 2
>> node node1
>> node node2
>> node quorumnode
>> crm respawn
>> respawn hacluster /usr/lib/heartbeat/dopd
>> apiauth dopd gid=haclient uid=hacluster
>>
>> I am thinking of making the following changes to the CIB (as per the
>> official DRBD
>> guide
> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html) in
>> order to add the DRBD lsb service and require that it start before the
>> ocf:linbit:drbd resources. Does this look correct?
>
> Where did you read that? No, deactivate the startup of DRBD on system
> boot and let Pacemaker manage it completely.
>
>> primitive p_drbd-init lsb:drbd op monitor interval="30"
>> colocation c_drbd_together inf:
>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master
>> ms_drbd_mount2:Master
>> order drbd_init_first inf: ms_drbd_vmstore:promote
>> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start
>>
>> This doesn't seem to require that drbd be also running on the node where
>> the ocf:linbit:drbd resources are slave (which it would need to do to be
>> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere?
>> (clone cl_drbd p_drbd-init ?)
>
> This is really not needed.
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
>>
>> Thanks,
>>
>> Andrew
>> ------------------------------------------------------------------------
>> *From: *"Andreas Kurz" <andreas [at] hastexo <mailto:andreas [at] hastexo>>
>> *To: *pacemaker [at] oss <mailto:*pacemaker [at] oss>
>> *Sent: *Monday, March 26, 2012 5:56:22 PM
>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
>> master on failover
>>
>> On 03/24/2012 08:15 PM, Andrew Martin wrote:
>>> Hi Andreas,
>>>
>>> My complete cluster configuration is as follows:
>>> ============
>>> Last updated: Sat Mar 24 13:51:55 2012
>>> Last change: Sat Mar 24 13:41:55 2012
>>> Stack: Heartbeat
>>> Current DC: node2 (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition
>>> with quorum
>>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
>>> 3 Nodes configured, unknown expected votes
>>> 19 Resources configured.
>>> ============
>>>
>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
>>> Online: [ node2 node1 ]
>>>
>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>>> Masters: [ node2 ]
>>> Slaves: [ node1 ]
>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>>> Masters: [ node2 ]
>>> Slaves: [ node1 ]
>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>> Masters: [ node2 ]
>>> Slaves: [ node1 ]
>>> Resource Group: g_vm
>>> p_fs_vmstore(ocf::heartbeat:Filesystem):Started node2
>>> p_vm(ocf::heartbeat:VirtualDomain):Started node2
>>> Clone Set: cl_daemons [g_daemons]
>>> Started: [ node2 node1 ]
>>> Stopped: [ g_daemons:2 ]
>>> Clone Set: cl_sysadmin_notify [p_sysadmin_notify]
>>> Started: [ node2 node1 ]
>>> Stopped: [ p_sysadmin_notify:2 ]
>>> stonith-node1(stonith:external/tripplitepdu):Started node2
>>> stonith-node2(stonith:external/tripplitepdu):Started node1
>>> Clone Set: cl_ping [p_ping]
>>> Started: [ node2 node1 ]
>>> Stopped: [ p_ping:2 ]
>>>
>>> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \
>>> attributes standby="off"
>>> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \
>>> attributes standby="off"
>>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \
>>> attributes standby="on"
>>> primitive p_drbd_mount2 ocf:linbit:drbd \
>>> params drbd_resource="mount2" \
>>> op monitor interval="15" role="Master" \
>>> op monitor interval="30" role="Slave"
>>> primitive p_drbd_mount1 ocf:linbit:drbd \
>>> params drbd_resource="mount1" \
>>> op monitor interval="15" role="Master" \
>>> op monitor interval="30" role="Slave"
>>> primitive p_drbd_vmstore ocf:linbit:drbd \
>>> params drbd_resource="vmstore" \
>>> op monitor interval="15" role="Master" \
>>> op monitor interval="30" role="Slave"
>>> primitive p_fs_vmstore ocf:heartbeat:Filesystem \
>>> params device="/dev/drbd0" directory="/vmstore" fstype="ext4" \
>>> op start interval="0" timeout="60s" \
>>> op stop interval="0" timeout="60s" \
>>> op monitor interval="20s" timeout="40s"
>>> primitive p_libvirt-bin upstart:libvirt-bin \
>>> op monitor interval="30"
>>> primitive p_ping ocf:pacemaker:ping \
>>> params name="p_ping" host_list="192.168.1.10 192.168.1.11"
>>> multiplier="1000" \
>>> op monitor interval="20s"
>>> primitive p_sysadmin_notify ocf:heartbeat:MailTo \
>>> params email="me [at] example <mailto:me [at] example>" \
>>> params subject="Pacemaker Change" \
>>> op start interval="0" timeout="10" \
>>> op stop interval="0" timeout="10" \
>>> op monitor interval="10" timeout="10"
>>> primitive p_vm ocf:heartbeat:VirtualDomain \
>>> params config="/vmstore/config/vm.xml" \
>>> meta allow-migrate="false" \
>>> op start interval="0" timeout="120s" \
>>> op stop interval="0" timeout="120s" \
>>> op monitor interval="10" timeout="30"
>>> primitive stonith-node1 stonith:external/tripplitepdu \
>>> params pdu_ipaddr="192.168.1.12" pdu_port="1" pdu_username="xxx"
>>> pdu_password="xxx" hostname_to_stonith="node1"
>>> primitive stonith-node2 stonith:external/tripplitepdu \
>>> params pdu_ipaddr="192.168.1.12" pdu_port="2" pdu_username="xxx"
>>> pdu_password="xxx" hostname_to_stonith="node2"
>>> group g_daemons p_libvirt-bin
>>> group g_vm p_fs_vmstore p_vm
>>> ms ms_drbd_mount2 p_drbd_mount2 \
>>> meta master-max="1" master-node-max="1" clone-max="2"
>>> clone-node-max="1" notify="true"
>>> ms ms_drbd_mount1 p_drbd_mount1 \
>>> meta master-max="1" master-node-max="1" clone-max="2"
>>> clone-node-max="1" notify="true"
>>> ms ms_drbd_vmstore p_drbd_vmstore \
>>> meta master-max="1" master-node-max="1" clone-max="2"
>>> clone-node-max="1" notify="true"
>>> clone cl_daemons g_daemons
>>> clone cl_ping p_ping \
>>> meta interleave="true"
>>> clone cl_sysadmin_notify p_sysadmin_notify
>>> location l-st-node1 stonith-node1 -inf: node1
>>> location l-st-node2 stonith-node2 -inf: node2
>>> location l_run_on_most_connected p_vm \
>>> rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping
>>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master
>>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
>>
>> As Emmanuel already said, g_vm has to be in the first place in this
>> collocation constraint .... g_vm must be colocated with the drbd masters.
>>
>>> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote
>>> ms_drbd_mount2:promote cl_daemons:start g_vm:start
>>> property $id="cib-bootstrap-options" \
>>> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
>>> cluster-infrastructure="Heartbeat" \
>>> stonith-enabled="false" \
>>> no-quorum-policy="stop" \
>>> last-lrm-refresh="1332539900" \
>>> cluster-recheck-interval="5m" \
>>> crmd-integration-timeout="3m" \
>>> shutdown-escalation="5m"
>>>
>>> The STONITH plugin is a custom plugin I wrote for the Tripp-Lite
>>> PDUMH20ATNET that I'm using as the STONITH device:
>>> http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf
>>
>> And why don't using it? .... stonith-enabled="false"
>>
>>>
>>> As you can see, I left the DRBD service to be started by the operating
>>> system (as an lsb script at boot time) however Pacemaker controls
>>> actually bringing up/taking down the individual DRBD devices.
>>
>> Don't start drbd on system boot, give Pacemaker the full control.
>>
>> The
>>> behavior I observe is as follows: I issue "crm resource migrate p_vm" on
>>> node1 and failover successfully to node2. During this time, node2 fences
>>> node1's DRBD devices (using dopd) and marks them as Outdated. Meanwhile
>>> node2's DRBD devices are UpToDate. I then shutdown both nodes and then
>>> bring them back up. They reconnect to the cluster (with quorum), and
>>> node1's DRBD devices are still Outdated as expected and node2's DRBD
>>> devices are still UpToDate, as expected. At this point, DRBD starts on
>>> both nodes, however node2 will not set DRBD as master:
>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
>>> Online: [ node2 node1 ]
>>>
>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>>> Slaves: [ node1 node2 ]
>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>>> Slaves: [ node1 node 2 ]
>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>> Slaves: [ node1 node2 ]
>>
>> There should really be no interruption of the drbd replication on vm
>> migration that activates the dopd ... drbd has its own direct network
>> connection?
>>
>> Please share your ha.cf file and your drbd configuration. Watch out for
>> drbd messages in your kernel log file, that should give you additional
>> information when/why the drbd connection was lost.
>>
>> Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>>
>>> I am having trouble sorting through the logging information because
>>> there is so much of it in /var/log/daemon.log, but I can't find an
>>> error message printed about why it will not promote node2. At this point
>>> the DRBD devices are as follows:
>>> node2: cstate = WFConnection dstate=UpToDate
>>> node1: cstate = StandAlone dstate=Outdated
>>>
>>> I don't see any reason why node2 can't become DRBD master, or am I
>>> missing something? If I do "drbdadm connect all" on node1, then the
>>> cstate on both nodes changes to "Connected" and node2 immediately
>>> promotes the DRBD resources to master. Any ideas on why I'm observing
>>> this incorrect behavior?
>>>
>>> Any tips on how I can better filter through the pacemaker/heartbeat logs
>>> or how to get additional useful debug information?
>>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>> ------------------------------------------------------------------------
>>> *From: *"Andreas Kurz" <andreas [at] hastexo <mailto:andreas [at] hastexo>>
>>> *To: *pacemaker [at] oss
> <mailto:*pacemaker [at] oss>
>>> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM
>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
>>> master on failover
>>>
>>> On 01/25/2012 08:58 PM, Andrew Martin wrote:
>>>> Hello,
>>>>
>>>> Recently I finished configuring a two-node cluster with pacemaker 1.1.6
>>>> and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster includes
>>>> the following resources:
>>>> - primitives for DRBD storage devices
>>>> - primitives for mounting the filesystem on the DRBD storage
>>>> - primitives for some mount binds
>>>> - primitive for starting apache
>>>> - primitives for starting samba and nfs servers (following instructions
>>>> here <http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf>)
>>>> - primitives for exporting nfs shares (ocf:heartbeat:exportfs)
>>>
>>> not enough information ... please share at least your complete cluster
>>> configuration
>>>
>>> Regards,
>>> Andreas
>>>
>>> --
>>> Need help with Pacemaker?
>>> http://www.hastexo.com/now
>>>
>>>>
>>>> Perhaps this is best described through the output of crm_mon:
>>>> Online: [ node1 node2 ]
>>>>
>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
>>>> p_drbd_mount1:0 (ocf::linbit:drbd): Started node2
>> (unmanaged)
>>>> p_drbd_mount1:1 (ocf::linbit:drbd): Started node1
>>>> (unmanaged) FAILED
>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>>> p_drbd_mount2:0 (ocf::linbit:drbd): Master node1
>>>> (unmanaged) FAILED
>>>> Slaves: [ node2 ]
>>>> Resource Group: g_core
>>>> p_fs_mount1 (ocf::heartbeat:Filesystem): Started node1
>>>> p_fs_mount2 (ocf::heartbeat:Filesystem): Started node1
>>>> p_ip_nfs (ocf::heartbeat:IPaddr2): Started node1
>>>> Resource Group: g_apache
>>>> p_fs_mountbind1 (ocf::heartbeat:Filesystem): Started node1
>>>> p_fs_mountbind2 (ocf::heartbeat:Filesystem): Started node1
>>>> p_fs_mountbind3 (ocf::heartbeat:Filesystem): Started node1
>>>> p_fs_varwww (ocf::heartbeat:Filesystem): Started node1
>>>> p_apache (ocf::heartbeat:apache): Started node1
>>>> Resource Group: g_fileservers
>>>> p_lsb_smb (lsb:smbd): Started node1
>>>> p_lsb_nmb (lsb:nmbd): Started node1
>>>> p_lsb_nfsserver (lsb:nfs-kernel-server): Started node1
>>>> p_exportfs_mount1 (ocf::heartbeat:exportfs): Started node1
>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): Started node1
>>>>
>>>> I have read through the Pacemaker Explained
>>>>
>>>
>> <http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained>
>>>> documentation, however could not find a way to further debug these
>>>> problems. First, I put node1 into standby mode to attempt failover to
>>>> the other node (node2). Node2 appeared to start the transition to
>>>> master, however it failed to promote the DRBD resources to master (the
>>>> first step). I have attached a copy of this session in commands.log and
>>>> additional excerpts from /var/log/syslog during important steps. I have
>>>> attempted everything I can think of to try and start the DRBD resource
>>>> (e.g. start/stop/promote/manage/cleanup under crm resource, restarting
>>>> heartbeat) but cannot bring it out of the slave state. However, if I set
>>>> it to unmanaged and then run drbdadm primary all in the terminal,
>>>> pacemaker is satisfied and continues starting the rest of the resources.
>>>> It then failed when attempting to mount the filesystem for mount2, the
>>>> p_fs_mount2 resource. I attempted to mount the filesystem myself and was
>>>> successful. I then unmounted it and ran cleanup on p_fs_mount2 and then
>>>> it mounted. The rest of the resources started as expected until the
>>>> p_exportfs_mount2 resource, which failed as follows:
>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2
>>>> (unmanaged) FAILED
>>>>
>>>> I ran cleanup on this and it started, however when running this test
>>>> earlier today no command could successfully start this exportfs
> resource.
>>>>
>>>> How can I configure pacemaker to better resolve these problems and be
>>>> able to bring the node up successfully on its own? What can I check to
>>>> determine why these failures are occuring? /var/log/syslog did not seem
>>>> to contain very much useful information regarding why the failures
>>> occurred.
>>>>
>>>> Thanks,
>>>>
>>>> Andrew
>>>>
>>>>
>>>>
>>>>
>>>> This body part will be downloaded on demand.
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker [at] oss
> <mailto:Pacemaker [at] oss>
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>>>
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker [at] oss
> <mailto:Pacemaker [at] oss>
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker [at] oss
> <mailto:Pacemaker [at] oss>
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker [at] oss
> <mailto:Pacemaker [at] oss>
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> <mailto:Pacemaker [at] oss>
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org




_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


andreas at hastexo

Mar 30, 2012, 2:41 AM

Post #14 of 16 (1660 views)
Permalink
Re: Nodes will not promote DRBD resources to master on failover [In reply to]

On 03/28/2012 04:56 PM, Andrew Martin wrote:
> Hi Andreas,
>
> I disabled the DRBD init script and then restarted the slave node
> (node2). After it came back up, DRBD did not start:
> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): pending
> Online: [ node2 node1 ]
>
> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
> Masters: [ node1 ]
> Stopped: [ p_drbd_vmstore:1 ]
> Master/Slave Set: ms_drbd_mount1 [p_drbd_tools]
> Masters: [ node1 ]
> Stopped: [ p_drbd_mount1:1 ]
> Master/Slave Set: ms_drbd_mount2 [p_drbdmount2]
> Masters: [ node1 ]
> Stopped: [ p_drbd_mount2:1 ]
> ...
>
> root [at] node:~# service drbd status
> drbd not loaded

Yes, expected unless Pacemaker starts DRBD

>
> Is there something else I need to change in the CIB to ensure that DRBD
> is started? All of my DRBD devices are configured like this:
> primitive p_drbd_mount2 ocf:linbit:drbd \
> params drbd_resource="mount2" \
> op monitor interval="15" role="Master" \
> op monitor interval="30" role="Slave"
> ms ms_drbd_mount2 p_drbd_mount2 \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"

That should be enough ... unable to say more without seeing the complete
configuration ... too much fragments of information ;-)

Please provide (e.g. pastebin) your complete cib (cibadmin -Q) when
cluster is in that state ... or even better create a crm_report archive

>
> Here is the output from the syslog (grep -i drbd /var/log/syslog):
> Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing
> key=12:315:7:24416169-73ba-469b-a2e3-56a22b437cbc
> op=p_drbd_vmstore:1_monitor_0 )
> Mar 28 09:24:47 node2 lrmd: [3210]: info: rsc:p_drbd_vmstore:1 probe[2]
> (pid 3455)
> Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing
> key=13:315:7:24416169-73ba-469b-a2e3-56a22b437cbc
> op=p_drbd_mount1:1_monitor_0 )
> Mar 28 09:24:48 node2 lrmd: [3210]: info: rsc:p_drbd_mount1:1 probe[3]
> (pid 3456)
> Mar 28 09:24:48 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing
> key=14:315:7:24416169-73ba-469b-a2e3-56a22b437cbc
> op=p_drbd_mount2:1_monitor_0 )
> Mar 28 09:24:48 node2 lrmd: [3210]: info: rsc:p_drbd_mount2:1 probe[4]
> (pid 3457)
> Mar 28 09:24:48 node2 Filesystem[3458]: [3517]: WARNING: Couldn't find
> device [/dev/drbd0]. Expected /dev/??? to exist
> Mar 28 09:24:48 node2 crm_attribute: [3563]: info: Invoked:
> crm_attribute -N node2 -n master-p_drbd_mount2:1 -l reboot -D
> Mar 28 09:24:48 node2 crm_attribute: [3557]: info: Invoked:
> crm_attribute -N node2 -n master-p_drbd_vmstore:1 -l reboot -D
> Mar 28 09:24:48 node2 crm_attribute: [3562]: info: Invoked:
> crm_attribute -N node2 -n master-p_drbd_mount1:1 -l reboot -D
> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[4] on
> p_drbd_mount2:1 for client 3213: pid 3457 exited with return code 7
> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[2] on
> p_drbd_vmstore:1 for client 3213: pid 3455 exited with return code 7
> Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM
> operation p_drbd_mount2:1_monitor_0 (call=4, rc=7, cib-update=10,
> confirmed=true) not running
> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[3] on
> p_drbd_mount1:1 for client 3213: pid 3456 exited with return code 7
> Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM
> operation p_drbd_vmstore:1_monitor_0 (call=2, rc=7, cib-update=11,
> confirmed=true) not running
> Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM
> operation p_drbd_mount1:1_monitor_0 (call=3, rc=7, cib-update=12,
> confirmed=true) not running

No errors, just probing ... so for any reason Pacemaker does not like to
start it ... use crm_simulate to find out why ... or provide information
as requested above.

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

>
> Thanks,
>
> Andrew
>
> ------------------------------------------------------------------------
> *From: *"Andreas Kurz" <andreas [at] hastexo>
> *To: *pacemaker [at] oss
> *Sent: *Wednesday, March 28, 2012 9:03:06 AM
> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
> master on failover
>
> On 03/28/2012 03:47 PM, Andrew Martin wrote:
>> Hi Andreas,
>>
>>> hmm ... what is that fence-peer script doing? If you want to use
>>> resource-level fencing with the help of dopd, activate the
>>> drbd-peer-outdater script in the line above ... and double check if the
>>> path is correct
>> fence-peer is just a wrapper for drbd-peer-outdater that does some
>> additional logging. In my testing dopd has been working well.
>
> I see
>
>>
>>>> I am thinking of making the following changes to the CIB (as per the
>>>> official DRBD
>>>> guide
>>
> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html) in
>>>> order to add the DRBD lsb service and require that it start before the
>>>> ocf:linbit:drbd resources. Does this look correct?
>>>
>>> Where did you read that? No, deactivate the startup of DRBD on system
>>> boot and let Pacemaker manage it completely.
>>>
>>>> primitive p_drbd-init lsb:drbd op monitor interval="30"
>>>> colocation c_drbd_together inf:
>>>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master
>>>> ms_drbd_mount2:Master
>>>> order drbd_init_first inf: ms_drbd_vmstore:promote
>>>> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start
>>>>
>>>> This doesn't seem to require that drbd be also running on the node where
>>>> the ocf:linbit:drbd resources are slave (which it would need to do to be
>>>> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere?
>>>> (clone cl_drbd p_drbd-init ?)
>>>
>>> This is really not needed.
>> I was following the official DRBD Users Guide:
>> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html
>>
>> If I am understanding your previous message correctly, I do not need to
>> add a lsb primitive for the drbd daemon? It will be
>> started/stopped/managed automatically by my ocf:linbit:drbd resources
>> (and I can remove the /etc/rc* symlinks)?
>
> Yes, you don't need that LSB script when using Pacemaker and should not
> let init start it.
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
>>
>> Thanks,
>>
>> Andrew
>>
>> ------------------------------------------------------------------------
>> *From: *"Andreas Kurz" <andreas [at] hastexo <mailto:andreas [at] hastexo>>
>> *To: *pacemaker [at] oss <mailto:pacemaker [at] oss>
>> *Sent: *Wednesday, March 28, 2012 7:27:34 AM
>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
>> master on failover
>>
>> On 03/28/2012 12:13 AM, Andrew Martin wrote:
>>> Hi Andreas,
>>>
>>> Thanks, I've updated the colocation rule to be in the correct order. I
>>> also enabled the STONITH resource (this was temporarily disabled before
>>> for some additional testing). DRBD has its own network connection over
>>> the br1 interface (192.168.5.0/24 network), a direct crossover cable
>>> between node1 and node2:
>>> global { usage-count no; }
>>> common {
>>> syncer { rate 110M; }
>>> }
>>> resource vmstore {
>>> protocol C;
>>> startup {
>>> wfc-timeout 15;
>>> degr-wfc-timeout 60;
>>> }
>>> handlers {
>>> #fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
>>> fence-peer "/usr/local/bin/fence-peer";
>>
>> hmm ... what is that fence-peer script doing? If you want to use
>> resource-level fencing with the help of dopd, activate the
>> drbd-peer-outdater script in the line above ... and double check if the
>> path is correct
>>
>>> split-brain "/usr/lib/drbd/notify-split-brain.sh
>>> me [at] example <mailto:me [at] example>";
>>> }
>>> net {
>>> after-sb-0pri discard-zero-changes;
>>> after-sb-1pri discard-secondary;
>>> after-sb-2pri disconnect;
>>> cram-hmac-alg md5;
>>> shared-secret "xxxxx";
>>> }
>>> disk {
>>> fencing resource-only;
>>> }
>>> on node1 {
>>> device /dev/drbd0;
>>> disk /dev/sdb1;
>>> address 192.168.5.10:7787;
>>> meta-disk internal;
>>> }
>>> on node2 {
>>> device /dev/drbd0;
>>> disk /dev/sdf1;
>>> address 192.168.5.11:7787;
>>> meta-disk internal;
>>> }
>>> }
>>> # and similar for mount1 and mount2
>>>
>>> Also, here is my ha.cf. It uses both the direct link between the nodes
>>> (br1) and the shared LAN network on br0 for communicating:
>>> autojoin none
>>> mcast br0 239.0.0.43 694 1 0
>>> bcast br1
>>> warntime 5
>>> deadtime 15
>>> initdead 60
>>> keepalive 2
>>> node node1
>>> node node2
>>> node quorumnode
>>> crm respawn
>>> respawn hacluster /usr/lib/heartbeat/dopd
>>> apiauth dopd gid=haclient uid=hacluster
>>>
>>> I am thinking of making the following changes to the CIB (as per the
>>> official DRBD
>>> guide
>>
> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html) in
>>> order to add the DRBD lsb service and require that it start before the
>>> ocf:linbit:drbd resources. Does this look correct?
>>
>> Where did you read that? No, deactivate the startup of DRBD on system
>> boot and let Pacemaker manage it completely.
>>
>>> primitive p_drbd-init lsb:drbd op monitor interval="30"
>>> colocation c_drbd_together inf:
>>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master
>>> ms_drbd_mount2:Master
>>> order drbd_init_first inf: ms_drbd_vmstore:promote
>>> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start
>>>
>>> This doesn't seem to require that drbd be also running on the node where
>>> the ocf:linbit:drbd resources are slave (which it would need to do to be
>>> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere?
>>> (clone cl_drbd p_drbd-init ?)
>>
>> This is really not needed.
>>
>> Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>>
>>> Thanks,
>>>
>>> Andrew
>>> ------------------------------------------------------------------------
>>> *From: *"Andreas Kurz" <andreas [at] hastexo <mailto:andreas [at] hastexo>>
>>> *To: *pacemaker [at] oss
> <mailto:*pacemaker [at] oss>
>>> *Sent: *Monday, March 26, 2012 5:56:22 PM
>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
>>> master on failover
>>>
>>> On 03/24/2012 08:15 PM, Andrew Martin wrote:
>>>> Hi Andreas,
>>>>
>>>> My complete cluster configuration is as follows:
>>>> ============
>>>> Last updated: Sat Mar 24 13:51:55 2012
>>>> Last change: Sat Mar 24 13:41:55 2012
>>>> Stack: Heartbeat
>>>> Current DC: node2 (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition
>>>> with quorum
>>>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
>>>> 3 Nodes configured, unknown expected votes
>>>> 19 Resources configured.
>>>> ============
>>>>
>>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE
> (standby)
>>>> Online: [ node2 node1 ]
>>>>
>>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>>>> Masters: [ node2 ]
>>>> Slaves: [ node1 ]
>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>>>> Masters: [ node2 ]
>>>> Slaves: [ node1 ]
>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>>> Masters: [ node2 ]
>>>> Slaves: [ node1 ]
>>>> Resource Group: g_vm
>>>> p_fs_vmstore(ocf::heartbeat:Filesystem):Started node2
>>>> p_vm(ocf::heartbeat:VirtualDomain):Started node2
>>>> Clone Set: cl_daemons [g_daemons]
>>>> Started: [ node2 node1 ]
>>>> Stopped: [ g_daemons:2 ]
>>>> Clone Set: cl_sysadmin_notify [p_sysadmin_notify]
>>>> Started: [ node2 node1 ]
>>>> Stopped: [ p_sysadmin_notify:2 ]
>>>> stonith-node1(stonith:external/tripplitepdu):Started node2
>>>> stonith-node2(stonith:external/tripplitepdu):Started node1
>>>> Clone Set: cl_ping [p_ping]
>>>> Started: [ node2 node1 ]
>>>> Stopped: [ p_ping:2 ]
>>>>
>>>> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \
>>>> attributes standby="off"
>>>> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \
>>>> attributes standby="off"
>>>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \
>>>> attributes standby="on"
>>>> primitive p_drbd_mount2 ocf:linbit:drbd \
>>>> params drbd_resource="mount2" \
>>>> op monitor interval="15" role="Master" \
>>>> op monitor interval="30" role="Slave"
>>>> primitive p_drbd_mount1 ocf:linbit:drbd \
>>>> params drbd_resource="mount1" \
>>>> op monitor interval="15" role="Master" \
>>>> op monitor interval="30" role="Slave"
>>>> primitive p_drbd_vmstore ocf:linbit:drbd \
>>>> params drbd_resource="vmstore" \
>>>> op monitor interval="15" role="Master" \
>>>> op monitor interval="30" role="Slave"
>>>> primitive p_fs_vmstore ocf:heartbeat:Filesystem \
>>>> params device="/dev/drbd0" directory="/vmstore" fstype="ext4" \
>>>> op start interval="0" timeout="60s" \
>>>> op stop interval="0" timeout="60s" \
>>>> op monitor interval="20s" timeout="40s"
>>>> primitive p_libvirt-bin upstart:libvirt-bin \
>>>> op monitor interval="30"
>>>> primitive p_ping ocf:pacemaker:ping \
>>>> params name="p_ping" host_list="192.168.1.10 192.168.1.11"
>>>> multiplier="1000" \
>>>> op monitor interval="20s"
>>>> primitive p_sysadmin_notify ocf:heartbeat:MailTo \
>>>> params email="me [at] example <mailto:me [at] example>" \
>>>> params subject="Pacemaker Change" \
>>>> op start interval="0" timeout="10" \
>>>> op stop interval="0" timeout="10" \
>>>> op monitor interval="10" timeout="10"
>>>> primitive p_vm ocf:heartbeat:VirtualDomain \
>>>> params config="/vmstore/config/vm.xml" \
>>>> meta allow-migrate="false" \
>>>> op start interval="0" timeout="120s" \
>>>> op stop interval="0" timeout="120s" \
>>>> op monitor interval="10" timeout="30"
>>>> primitive stonith-node1 stonith:external/tripplitepdu \
>>>> params pdu_ipaddr="192.168.1.12" pdu_port="1" pdu_username="xxx"
>>>> pdu_password="xxx" hostname_to_stonith="node1"
>>>> primitive stonith-node2 stonith:external/tripplitepdu \
>>>> params pdu_ipaddr="192.168.1.12" pdu_port="2" pdu_username="xxx"
>>>> pdu_password="xxx" hostname_to_stonith="node2"
>>>> group g_daemons p_libvirt-bin
>>>> group g_vm p_fs_vmstore p_vm
>>>> ms ms_drbd_mount2 p_drbd_mount2 \
>>>> meta master-max="1" master-node-max="1" clone-max="2"
>>>> clone-node-max="1" notify="true"
>>>> ms ms_drbd_mount1 p_drbd_mount1 \
>>>> meta master-max="1" master-node-max="1" clone-max="2"
>>>> clone-node-max="1" notify="true"
>>>> ms ms_drbd_vmstore p_drbd_vmstore \
>>>> meta master-max="1" master-node-max="1" clone-max="2"
>>>> clone-node-max="1" notify="true"
>>>> clone cl_daemons g_daemons
>>>> clone cl_ping p_ping \
>>>> meta interleave="true"
>>>> clone cl_sysadmin_notify p_sysadmin_notify
>>>> location l-st-node1 stonith-node1 -inf: node1
>>>> location l-st-node2 stonith-node2 -inf: node2
>>>> location l_run_on_most_connected p_vm \
>>>> rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping
>>>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master
>>>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
>>>
>>> As Emmanuel already said, g_vm has to be in the first place in this
>>> collocation constraint .... g_vm must be colocated with the drbd masters.
>>>
>>>> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote
>>>> ms_drbd_mount2:promote cl_daemons:start g_vm:start
>>>> property $id="cib-bootstrap-options" \
>>>> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
>>>> cluster-infrastructure="Heartbeat" \
>>>> stonith-enabled="false" \
>>>> no-quorum-policy="stop" \
>>>> last-lrm-refresh="1332539900" \
>>>> cluster-recheck-interval="5m" \
>>>> crmd-integration-timeout="3m" \
>>>> shutdown-escalation="5m"
>>>>
>>>> The STONITH plugin is a custom plugin I wrote for the Tripp-Lite
>>>> PDUMH20ATNET that I'm using as the STONITH device:
>>>> http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf
>>>
>>> And why don't using it? .... stonith-enabled="false"
>>>
>>>>
>>>> As you can see, I left the DRBD service to be started by the operating
>>>> system (as an lsb script at boot time) however Pacemaker controls
>>>> actually bringing up/taking down the individual DRBD devices.
>>>
>>> Don't start drbd on system boot, give Pacemaker the full control.
>>>
>>> The
>>>> behavior I observe is as follows: I issue "crm resource migrate p_vm" on
>>>> node1 and failover successfully to node2. During this time, node2 fences
>>>> node1's DRBD devices (using dopd) and marks them as Outdated. Meanwhile
>>>> node2's DRBD devices are UpToDate. I then shutdown both nodes and then
>>>> bring them back up. They reconnect to the cluster (with quorum), and
>>>> node1's DRBD devices are still Outdated as expected and node2's DRBD
>>>> devices are still UpToDate, as expected. At this point, DRBD starts on
>>>> both nodes, however node2 will not set DRBD as master:
>>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE
> (standby)
>>>> Online: [ node2 node1 ]
>>>>
>>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>>>> Slaves: [ node1 node2 ]
>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>>>> Slaves: [ node1 node 2 ]
>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>>> Slaves: [ node1 node2 ]
>>>
>>> There should really be no interruption of the drbd replication on vm
>>> migration that activates the dopd ... drbd has its own direct network
>>> connection?
>>>
>>> Please share your ha.cf file and your drbd configuration. Watch out for
>>> drbd messages in your kernel log file, that should give you additional
>>> information when/why the drbd connection was lost.
>>>
>>> Regards,
>>> Andreas
>>>
>>> --
>>> Need help with Pacemaker?
>>> http://www.hastexo.com/now
>>>
>>>>
>>>> I am having trouble sorting through the logging information because
>>>> there is so much of it in /var/log/daemon.log, but I can't find an
>>>> error message printed about why it will not promote node2. At this point
>>>> the DRBD devices are as follows:
>>>> node2: cstate = WFConnection dstate=UpToDate
>>>> node1: cstate = StandAlone dstate=Outdated
>>>>
>>>> I don't see any reason why node2 can't become DRBD master, or am I
>>>> missing something? If I do "drbdadm connect all" on node1, then the
>>>> cstate on both nodes changes to "Connected" and node2 immediately
>>>> promotes the DRBD resources to master. Any ideas on why I'm observing
>>>> this incorrect behavior?
>>>>
>>>> Any tips on how I can better filter through the pacemaker/heartbeat logs
>>>> or how to get additional useful debug information?
>>>>
>>>> Thanks,
>>>>
>>>> Andrew
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From: *"Andreas Kurz" <andreas [at] hastexo
> <mailto:andreas [at] hastexo>>
>>>> *To: *pacemaker [at] oss
>> <mailto:*pacemaker [at] oss>
>>>> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM
>>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
>>>> master on failover
>>>>
>>>> On 01/25/2012 08:58 PM, Andrew Martin wrote:
>>>>> Hello,
>>>>>
>>>>> Recently I finished configuring a two-node cluster with pacemaker 1.1.6
>>>>> and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster
> includes
>>>>> the following resources:
>>>>> - primitives for DRBD storage devices
>>>>> - primitives for mounting the filesystem on the DRBD storage
>>>>> - primitives for some mount binds
>>>>> - primitive for starting apache
>>>>> - primitives for starting samba and nfs servers (following instructions
>>>>> here <http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf>)
>>>>> - primitives for exporting nfs shares (ocf:heartbeat:exportfs)
>>>>
>>>> not enough information ... please share at least your complete cluster
>>>> configuration
>>>>
>>>> Regards,
>>>> Andreas
>>>>
>>>> --
>>>> Need help with Pacemaker?
>>>> http://www.hastexo.com/now
>>>>
>>>>>
>>>>> Perhaps this is best described through the output of crm_mon:
>>>>> Online: [ node1 node2 ]
>>>>>
>>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
>>>>> p_drbd_mount1:0 (ocf::linbit:drbd): Started node2
>>> (unmanaged)
>>>>> p_drbd_mount1:1 (ocf::linbit:drbd): Started node1
>>>>> (unmanaged) FAILED
>>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>>>> p_drbd_mount2:0 (ocf::linbit:drbd): Master node1
>>>>> (unmanaged) FAILED
>>>>> Slaves: [ node2 ]
>>>>> Resource Group: g_core
>>>>> p_fs_mount1 (ocf::heartbeat:Filesystem): Started node1
>>>>> p_fs_mount2 (ocf::heartbeat:Filesystem): Started node1
>>>>> p_ip_nfs (ocf::heartbeat:IPaddr2): Started node1
>>>>> Resource Group: g_apache
>>>>> p_fs_mountbind1 (ocf::heartbeat:Filesystem): Started node1
>>>>> p_fs_mountbind2 (ocf::heartbeat:Filesystem): Started node1
>>>>> p_fs_mountbind3 (ocf::heartbeat:Filesystem): Started node1
>>>>> p_fs_varwww (ocf::heartbeat:Filesystem): Started node1
>>>>> p_apache (ocf::heartbeat:apache): Started node1
>>>>> Resource Group: g_fileservers
>>>>> p_lsb_smb (lsb:smbd): Started node1
>>>>> p_lsb_nmb (lsb:nmbd): Started node1
>>>>> p_lsb_nfsserver (lsb:nfs-kernel-server): Started node1
>>>>> p_exportfs_mount1 (ocf::heartbeat:exportfs): Started node1
>>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): Started
> node1
>>>>>
>>>>> I have read through the Pacemaker Explained
>>>>>
>>>>
>>>
> <http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained>
>>>>> documentation, however could not find a way to further debug these
>>>>> problems. First, I put node1 into standby mode to attempt failover to
>>>>> the other node (node2). Node2 appeared to start the transition to
>>>>> master, however it failed to promote the DRBD resources to master (the
>>>>> first step). I have attached a copy of this session in commands.log and
>>>>> additional excerpts from /var/log/syslog during important steps. I have
>>>>> attempted everything I can think of to try and start the DRBD resource
>>>>> (e.g. start/stop/promote/manage/cleanup under crm resource, restarting
>>>>> heartbeat) but cannot bring it out of the slave state. However, if
> I set
>>>>> it to unmanaged and then run drbdadm primary all in the terminal,
>>>>> pacemaker is satisfied and continues starting the rest of the
> resources.
>>>>> It then failed when attempting to mount the filesystem for mount2, the
>>>>> p_fs_mount2 resource. I attempted to mount the filesystem myself
> and was
>>>>> successful. I then unmounted it and ran cleanup on p_fs_mount2 and then
>>>>> it mounted. The rest of the resources started as expected until the
>>>>> p_exportfs_mount2 resource, which failed as follows:
>>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2
>>>>> (unmanaged) FAILED
>>>>>
>>>>> I ran cleanup on this and it started, however when running this test
>>>>> earlier today no command could successfully start this exportfs
>> resource.
>>>>>
>>>>> How can I configure pacemaker to better resolve these problems and be
>>>>> able to bring the node up successfully on its own? What can I check to
>>>>> determine why these failures are occuring? /var/log/syslog did not seem
>>>>> to contain very much useful information regarding why the failures
>>>> occurred.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Andrew
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> This body part will be downloaded on demand.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker [at] oss
>> <mailto:Pacemaker [at] oss>
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker [at] oss
>> <mailto:Pacemaker [at] oss>
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>>
>>>
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker [at] oss
>> <mailto:Pacemaker [at] oss>
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>>>
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker [at] oss
>> <mailto:Pacemaker [at] oss>
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker [at] oss
>> <mailto:Pacemaker [at] oss>
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker [at] oss
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
Attachments: signature.asc (0.22 KB)


amartin at xes-inc

Mar 30, 2012, 7:10 AM

Post #15 of 16 (1660 views)
Permalink
Re: Nodes will not promote DRBD resources to master on failover [In reply to]

Hi Andreas,


Here is a copy of my complete CIB:
http://pastebin.com/v5wHVFuy


I'll work on generating a report using crm_report as well.


Thanks,


Andrew

----- Original Message -----

From: "Andreas Kurz" <andreas [at] hastexo>
To: pacemaker [at] oss
Sent: Friday, March 30, 2012 4:41:16 AM
Subject: Re: [Pacemaker] Nodes will not promote DRBD resources to master on failover

On 03/28/2012 04:56 PM, Andrew Martin wrote:
> Hi Andreas,
>
> I disabled the DRBD init script and then restarted the slave node
> (node2). After it came back up, DRBD did not start:
> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): pending
> Online: [ node2 node1 ]
>
> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
> Masters: [ node1 ]
> Stopped: [ p_drbd_vmstore:1 ]
> Master/Slave Set: ms_drbd_mount1 [p_drbd_tools]
> Masters: [ node1 ]
> Stopped: [ p_drbd_mount1:1 ]
> Master/Slave Set: ms_drbd_mount2 [p_drbdmount2]
> Masters: [ node1 ]
> Stopped: [ p_drbd_mount2:1 ]
> ...
>
> root [at] node:~# service drbd status
> drbd not loaded

Yes, expected unless Pacemaker starts DRBD

>
> Is there something else I need to change in the CIB to ensure that DRBD
> is started? All of my DRBD devices are configured like this:
> primitive p_drbd_mount2 ocf:linbit:drbd \
> params drbd_resource="mount2" \
> op monitor interval="15" role="Master" \
> op monitor interval="30" role="Slave"
> ms ms_drbd_mount2 p_drbd_mount2 \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"

That should be enough ... unable to say more without seeing the complete
configuration ... too much fragments of information ;-)

Please provide (e.g. pastebin) your complete cib (cibadmin -Q) when
cluster is in that state ... or even better create a crm_report archive

>
> Here is the output from the syslog (grep -i drbd /var/log/syslog):
> Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing
> key=12:315:7:24416169-73ba-469b-a2e3-56a22b437cbc
> op=p_drbd_vmstore:1_monitor_0 )
> Mar 28 09:24:47 node2 lrmd: [3210]: info: rsc:p_drbd_vmstore:1 probe[2]
> (pid 3455)
> Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing
> key=13:315:7:24416169-73ba-469b-a2e3-56a22b437cbc
> op=p_drbd_mount1:1_monitor_0 )
> Mar 28 09:24:48 node2 lrmd: [3210]: info: rsc:p_drbd_mount1:1 probe[3]
> (pid 3456)
> Mar 28 09:24:48 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing
> key=14:315:7:24416169-73ba-469b-a2e3-56a22b437cbc
> op=p_drbd_mount2:1_monitor_0 )
> Mar 28 09:24:48 node2 lrmd: [3210]: info: rsc:p_drbd_mount2:1 probe[4]
> (pid 3457)
> Mar 28 09:24:48 node2 Filesystem[3458]: [3517]: WARNING: Couldn't find
> device [/dev/drbd0]. Expected /dev/??? to exist
> Mar 28 09:24:48 node2 crm_attribute: [3563]: info: Invoked:
> crm_attribute -N node2 -n master-p_drbd_mount2:1 -l reboot -D
> Mar 28 09:24:48 node2 crm_attribute: [3557]: info: Invoked:
> crm_attribute -N node2 -n master-p_drbd_vmstore:1 -l reboot -D
> Mar 28 09:24:48 node2 crm_attribute: [3562]: info: Invoked:
> crm_attribute -N node2 -n master-p_drbd_mount1:1 -l reboot -D
> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[4] on
> p_drbd_mount2:1 for client 3213: pid 3457 exited with return code 7
> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[2] on
> p_drbd_vmstore:1 for client 3213: pid 3455 exited with return code 7
> Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM
> operation p_drbd_mount2:1_monitor_0 (call=4, rc=7, cib-update=10,
> confirmed=true) not running
> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[3] on
> p_drbd_mount1:1 for client 3213: pid 3456 exited with return code 7
> Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM
> operation p_drbd_vmstore:1_monitor_0 (call=2, rc=7, cib-update=11,
> confirmed=true) not running
> Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM
> operation p_drbd_mount1:1_monitor_0 (call=3, rc=7, cib-update=12,
> confirmed=true) not running

No errors, just probing ... so for any reason Pacemaker does not like to
start it ... use crm_simulate to find out why ... or provide information
as requested above.

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

>
> Thanks,
>
> Andrew
>
> ------------------------------------------------------------------------
> *From: *"Andreas Kurz" <andreas [at] hastexo>
> *To: *pacemaker [at] oss
> *Sent: *Wednesday, March 28, 2012 9:03:06 AM
> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
> master on failover
>
> On 03/28/2012 03:47 PM, Andrew Martin wrote:
>> Hi Andreas,
>>
>>> hmm ... what is that fence-peer script doing? If you want to use
>>> resource-level fencing with the help of dopd, activate the
>>> drbd-peer-outdater script in the line above ... and double check if the
>>> path is correct
>> fence-peer is just a wrapper for drbd-peer-outdater that does some
>> additional logging. In my testing dopd has been working well.
>
> I see
>
>>
>>>> I am thinking of making the following changes to the CIB (as per the
>>>> official DRBD
>>>> guide
>>
> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html) in
>>>> order to add the DRBD lsb service and require that it start before the
>>>> ocf:linbit:drbd resources. Does this look correct?
>>>
>>> Where did you read that? No, deactivate the startup of DRBD on system
>>> boot and let Pacemaker manage it completely.
>>>
>>>> primitive p_drbd-init lsb:drbd op monitor interval="30"
>>>> colocation c_drbd_together inf:
>>>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master
>>>> ms_drbd_mount2:Master
>>>> order drbd_init_first inf: ms_drbd_vmstore:promote
>>>> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start
>>>>
>>>> This doesn't seem to require that drbd be also running on the node where
>>>> the ocf:linbit:drbd resources are slave (which it would need to do to be
>>>> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere?
>>>> (clone cl_drbd p_drbd-init ?)
>>>
>>> This is really not needed.
>> I was following the official DRBD Users Guide:
>> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html
>>
>> If I am understanding your previous message correctly, I do not need to
>> add a lsb primitive for the drbd daemon? It will be
>> started/stopped/managed automatically by my ocf:linbit:drbd resources
>> (and I can remove the /etc/rc* symlinks)?
>
> Yes, you don't need that LSB script when using Pacemaker and should not
> let init start it.
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
>>
>> Thanks,
>>
>> Andrew
>>
>> ------------------------------------------------------------------------
>> *From: *"Andreas Kurz" <andreas [at] hastexo <mailto:andreas [at] hastexo>>
>> *To: *pacemaker [at] oss <mailto:pacemaker [at] oss>
>> *Sent: *Wednesday, March 28, 2012 7:27:34 AM
>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
>> master on failover
>>
>> On 03/28/2012 12:13 AM, Andrew Martin wrote:
>>> Hi Andreas,
>>>
>>> Thanks, I've updated the colocation rule to be in the correct order. I
>>> also enabled the STONITH resource (this was temporarily disabled before
>>> for some additional testing). DRBD has its own network connection over
>>> the br1 interface (192.168.5.0/24 network), a direct crossover cable
>>> between node1 and node2:
>>> global { usage-count no; }
>>> common {
>>> syncer { rate 110M; }
>>> }
>>> resource vmstore {
>>> protocol C;
>>> startup {
>>> wfc-timeout 15;
>>> degr-wfc-timeout 60;
>>> }
>>> handlers {
>>> #fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
>>> fence-peer "/usr/local/bin/fence-peer";
>>
>> hmm ... what is that fence-peer script doing? If you want to use
>> resource-level fencing with the help of dopd, activate the
>> drbd-peer-outdater script in the line above ... and double check if the
>> path is correct
>>
>>> split-brain "/usr/lib/drbd/notify-split-brain.sh
>>> me [at] example <mailto:me [at] example>";
>>> }
>>> net {
>>> after-sb-0pri discard-zero-changes;
>>> after-sb-1pri discard-secondary;
>>> after-sb-2pri disconnect;
>>> cram-hmac-alg md5;
>>> shared-secret "xxxxx";
>>> }
>>> disk {
>>> fencing resource-only;
>>> }
>>> on node1 {
>>> device /dev/drbd0;
>>> disk /dev/sdb1;
>>> address 192.168.5.10:7787;
>>> meta-disk internal;
>>> }
>>> on node2 {
>>> device /dev/drbd0;
>>> disk /dev/sdf1;
>>> address 192.168.5.11:7787;
>>> meta-disk internal;
>>> }
>>> }
>>> # and similar for mount1 and mount2
>>>
>>> Also, here is my ha.cf. It uses both the direct link between the nodes
>>> (br1) and the shared LAN network on br0 for communicating:
>>> autojoin none
>>> mcast br0 239.0.0.43 694 1 0
>>> bcast br1
>>> warntime 5
>>> deadtime 15
>>> initdead 60
>>> keepalive 2
>>> node node1
>>> node node2
>>> node quorumnode
>>> crm respawn
>>> respawn hacluster /usr/lib/heartbeat/dopd
>>> apiauth dopd gid=haclient uid=hacluster
>>>
>>> I am thinking of making the following changes to the CIB (as per the
>>> official DRBD
>>> guide
>>
> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html) in
>>> order to add the DRBD lsb service and require that it start before the
>>> ocf:linbit:drbd resources. Does this look correct?
>>
>> Where did you read that? No, deactivate the startup of DRBD on system
>> boot and let Pacemaker manage it completely.
>>
>>> primitive p_drbd-init lsb:drbd op monitor interval="30"
>>> colocation c_drbd_together inf:
>>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master
>>> ms_drbd_mount2:Master
>>> order drbd_init_first inf: ms_drbd_vmstore:promote
>>> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start
>>>
>>> This doesn't seem to require that drbd be also running on the node where
>>> the ocf:linbit:drbd resources are slave (which it would need to do to be
>>> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere?
>>> (clone cl_drbd p_drbd-init ?)
>>
>> This is really not needed.
>>
>> Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>>
>>> Thanks,
>>>
>>> Andrew
>>> ------------------------------------------------------------------------
>>> *From: *"Andreas Kurz" <andreas [at] hastexo <mailto:andreas [at] hastexo>>
>>> *To: *pacemaker [at] oss
> <mailto:*pacemaker [at] oss>
>>> *Sent: *Monday, March 26, 2012 5:56:22 PM
>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
>>> master on failover
>>>
>>> On 03/24/2012 08:15 PM, Andrew Martin wrote:
>>>> Hi Andreas,
>>>>
>>>> My complete cluster configuration is as follows:
>>>> ============
>>>> Last updated: Sat Mar 24 13:51:55 2012
>>>> Last change: Sat Mar 24 13:41:55 2012
>>>> Stack: Heartbeat
>>>> Current DC: node2 (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition
>>>> with quorum
>>>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
>>>> 3 Nodes configured, unknown expected votes
>>>> 19 Resources configured.
>>>> ============
>>>>
>>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE
> (standby)
>>>> Online: [ node2 node1 ]
>>>>
>>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>>>> Masters: [ node2 ]
>>>> Slaves: [ node1 ]
>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>>>> Masters: [ node2 ]
>>>> Slaves: [ node1 ]
>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>>> Masters: [ node2 ]
>>>> Slaves: [ node1 ]
>>>> Resource Group: g_vm
>>>> p_fs_vmstore(ocf::heartbeat:Filesystem):Started node2
>>>> p_vm(ocf::heartbeat:VirtualDomain):Started node2
>>>> Clone Set: cl_daemons [g_daemons]
>>>> Started: [ node2 node1 ]
>>>> Stopped: [ g_daemons:2 ]
>>>> Clone Set: cl_sysadmin_notify [p_sysadmin_notify]
>>>> Started: [ node2 node1 ]
>>>> Stopped: [ p_sysadmin_notify:2 ]
>>>> stonith-node1(stonith:external/tripplitepdu):Started node2
>>>> stonith-node2(stonith:external/tripplitepdu):Started node1
>>>> Clone Set: cl_ping [p_ping]
>>>> Started: [ node2 node1 ]
>>>> Stopped: [ p_ping:2 ]
>>>>
>>>> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \
>>>> attributes standby="off"
>>>> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \
>>>> attributes standby="off"
>>>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \
>>>> attributes standby="on"
>>>> primitive p_drbd_mount2 ocf:linbit:drbd \
>>>> params drbd_resource="mount2" \
>>>> op monitor interval="15" role="Master" \
>>>> op monitor interval="30" role="Slave"
>>>> primitive p_drbd_mount1 ocf:linbit:drbd \
>>>> params drbd_resource="mount1" \
>>>> op monitor interval="15" role="Master" \
>>>> op monitor interval="30" role="Slave"
>>>> primitive p_drbd_vmstore ocf:linbit:drbd \
>>>> params drbd_resource="vmstore" \
>>>> op monitor interval="15" role="Master" \
>>>> op monitor interval="30" role="Slave"
>>>> primitive p_fs_vmstore ocf:heartbeat:Filesystem \
>>>> params device="/dev/drbd0" directory="/vmstore" fstype="ext4" \
>>>> op start interval="0" timeout="60s" \
>>>> op stop interval="0" timeout="60s" \
>>>> op monitor interval="20s" timeout="40s"
>>>> primitive p_libvirt-bin upstart:libvirt-bin \
>>>> op monitor interval="30"
>>>> primitive p_ping ocf:pacemaker:ping \
>>>> params name="p_ping" host_list="192.168.1.10 192.168.1.11"
>>>> multiplier="1000" \
>>>> op monitor interval="20s"
>>>> primitive p_sysadmin_notify ocf:heartbeat:MailTo \
>>>> params email="me [at] example <mailto:me [at] example>" \
>>>> params subject="Pacemaker Change" \
>>>> op start interval="0" timeout="10" \
>>>> op stop interval="0" timeout="10" \
>>>> op monitor interval="10" timeout="10"
>>>> primitive p_vm ocf:heartbeat:VirtualDomain \
>>>> params config="/vmstore/config/vm.xml" \
>>>> meta allow-migrate="false" \
>>>> op start interval="0" timeout="120s" \
>>>> op stop interval="0" timeout="120s" \
>>>> op monitor interval="10" timeout="30"
>>>> primitive stonith-node1 stonith:external/tripplitepdu \
>>>> params pdu_ipaddr="192.168.1.12" pdu_port="1" pdu_username="xxx"
>>>> pdu_password="xxx" hostname_to_stonith="node1"
>>>> primitive stonith-node2 stonith:external/tripplitepdu \
>>>> params pdu_ipaddr="192.168.1.12" pdu_port="2" pdu_username="xxx"
>>>> pdu_password="xxx" hostname_to_stonith="node2"
>>>> group g_daemons p_libvirt-bin
>>>> group g_vm p_fs_vmstore p_vm
>>>> ms ms_drbd_mount2 p_drbd_mount2 \
>>>> meta master-max="1" master-node-max="1" clone-max="2"
>>>> clone-node-max="1" notify="true"
>>>> ms ms_drbd_mount1 p_drbd_mount1 \
>>>> meta master-max="1" master-node-max="1" clone-max="2"
>>>> clone-node-max="1" notify="true"
>>>> ms ms_drbd_vmstore p_drbd_vmstore \
>>>> meta master-max="1" master-node-max="1" clone-max="2"
>>>> clone-node-max="1" notify="true"
>>>> clone cl_daemons g_daemons
>>>> clone cl_ping p_ping \
>>>> meta interleave="true"
>>>> clone cl_sysadmin_notify p_sysadmin_notify
>>>> location l-st-node1 stonith-node1 -inf: node1
>>>> location l-st-node2 stonith-node2 -inf: node2
>>>> location l_run_on_most_connected p_vm \
>>>> rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping
>>>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master
>>>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
>>>
>>> As Emmanuel already said, g_vm has to be in the first place in this
>>> collocation constraint .... g_vm must be colocated with the drbd masters.
>>>
>>>> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote
>>>> ms_drbd_mount2:promote cl_daemons:start g_vm:start
>>>> property $id="cib-bootstrap-options" \
>>>> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
>>>> cluster-infrastructure="Heartbeat" \
>>>> stonith-enabled="false" \
>>>> no-quorum-policy="stop" \
>>>> last-lrm-refresh="1332539900" \
>>>> cluster-recheck-interval="5m" \
>>>> crmd-integration-timeout="3m" \
>>>> shutdown-escalation="5m"
>>>>
>>>> The STONITH plugin is a custom plugin I wrote for the Tripp-Lite
>>>> PDUMH20ATNET that I'm using as the STONITH device:
>>>> http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf
>>>
>>> And why don't using it? .... stonith-enabled="false"
>>>
>>>>
>>>> As you can see, I left the DRBD service to be started by the operating
>>>> system (as an lsb script at boot time) however Pacemaker controls
>>>> actually bringing up/taking down the individual DRBD devices.
>>>
>>> Don't start drbd on system boot, give Pacemaker the full control.
>>>
>>> The
>>>> behavior I observe is as follows: I issue "crm resource migrate p_vm" on
>>>> node1 and failover successfully to node2. During this time, node2 fences
>>>> node1's DRBD devices (using dopd) and marks them as Outdated. Meanwhile
>>>> node2's DRBD devices are UpToDate. I then shutdown both nodes and then
>>>> bring them back up. They reconnect to the cluster (with quorum), and
>>>> node1's DRBD devices are still Outdated as expected and node2's DRBD
>>>> devices are still UpToDate, as expected. At this point, DRBD starts on
>>>> both nodes, however node2 will not set DRBD as master:
>>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE
> (standby)
>>>> Online: [ node2 node1 ]
>>>>
>>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>>>> Slaves: [ node1 node2 ]
>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>>>> Slaves: [ node1 node 2 ]
>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>>> Slaves: [ node1 node2 ]
>>>
>>> There should really be no interruption of the drbd replication on vm
>>> migration that activates the dopd ... drbd has its own direct network
>>> connection?
>>>
>>> Please share your ha.cf file and your drbd configuration. Watch out for
>>> drbd messages in your kernel log file, that should give you additional
>>> information when/why the drbd connection was lost.
>>>
>>> Regards,
>>> Andreas
>>>
>>> --
>>> Need help with Pacemaker?
>>> http://www.hastexo.com/now
>>>
>>>>
>>>> I am having trouble sorting through the logging information because
>>>> there is so much of it in /var/log/daemon.log, but I can't find an
>>>> error message printed about why it will not promote node2. At this point
>>>> the DRBD devices are as follows:
>>>> node2: cstate = WFConnection dstate=UpToDate
>>>> node1: cstate = StandAlone dstate=Outdated
>>>>
>>>> I don't see any reason why node2 can't become DRBD master, or am I
>>>> missing something? If I do "drbdadm connect all" on node1, then the
>>>> cstate on both nodes changes to "Connected" and node2 immediately
>>>> promotes the DRBD resources to master. Any ideas on why I'm observing
>>>> this incorrect behavior?
>>>>
>>>> Any tips on how I can better filter through the pacemaker/heartbeat logs
>>>> or how to get additional useful debug information?
>>>>
>>>> Thanks,
>>>>
>>>> Andrew
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From: *"Andreas Kurz" <andreas [at] hastexo
> <mailto:andreas [at] hastexo>>
>>>> *To: *pacemaker [at] oss
>> <mailto:*pacemaker [at] oss>
>>>> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM
>>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
>>>> master on failover
>>>>
>>>> On 01/25/2012 08:58 PM, Andrew Martin wrote:
>>>>> Hello,
>>>>>
>>>>> Recently I finished configuring a two-node cluster with pacemaker 1.1.6
>>>>> and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster
> includes
>>>>> the following resources:
>>>>> - primitives for DRBD storage devices
>>>>> - primitives for mounting the filesystem on the DRBD storage
>>>>> - primitives for some mount binds
>>>>> - primitive for starting apache
>>>>> - primitives for starting samba and nfs servers (following instructions
>>>>> here <http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf>)
>>>>> - primitives for exporting nfs shares (ocf:heartbeat:exportfs)
>>>>
>>>> not enough information ... please share at least your complete cluster
>>>> configuration
>>>>
>>>> Regards,
>>>> Andreas
>>>>
>>>> --
>>>> Need help with Pacemaker?
>>>> http://www.hastexo.com/now
>>>>
>>>>>
>>>>> Perhaps this is best described through the output of crm_mon:
>>>>> Online: [ node1 node2 ]
>>>>>
>>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
>>>>> p_drbd_mount1:0 (ocf::linbit:drbd): Started node2
>>> (unmanaged)
>>>>> p_drbd_mount1:1 (ocf::linbit:drbd): Started node1
>>>>> (unmanaged) FAILED
>>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>>>> p_drbd_mount2:0 (ocf::linbit:drbd): Master node1
>>>>> (unmanaged) FAILED
>>>>> Slaves: [ node2 ]
>>>>> Resource Group: g_core
>>>>> p_fs_mount1 (ocf::heartbeat:Filesystem): Started node1
>>>>> p_fs_mount2 (ocf::heartbeat:Filesystem): Started node1
>>>>> p_ip_nfs (ocf::heartbeat:IPaddr2): Started node1
>>>>> Resource Group: g_apache
>>>>> p_fs_mountbind1 (ocf::heartbeat:Filesystem): Started node1
>>>>> p_fs_mountbind2 (ocf::heartbeat:Filesystem): Started node1
>>>>> p_fs_mountbind3 (ocf::heartbeat:Filesystem): Started node1
>>>>> p_fs_varwww (ocf::heartbeat:Filesystem): Started node1
>>>>> p_apache (ocf::heartbeat:apache): Started node1
>>>>> Resource Group: g_fileservers
>>>>> p_lsb_smb (lsb:smbd): Started node1
>>>>> p_lsb_nmb (lsb:nmbd): Started node1
>>>>> p_lsb_nfsserver (lsb:nfs-kernel-server): Started node1
>>>>> p_exportfs_mount1 (ocf::heartbeat:exportfs): Started node1
>>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): Started
> node1
>>>>>
>>>>> I have read through the Pacemaker Explained
>>>>>
>>>>
>>>
> <http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained>
>>>>> documentation, however could not find a way to further debug these
>>>>> problems. First, I put node1 into standby mode to attempt failover to
>>>>> the other node (node2). Node2 appeared to start the transition to
>>>>> master, however it failed to promote the DRBD resources to master (the
>>>>> first step). I have attached a copy of this session in commands.log and
>>>>> additional excerpts from /var/log/syslog during important steps. I have
>>>>> attempted everything I can think of to try and start the DRBD resource
>>>>> (e.g. start/stop/promote/manage/cleanup under crm resource, restarting
>>>>> heartbeat) but cannot bring it out of the slave state. However, if
> I set
>>>>> it to unmanaged and then run drbdadm primary all in the terminal,
>>>>> pacemaker is satisfied and continues starting the rest of the
> resources.
>>>>> It then failed when attempting to mount the filesystem for mount2, the
>>>>> p_fs_mount2 resource. I attempted to mount the filesystem myself
> and was
>>>>> successful. I then unmounted it and ran cleanup on p_fs_mount2 and then
>>>>> it mounted. The rest of the resources started as expected until the
>>>>> p_exportfs_mount2 resource, which failed as follows:
>>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2
>>>>> (unmanaged) FAILED
>>>>>
>>>>> I ran cleanup on this and it started, however when running this test
>>>>> earlier today no command could successfully start this exportfs
>> resource.
>>>>>
>>>>> How can I configure pacemaker to better resolve these problems and be
>>>>> able to bring the node up successfully on its own? What can I check to
>>>>> determine why these failures are occuring? /var/log/syslog did not seem
>>>>> to contain very much useful information regarding why the failures
>>>> occurred.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Andrew
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> This body part will be downloaded on demand.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker [at] oss
>> <mailto:Pacemaker [at] oss>
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker [at] oss
>> <mailto:Pacemaker [at] oss>
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>>
>>>
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker [at] oss
>> <mailto:Pacemaker [at] oss>
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>>>
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker [at] oss
>> <mailto:Pacemaker [at] oss>
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker [at] oss
>> <mailto:Pacemaker [at] oss>
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker [at] oss
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


emi2fast at gmail

Mar 30, 2012, 7:43 AM

Post #16 of 16 (1662 views)
Permalink
Re: Nodes will not promote DRBD resources to master on failover [In reply to]

can you show me?

crm configure show

Il giorno 30 marzo 2012 16:10, Andrew Martin <amartin [at] xes-inc> ha
scritto:

> Hi Andreas,
>
> Here is a copy of my complete CIB:
> http://pastebin.com/v5wHVFuy
>
> I'll work on generating a report using crm_report as well.
>
> Thanks,
>
> Andrew
>
> ------------------------------
> *From: *"Andreas Kurz" <andreas [at] hastexo>
> *To: *pacemaker [at] oss
> *Sent: *Friday, March 30, 2012 4:41:16 AM
> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
> master on failover
>
> On 03/28/2012 04:56 PM, Andrew Martin wrote:
> > Hi Andreas,
> >
> > I disabled the DRBD init script and then restarted the slave node
> > (node2). After it came back up, DRBD did not start:
> > Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): pending
> > Online: [ node2 node1 ]
> >
> > Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
> > Masters: [ node1 ]
> > Stopped: [ p_drbd_vmstore:1 ]
> > Master/Slave Set: ms_drbd_mount1 [p_drbd_tools]
> > Masters: [ node1 ]
> > Stopped: [ p_drbd_mount1:1 ]
> > Master/Slave Set: ms_drbd_mount2 [p_drbdmount2]
> > Masters: [ node1 ]
> > Stopped: [ p_drbd_mount2:1 ]
> > ...
> >
> > root [at] node:~# service drbd status
> > drbd not loaded
>
> Yes, expected unless Pacemaker starts DRBD
>
> >
> > Is there something else I need to change in the CIB to ensure that DRBD
> > is started? All of my DRBD devices are configured like this:
> > primitive p_drbd_mount2 ocf:linbit:drbd \
> > params drbd_resource="mount2" \
> > op monitor interval="15" role="Master" \
> > op monitor interval="30" role="Slave"
> > ms ms_drbd_mount2 p_drbd_mount2 \
> > meta master-max="1" master-node-max="1" clone-max="2"
> > clone-node-max="1" notify="true"
>
> That should be enough ... unable to say more without seeing the complete
> configuration ... too much fragments of information ;-)
>
> Please provide (e.g. pastebin) your complete cib (cibadmin -Q) when
> cluster is in that state ... or even better create a crm_report archive
>
> >
> > Here is the output from the syslog (grep -i drbd /var/log/syslog):
> > Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing
> > key=12:315:7:24416169-73ba-469b-a2e3-56a22b437cbc
> > op=p_drbd_vmstore:1_monitor_0 )
> > Mar 28 09:24:47 node2 lrmd: [3210]: info: rsc:p_drbd_vmstore:1 probe[2]
> > (pid 3455)
> > Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing
> > key=13:315:7:24416169-73ba-469b-a2e3-56a22b437cbc
> > op=p_drbd_mount1:1_monitor_0 )
> > Mar 28 09:24:48 node2 lrmd: [3210]: info: rsc:p_drbd_mount1:1 probe[3]
> > (pid 3456)
> > Mar 28 09:24:48 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing
> > key=14:315:7:24416169-73ba-469b-a2e3-56a22b437cbc
> > op=p_drbd_mount2:1_monitor_0 )
> > Mar 28 09:24:48 node2 lrmd: [3210]: info: rsc:p_drbd_mount2:1 probe[4]
> > (pid 3457)
> > Mar 28 09:24:48 node2 Filesystem[3458]: [3517]: WARNING: Couldn't find
> > device [/dev/drbd0]. Expected /dev/??? to exist
> > Mar 28 09:24:48 node2 crm_attribute: [3563]: info: Invoked:
> > crm_attribute -N node2 -n master-p_drbd_mount2:1 -l reboot -D
> > Mar 28 09:24:48 node2 crm_attribute: [3557]: info: Invoked:
> > crm_attribute -N node2 -n master-p_drbd_vmstore:1 -l reboot -D
> > Mar 28 09:24:48 node2 crm_attribute: [3562]: info: Invoked:
> > crm_attribute -N node2 -n master-p_drbd_mount1:1 -l reboot -D
> > Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[4] on
> > p_drbd_mount2:1 for client 3213: pid 3457 exited with return code 7
> > Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[2] on
> > p_drbd_vmstore:1 for client 3213: pid 3455 exited with return code 7
> > Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM
> > operation p_drbd_mount2:1_monitor_0 (call=4, rc=7, cib-update=10,
> > confirmed=true) not running
> > Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[3] on
> > p_drbd_mount1:1 for client 3213: pid 3456 exited with return code 7
> > Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM
> > operation p_drbd_vmstore:1_monitor_0 (call=2, rc=7, cib-update=11,
> > confirmed=true) not running
> > Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM
> > operation p_drbd_mount1:1_monitor_0 (call=3, rc=7, cib-update=12,
> > confirmed=true) not running
>
> No errors, just probing ... so for any reason Pacemaker does not like to
> start it ... use crm_simulate to find out why ... or provide information
> as requested above.
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
> >
> > Thanks,
> >
> > Andrew
> >
> > ------------------------------------------------------------------------
> > *From: *"Andreas Kurz" <andreas [at] hastexo>
> > *To: *pacemaker [at] oss
> > *Sent: *Wednesday, March 28, 2012 9:03:06 AM
> > *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
> > master on failover
> >
> > On 03/28/2012 03:47 PM, Andrew Martin wrote:
> >> Hi Andreas,
> >>
> >>> hmm ... what is that fence-peer script doing? If you want to use
> >>> resource-level fencing with the help of dopd, activate the
> >>> drbd-peer-outdater script in the line above ... and double check if the
> >>> path is correct
> >> fence-peer is just a wrapper for drbd-peer-outdater that does some
> >> additional logging. In my testing dopd has been working well.
> >
> > I see
> >
> >>
> >>>> I am thinking of making the following changes to the CIB (as per the
> >>>> official DRBD
> >>>> guide
> >>
> > http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html)
> in
> >>>> order to add the DRBD lsb service and require that it start before the
> >>>> ocf:linbit:drbd resources. Does this look correct?
> >>>
> >>> Where did you read that? No, deactivate the startup of DRBD on system
> >>> boot and let Pacemaker manage it completely.
> >>>
> >>>> primitive p_drbd-init lsb:drbd op monitor interval="30"
> >>>> colocation c_drbd_together inf:
> >>>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master
> >>>> ms_drbd_mount2:Master
> >>>> order drbd_init_first inf: ms_drbd_vmstore:promote
> >>>> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start
> >>>>
> >>>> This doesn't seem to require that drbd be also running on the node
> where
> >>>> the ocf:linbit:drbd resources are slave (which it would need to do to
> be
> >>>> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere?
> >>>> (clone cl_drbd p_drbd-init ?)
> >>>
> >>> This is really not needed.
> >> I was following the official DRBD Users Guide:
> >>
> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html
> >>
> >> If I am understanding your previous message correctly, I do not need to
> >> add a lsb primitive for the drbd daemon? It will be
> >> started/stopped/managed automatically by my ocf:linbit:drbd resources
> >> (and I can remove the /etc/rc* symlinks)?
> >
> > Yes, you don't need that LSB script when using Pacemaker and should not
> > let init start it.
> >
> > Regards,
> > Andreas
> >
> > --
> > Need help with Pacemaker?
> > http://www.hastexo.com/now
> >
> >>
> >> Thanks,
> >>
> >> Andrew
> >>
> >> ------------------------------------------------------------------------
> >> *From: *"Andreas Kurz" <andreas [at] hastexo <mailto:andreas [at] hastexo
> >>
> >> *To: *pacemaker [at] oss <mailto:
> pacemaker [at] oss>
> >> *Sent: *Wednesday, March 28, 2012 7:27:34 AM
> >> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
> >> master on failover
> >>
> >> On 03/28/2012 12:13 AM, Andrew Martin wrote:
> >>> Hi Andreas,
> >>>
> >>> Thanks, I've updated the colocation rule to be in the correct order. I
> >>> also enabled the STONITH resource (this was temporarily disabled before
> >>> for some additional testing). DRBD has its own network connection over
> >>> the br1 interface (192.168.5.0/24 network), a direct crossover cable
> >>> between node1 and node2:
> >>> global { usage-count no; }
> >>> common {
> >>> syncer { rate 110M; }
> >>> }
> >>> resource vmstore {
> >>> protocol C;
> >>> startup {
> >>> wfc-timeout 15;
> >>> degr-wfc-timeout 60;
> >>> }
> >>> handlers {
> >>> #fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t
> 5";
> >>> fence-peer "/usr/local/bin/fence-peer";
> >>
> >> hmm ... what is that fence-peer script doing? If you want to use
> >> resource-level fencing with the help of dopd, activate the
> >> drbd-peer-outdater script in the line above ... and double check if the
> >> path is correct
> >>
> >>> split-brain "/usr/lib/drbd/notify-split-brain.sh
> >>> me [at] example <mailto:me [at] example>";
> >>> }
> >>> net {
> >>> after-sb-0pri discard-zero-changes;
> >>> after-sb-1pri discard-secondary;
> >>> after-sb-2pri disconnect;
> >>> cram-hmac-alg md5;
> >>> shared-secret "xxxxx";
> >>> }
> >>> disk {
> >>> fencing resource-only;
> >>> }
> >>> on node1 {
> >>> device /dev/drbd0;
> >>> disk /dev/sdb1;
> >>> address 192.168.5.10:7787;
> >>> meta-disk internal;
> >>> }
> >>> on node2 {
> >>> device /dev/drbd0;
> >>> disk /dev/sdf1;
> >>> address 192.168.5.11:7787;
> >>> meta-disk internal;
> >>> }
> >>> }
> >>> # and similar for mount1 and mount2
> >>>
> >>> Also, here is my ha.cf. It uses both the direct link between the nodes
> >>> (br1) and the shared LAN network on br0 for communicating:
> >>> autojoin none
> >>> mcast br0 239.0.0.43 694 1 0
> >>> bcast br1
> >>> warntime 5
> >>> deadtime 15
> >>> initdead 60
> >>> keepalive 2
> >>> node node1
> >>> node node2
> >>> node quorumnode
> >>> crm respawn
> >>> respawn hacluster /usr/lib/heartbeat/dopd
> >>> apiauth dopd gid=haclient uid=hacluster
> >>>
> >>> I am thinking of making the following changes to the CIB (as per the
> >>> official DRBD
> >>> guide
> >>
> > http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html)
> in
> >>> order to add the DRBD lsb service and require that it start before the
> >>> ocf:linbit:drbd resources. Does this look correct?
> >>
> >> Where did you read that? No, deactivate the startup of DRBD on system
> >> boot and let Pacemaker manage it completely.
> >>
> >>> primitive p_drbd-init lsb:drbd op monitor interval="30"
> >>> colocation c_drbd_together inf:
> >>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master
> >>> ms_drbd_mount2:Master
> >>> order drbd_init_first inf: ms_drbd_vmstore:promote
> >>> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start
> >>>
> >>> This doesn't seem to require that drbd be also running on the node
> where
> >>> the ocf:linbit:drbd resources are slave (which it would need to do to
> be
> >>> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere?
> >>> (clone cl_drbd p_drbd-init ?)
> >>
> >> This is really not needed.
> >>
> >> Regards,
> >> Andreas
> >>
> >> --
> >> Need help with Pacemaker?
> >> http://www.hastexo.com/now
> >>
> >>>
> >>> Thanks,
> >>>
> >>> Andrew
> >>>
> ------------------------------------------------------------------------
> >>> *From: *"Andreas Kurz" <andreas [at] hastexo <mailto:
> andreas [at] hastexo>>
> >>> *To: *pacemaker [at] oss
> > <mailto:*pacemaker [at] oss>
> >>> *Sent: *Monday, March 26, 2012 5:56:22 PM
> >>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
> >>> master on failover
> >>>
> >>> On 03/24/2012 08:15 PM, Andrew Martin wrote:
> >>>> Hi Andreas,
> >>>>
> >>>> My complete cluster configuration is as follows:
> >>>> ============
> >>>> Last updated: Sat Mar 24 13:51:55 2012
> >>>> Last change: Sat Mar 24 13:41:55 2012
> >>>> Stack: Heartbeat
> >>>> Current DC: node2 (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition
> >>>> with quorum
> >>>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
> >>>> 3 Nodes configured, unknown expected votes
> >>>> 19 Resources configured.
> >>>> ============
> >>>>
> >>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE
> > (standby)
> >>>> Online: [ node2 node1 ]
> >>>>
> >>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
> >>>> Masters: [ node2 ]
> >>>> Slaves: [ node1 ]
> >>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
> >>>> Masters: [ node2 ]
> >>>> Slaves: [ node1 ]
> >>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> >>>> Masters: [ node2 ]
> >>>> Slaves: [ node1 ]
> >>>> Resource Group: g_vm
> >>>> p_fs_vmstore(ocf::heartbeat:Filesystem):Started node2
> >>>> p_vm(ocf::heartbeat:VirtualDomain):Started node2
> >>>> Clone Set: cl_daemons [g_daemons]
> >>>> Started: [ node2 node1 ]
> >>>> Stopped: [ g_daemons:2 ]
> >>>> Clone Set: cl_sysadmin_notify [p_sysadmin_notify]
> >>>> Started: [ node2 node1 ]
> >>>> Stopped: [ p_sysadmin_notify:2 ]
> >>>> stonith-node1(stonith:external/tripplitepdu):Started node2
> >>>> stonith-node2(stonith:external/tripplitepdu):Started node1
> >>>> Clone Set: cl_ping [p_ping]
> >>>> Started: [ node2 node1 ]
> >>>> Stopped: [ p_ping:2 ]
> >>>>
> >>>> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \
> >>>> attributes standby="off"
> >>>> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \
> >>>> attributes standby="off"
> >>>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \
> >>>> attributes standby="on"
> >>>> primitive p_drbd_mount2 ocf:linbit:drbd \
> >>>> params drbd_resource="mount2" \
> >>>> op monitor interval="15" role="Master" \
> >>>> op monitor interval="30" role="Slave"
> >>>> primitive p_drbd_mount1 ocf:linbit:drbd \
> >>>> params drbd_resource="mount1" \
> >>>> op monitor interval="15" role="Master" \
> >>>> op monitor interval="30" role="Slave"
> >>>> primitive p_drbd_vmstore ocf:linbit:drbd \
> >>>> params drbd_resource="vmstore" \
> >>>> op monitor interval="15" role="Master" \
> >>>> op monitor interval="30" role="Slave"
> >>>> primitive p_fs_vmstore ocf:heartbeat:Filesystem \
> >>>> params device="/dev/drbd0" directory="/vmstore" fstype="ext4"
> \
> >>>> op start interval="0" timeout="60s" \
> >>>> op stop interval="0" timeout="60s" \
> >>>> op monitor interval="20s" timeout="40s"
> >>>> primitive p_libvirt-bin upstart:libvirt-bin \
> >>>> op monitor interval="30"
> >>>> primitive p_ping ocf:pacemaker:ping \
> >>>> params name="p_ping" host_list="192.168.1.10 192.168.1.11"
> >>>> multiplier="1000" \
> >>>> op monitor interval="20s"
> >>>> primitive p_sysadmin_notify ocf:heartbeat:MailTo \
> >>>> params email="me [at] example <mailto:me [at] example>" \
> >>>> params subject="Pacemaker Change" \
> >>>> op start interval="0" timeout="10" \
> >>>> op stop interval="0" timeout="10" \
> >>>> op monitor interval="10" timeout="10"
> >>>> primitive p_vm ocf:heartbeat:VirtualDomain \
> >>>> params config="/vmstore/config/vm.xml" \
> >>>> meta allow-migrate="false" \
> >>>> op start interval="0" timeout="120s" \
> >>>> op stop interval="0" timeout="120s" \
> >>>> op monitor interval="10" timeout="30"
> >>>> primitive stonith-node1 stonith:external/tripplitepdu \
> >>>> params pdu_ipaddr="192.168.1.12" pdu_port="1"
> pdu_username="xxx"
> >>>> pdu_password="xxx" hostname_to_stonith="node1"
> >>>> primitive stonith-node2 stonith:external/tripplitepdu \
> >>>> params pdu_ipaddr="192.168.1.12" pdu_port="2"
> pdu_username="xxx"
> >>>> pdu_password="xxx" hostname_to_stonith="node2"
> >>>> group g_daemons p_libvirt-bin
> >>>> group g_vm p_fs_vmstore p_vm
> >>>> ms ms_drbd_mount2 p_drbd_mount2 \
> >>>> meta master-max="1" master-node-max="1" clone-max="2"
> >>>> clone-node-max="1" notify="true"
> >>>> ms ms_drbd_mount1 p_drbd_mount1 \
> >>>> meta master-max="1" master-node-max="1" clone-max="2"
> >>>> clone-node-max="1" notify="true"
> >>>> ms ms_drbd_vmstore p_drbd_vmstore \
> >>>> meta master-max="1" master-node-max="1" clone-max="2"
> >>>> clone-node-max="1" notify="true"
> >>>> clone cl_daemons g_daemons
> >>>> clone cl_ping p_ping \
> >>>> meta interleave="true"
> >>>> clone cl_sysadmin_notify p_sysadmin_notify
> >>>> location l-st-node1 stonith-node1 -inf: node1
> >>>> location l-st-node2 stonith-node2 -inf: node2
> >>>> location l_run_on_most_connected p_vm \
> >>>> rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping
> >>>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master
> >>>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
> >>>
> >>> As Emmanuel already said, g_vm has to be in the first place in this
> >>> collocation constraint .... g_vm must be colocated with the drbd
> masters.
> >>>
> >>>> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote
> >>>> ms_drbd_mount2:promote cl_daemons:start g_vm:start
> >>>> property $id="cib-bootstrap-options" \
> >>>> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
> >>>> cluster-infrastructure="Heartbeat" \
> >>>> stonith-enabled="false" \
> >>>> no-quorum-policy="stop" \
> >>>> last-lrm-refresh="1332539900" \
> >>>> cluster-recheck-interval="5m" \
> >>>> crmd-integration-timeout="3m" \
> >>>> shutdown-escalation="5m"
> >>>>
> >>>> The STONITH plugin is a custom plugin I wrote for the Tripp-Lite
> >>>> PDUMH20ATNET that I'm using as the STONITH device:
> >>>> http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf
> >>>
> >>> And why don't using it? .... stonith-enabled="false"
> >>>
> >>>>
> >>>> As you can see, I left the DRBD service to be started by the operating
> >>>> system (as an lsb script at boot time) however Pacemaker controls
> >>>> actually bringing up/taking down the individual DRBD devices.
> >>>
> >>> Don't start drbd on system boot, give Pacemaker the full control.
> >>>
> >>> The
> >>>> behavior I observe is as follows: I issue "crm resource migrate p_vm"
> on
> >>>> node1 and failover successfully to node2. During this time, node2
> fences
> >>>> node1's DRBD devices (using dopd) and marks them as Outdated.
> Meanwhile
> >>>> node2's DRBD devices are UpToDate. I then shutdown both nodes and then
> >>>> bring them back up. They reconnect to the cluster (with quorum), and
> >>>> node1's DRBD devices are still Outdated as expected and node2's DRBD
> >>>> devices are still UpToDate, as expected. At this point, DRBD starts on
> >>>> both nodes, however node2 will not set DRBD as master:
> >>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE
> > (standby)
> >>>> Online: [ node2 node1 ]
> >>>>
> >>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
> >>>> Slaves: [ node1 node2 ]
> >>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
> >>>> Slaves: [ node1 node 2 ]
> >>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> >>>> Slaves: [ node1 node2 ]
> >>>
> >>> There should really be no interruption of the drbd replication on vm
> >>> migration that activates the dopd ... drbd has its own direct network
> >>> connection?
> >>>
> >>> Please share your ha.cf file and your drbd configuration. Watch out
> for
> >>> drbd messages in your kernel log file, that should give you additional
> >>> information when/why the drbd connection was lost.
> >>>
> >>> Regards,
> >>> Andreas
> >>>
> >>> --
> >>> Need help with Pacemaker?
> >>> http://www.hastexo.com/now
> >>>
> >>>>
> >>>> I am having trouble sorting through the logging information because
> >>>> there is so much of it in /var/log/daemon.log, but I can't find an
> >>>> error message printed about why it will not promote node2. At this
> point
> >>>> the DRBD devices are as follows:
> >>>> node2: cstate = WFConnection dstate=UpToDate
> >>>> node1: cstate = StandAlone dstate=Outdated
> >>>>
> >>>> I don't see any reason why node2 can't become DRBD master, or am I
> >>>> missing something? If I do "drbdadm connect all" on node1, then the
> >>>> cstate on both nodes changes to "Connected" and node2 immediately
> >>>> promotes the DRBD resources to master. Any ideas on why I'm observing
> >>>> this incorrect behavior?
> >>>>
> >>>> Any tips on how I can better filter through the pacemaker/heartbeat
> logs
> >>>> or how to get additional useful debug information?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Andrew
> >>>>
> >>>>
> ------------------------------------------------------------------------
> >>>> *From: *"Andreas Kurz" <andreas [at] hastexo
> > <mailto:andreas [at] hastexo>>
> >>>> *To: *pacemaker [at] oss
> >> <mailto:*pacemaker [at] oss>
> >>>> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM
> >>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
> >>>> master on failover
> >>>>
> >>>> On 01/25/2012 08:58 PM, Andrew Martin wrote:
> >>>>> Hello,
> >>>>>
> >>>>> Recently I finished configuring a two-node cluster with pacemaker
> 1.1.6
> >>>>> and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster
> > includes
> >>>>> the following resources:
> >>>>> - primitives for DRBD storage devices
> >>>>> - primitives for mounting the filesystem on the DRBD storage
> >>>>> - primitives for some mount binds
> >>>>> - primitive for starting apache
> >>>>> - primitives for starting samba and nfs servers (following
> instructions
> >>>>> here <http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf>)
> >>>>> - primitives for exporting nfs shares (ocf:heartbeat:exportfs)
> >>>>
> >>>> not enough information ... please share at least your complete cluster
> >>>> configuration
> >>>>
> >>>> Regards,
> >>>> Andreas
> >>>>
> >>>> --
> >>>> Need help with Pacemaker?
> >>>> http://www.hastexo.com/now
> >>>>
> >>>>>
> >>>>> Perhaps this is best described through the output of crm_mon:
> >>>>> Online: [ node1 node2 ]
> >>>>>
> >>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
> >>>>> p_drbd_mount1:0 (ocf::linbit:drbd): Started node2
> >>> (unmanaged)
> >>>>> p_drbd_mount1:1 (ocf::linbit:drbd): Started node1
> >>>>> (unmanaged) FAILED
> >>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> >>>>> p_drbd_mount2:0 (ocf::linbit:drbd): Master node1
> >>>>> (unmanaged) FAILED
> >>>>> Slaves: [ node2 ]
> >>>>> Resource Group: g_core
> >>>>> p_fs_mount1 (ocf::heartbeat:Filesystem): Started node1
> >>>>> p_fs_mount2 (ocf::heartbeat:Filesystem): Started node1
> >>>>> p_ip_nfs (ocf::heartbeat:IPaddr2): Started node1
> >>>>> Resource Group: g_apache
> >>>>> p_fs_mountbind1 (ocf::heartbeat:Filesystem): Started node1
> >>>>> p_fs_mountbind2 (ocf::heartbeat:Filesystem): Started node1
> >>>>> p_fs_mountbind3 (ocf::heartbeat:Filesystem): Started node1
> >>>>> p_fs_varwww (ocf::heartbeat:Filesystem): Started node1
> >>>>> p_apache (ocf::heartbeat:apache): Started node1
> >>>>> Resource Group: g_fileservers
> >>>>> p_lsb_smb (lsb:smbd): Started node1
> >>>>> p_lsb_nmb (lsb:nmbd): Started node1
> >>>>> p_lsb_nfsserver (lsb:nfs-kernel-server): Started node1
> >>>>> p_exportfs_mount1 (ocf::heartbeat:exportfs): Started
> node1
> >>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): Started
> > node1
> >>>>>
> >>>>> I have read through the Pacemaker Explained
> >>>>>
> >>>>
> >>>
> > <
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained
> >
> >>>>> documentation, however could not find a way to further debug these
> >>>>> problems. First, I put node1 into standby mode to attempt failover to
> >>>>> the other node (node2). Node2 appeared to start the transition to
> >>>>> master, however it failed to promote the DRBD resources to master
> (the
> >>>>> first step). I have attached a copy of this session in commands.log
> and
> >>>>> additional excerpts from /var/log/syslog during important steps. I
> have
> >>>>> attempted everything I can think of to try and start the DRBD
> resource
> >>>>> (e.g. start/stop/promote/manage/cleanup under crm resource,
> restarting
> >>>>> heartbeat) but cannot bring it out of the slave state. However, if
> > I set
> >>>>> it to unmanaged and then run drbdadm primary all in the terminal,
> >>>>> pacemaker is satisfied and continues starting the rest of the
> > resources.
> >>>>> It then failed when attempting to mount the filesystem for mount2,
> the
> >>>>> p_fs_mount2 resource. I attempted to mount the filesystem myself
> > and was
> >>>>> successful. I then unmounted it and ran cleanup on p_fs_mount2 and
> then
> >>>>> it mounted. The rest of the resources started as expected until the
> >>>>> p_exportfs_mount2 resource, which failed as follows:
> >>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2
> >>>>> (unmanaged) FAILED
> >>>>>
> >>>>> I ran cleanup on this and it started, however when running this test
> >>>>> earlier today no command could successfully start this exportfs
> >> resource.
> >>>>>
> >>>>> How can I configure pacemaker to better resolve these problems and be
> >>>>> able to bring the node up successfully on its own? What can I check
> to
> >>>>> determine why these failures are occuring? /var/log/syslog did not
> seem
> >>>>> to contain very much useful information regarding why the failures
> >>>> occurred.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Andrew
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> This body part will be downloaded on demand.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> Pacemaker mailing list: Pacemaker [at] oss
> >> <mailto:Pacemaker [at] oss>
> >>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>>
> >>>> Project Home: http://www.clusterlabs.org
> >>>> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>>> Bugs: http://bugs.clusterlabs.org
> >>>>
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> Pacemaker mailing list: Pacemaker [at] oss
> >> <mailto:Pacemaker [at] oss>
> >>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>>
> >>>> Project Home: http://www.clusterlabs.org
> >>>> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>>> Bugs: http://bugs.clusterlabs.org
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> Pacemaker mailing list: Pacemaker [at] oss
> >> <mailto:Pacemaker [at] oss>
> >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>
> >>> Project Home: http://www.clusterlabs.org
> >>> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>> Bugs: http://bugs.clusterlabs.org
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> Pacemaker mailing list: Pacemaker [at] oss
> >> <mailto:Pacemaker [at] oss>
> >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>
> >>> Project Home: http://www.clusterlabs.org
> >>> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>> Bugs: http://bugs.clusterlabs.org
> >>
> >>
> >>
> >> _______________________________________________
> >> Pacemaker mailing list: Pacemaker [at] oss
> >> <mailto:Pacemaker [at] oss>
> >> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>
> >> Project Home: http://www.clusterlabs.org
> >> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >> Bugs: http://bugs.clusterlabs.org
> >>
> >>
> >>
> >> _______________________________________________
> >> Pacemaker mailing list: Pacemaker [at] oss
> >> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>
> >> Project Home: http://www.clusterlabs.org
> >> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >> Bugs: http://bugs.clusterlabs.org
> >
> >
> >
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker [at] oss
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
> >
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker [at] oss
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>


--
esta es mi vida e me la vivo hasta que dios quiera

Linux-HA pacemaker RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.