
Mailing List Archive: Linux-HA: Pacemaker

question about stonith:external/libvirt

 

 



matt at ecsorl

May 19, 2012, 9:40 PM

Post #1 of 5
question about stonith:external/libvirt

After using the tutorial on the Hastexo site for setting up stonith via
libvirt, I believe I have it working correctly...but...some strange
things are happening. I have two nodes, with shared storage provided by
a dual-primary DRBD resource and OCFS2. Here is one of my stonith
primitives:

primitive p_fence-l2 stonith:external/libvirt \
        params hostlist="l2:l2.sandbox" \
                hypervisor_uri="qemu+ssh://matt@hv0/system" \
                stonith-timeout="30" pcmk_host_check="none" \
        op start interval="0" timeout="15" \
        op stop interval="0" timeout="15" \
        op monitor interval="60" \
        meta target-role="Started"

This cluster has stonith-enabled="true" in the cluster options, plus the
necessary location statements in the cib.

To watch the DLM, I run dbench on the shared storage on the node I let
live. While it's running, I creatively nuke the other node. If I just
"killall pacemakerd" on l2 for instance, the DLM seems unaffected and
the fence takes place, rebooting the now "failed" node l2. No real
interruption of service on the surviving node, l3. Yet, if I "halt -f
-n" on l2, the fence still takes place but the surviving node's (l3's)
DLM hangs and won't come back until I bring the failed node back
online. Note that l2 and l3 can be interchanged - the results are the
same. Note that when the DLM is hung as in the latter case, eventually
kernel messages about hung tasks start populating the syslog.

I thought I had recently read some posts concerning this very topic, but
for the life of me I can't find them...
Any ideas on how I should proceed, or what I should look for next?

Thanks!
-- Matt




_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


florian at hastexo

May 21, 2012, 2:43 AM

Post #2 of 5
Re: question about stonith:external/libvirt

On Sun, May 20, 2012 at 6:40 AM, Matthew O'Connor <matt [at] ecsorl> wrote:
> After using the tutorial on the Hastexo site for setting up stonith via
> libvirt, I believe I have it working correctly...but...some strange things
> are happening.  I have two nodes, with shared storage provided by a
> dual-primary DRBD resource and OCFS2.  Here is one of my stonith primitives:
>
> primitive p_fence-l2 stonith:external/libvirt \
>        params hostlist="l2:l2.sandbox" \
>                hypervisor_uri="qemu+ssh://matt@hv0/system" \
>                stonith-timeout="30" pcmk_host_check="none" \
>        op start interval="0" timeout="15" \
>        op stop interval="0" timeout="15" \
>        op monitor interval="60" \
>        meta target-role="Started"
>
> This cluster has stonith-enabled="true" in the cluster options, plus the
> necessary location statements in the cib.

Does it have "fencing resource-and-stonith" in the DRBD configuration,
and stonith_admin-fence-peer.sh as its fence-peer handler?

> To watch the DLM, I run dbench on the shared storage on the node I let live.
>  While it's running, I creatively nuke the other node.  If I just "killall
> pacemakerd" on l2 for instance, the DLM seems unaffected and the fence takes
> place, rebooting the now "failed" node l2.  No real interruption of service
> on the surviving node, l3.  Yet, if I "halt -f -n" on l2, the fence still
> takes place but the surviving node's (l3's) DLM hangs and won't come back
> until I bring the failed node back online.

A hanging DLM is OK, and DLM recovery after the failed node comes back
is OK too, but of course the DLM should also recover once it's
satisfied that the offending node has been properly fenced. Any logs
from stonith-ng on l3?

Cheers,
Florian

--
Need help with High Availability?
http://www.hastexo.com/now



matt at ecsorl

May 21, 2012, 11:14 AM

Post #3 of 5
Re: question about stonith:external/libvirt

On 05/21/2012 05:43 AM, Florian Haas wrote:
> Does it have "fencing resource-and-stonith" in the DRBD configuration,
> and stonith_admin-fence-peer.sh as its fence-peer handler?
That was the problem. Totally forgot to update my DRBD configuration.
For sake of testing, I used the "crm-fence-peer.sh" script - it seemed
to do the trick, although I strongly suspect this is the wrong script
for the job. Do I need to write my own script to call stonith_admin?

Thanks!
-- Matthew






florian at hastexo

May 21, 2012, 11:26 AM

Post #4 of 5
Re: question about stonith:external/libvirt

On Mon, May 21, 2012 at 8:14 PM, Matthew O'Connor <matt [at] ecsorl> wrote:
> On 05/21/2012 05:43 AM, Florian Haas wrote:
>> Does it have "fencing resource-and-stonith" in the DRBD configuration,
>> and stonith_admin-fence-peer.sh as its fence-peer handler?
> That was the problem.  Totally forgot to update my DRBD configuration.

I actually wasn't saying that that was the root cause of your problem.
:) But it's worth looking into, anyhow.

> For sake of testing, I used the "crm-fence-peer.sh" script - it seemed
> to do the trick, although I strongly suspect this is the wrong script
> for the job.

It is. No good for dual-Primary, really, as it doesn't prevent split
brain in that sort of configuration.

> Do I need to write my own script to call stonith_admin?

No, stonith_admin-fence-peer.sh ships with recent DRBD releases.
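
[Archive note, not part of the original message: for readers following
along, the DRBD settings discussed above might look roughly like the
fragment below. It is a sketch under assumptions, not the poster's
actual configuration: the resource name "r0" is hypothetical, and the
handler path is an assumption that depends on how the distribution
packages DRBD, so check where (and whether) the script is installed. In
DRBD 8.x the "fencing" option belongs in the "disk" section.]

```
# Sketch only: "r0" and the script path are assumptions.
resource r0 {
  disk {
    # Freeze I/O and call the fence-peer handler when the peer is lost.
    fencing resource-and-stonith;
  }
  handlers {
    # Ask the cluster's fencing layer to fence the peer node.
    fence-peer "/usr/lib/drbd/stonith_admin-fence-peer.sh";
  }
}
```

[With resource-and-stonith, DRBD suspends I/O on the surviving node
until the handler reports a result, which is why the handler has to
trigger real node-level fencing in a dual-primary setup.]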

Cheers,
Florian




matt at ecsorl

May 21, 2012, 12:35 PM

Post #5 of 5
Re: question about stonith:external/libvirt

On 05/21/2012 02:26 PM, Florian Haas wrote:
> On Mon, May 21, 2012 at 8:14 PM, Matthew O'Connor <matt [at] ecsorl> wrote:
>> On 05/21/2012 05:43 AM, Florian Haas wrote:
>>> Does it have "fencing resource-and-stonith" in the DRBD configuration,
>>> and stonith_admin-fence-peer.sh as its fence-peer handler?
>> That was the problem. Totally forgot to update my DRBD configuration.
> I actually wasn't saying that that was the root cause of your problem.
> :) But it's worth looking into, anyhow.

Ah - well, for sake of barking up the right tree, here is a snippet of
the logs of l2 after l3 was halted, and before making any changes to the
DRBD configuration:

May 19 23:00:13 l2 stonith-ng: [1554]: info: initiate_remote_stonith_op:
Initiating remote operation reboot for l3:
b1374d19-458b-4520-9cbf-e2e5812e6639
May 19 23:00:13 l2 stonith-ng: [1554]: info: can_fence_host_with_device:
p_fence-l3 can fence l3: none
May 19 23:00:13 l2 stonith-ng: [1554]: info: call_remote_stonith:
Requesting that l2 perform op reboot l3
May 19 23:00:13 l2 stonith-ng: [1554]: info: stonith_fence: Exec
<stonith_command t="stonith-ng"
st_async_id="b1374d19-458b-4520-9cbf-e2e5812e6639" st_op="st_fence"
st_callid="0" st_callopt="0"
st_remote_op="b1374d19-458b-4520-9cbf-e2e5812e6639" st_target="l3"
st_device_action="reboot" st_timeout="54000" src="l2" seq="10" />
May 19 23:00:13 l2 stonith-ng: [1554]: info: can_fence_host_with_device:
p_fence-l3 can fence l3: none
May 19 23:00:13 l2 stonith-ng: [1554]: info: stonith_fence: Found 1
matching devices for 'l3'
May 19 23:00:13 l2 stonith-ng: [1554]: info: stonith_command: Processed
st_fence from l2: rc=-1
May 19 23:00:13 l2 stonith-ng: [1554]: info: make_args: reboot-ing node
'l3' as 'port=l3'
May 19 23:00:14 l2 stonith-ng: [1554]: info: stonith_command: Processed
st_execute from lrmd: rc=-1
May 19 23:00:19 l2 stonith-ng: [1554]: info: log_operation: Operation
'reboot' [7042] (call 0 from (null)) for host 'l3' with device
'p_fence-l3' returned: 0
May 19 23:00:19 l2 stonith-ng: [1554]: info: log_operation: p_fence-l3:
Performing: stonith -t external/libvirt -T reset l3
May 19 23:00:19 l2 stonith-ng: [1554]: info: log_operation: p_fence-l3:
success: l3 0
May 19 23:00:19 l2 stonith-ng: [1554]: info:
process_remote_stonith_exec: ExecResult <st-reply
st_origin="stonith_construct_async_reply" t="stonith-ng"
st_op="st_notify" st_remote_op="b1374d19-458b-4520-9cbf-e2e5812e6639"
st_callid="0" st_callopt="0" st_rc="0" st_output="Performing: stonith -t
external/libvirt -T reset l3#012success: l3 0#012" src="l2" seq="11" />
May 19 23:00:19 l2 stonith-ng: [1554]: info: remote_op_done: Notifing
clients of b1374d19-458b-4520-9cbf-e2e5812e6639 (reboot of l3 from
9f36c78b-06c8-4b62-bc84-6cb87b30351b by l2): 2, rc=0
May 19 23:00:19 l2 crmd: [1559]: info: tengine_stonith_callback:
StonithOp <st-reply st_origin="stonith_construct_async_reply"
t="stonith-ng" st_op="reboot"
st_remote_op="b1374d19-458b-4520-9cbf-e2e5812e6639" st_callid="0"
st_callopt="0" st_rc="0" st_output="Performing: stonith -t
external/libvirt -T reset l3#012success: l3 0#012" src="l2" seq="11"
state="2" st_target="l3" />
May 19 23:00:19 l2 stonith-ng: [1554]: info: stonith_notify_client:
Sending st_fence-notification to client
1559/b09a62f6-b077-4181-98da-91f43f40bc9a
May 19 23:00:19 l2 crmd: [1559]: info: tengine_stonith_callback:
StonithOp <st-reply st_origin="stonith_construct_async_reply"
t="stonith-ng" st_op="reboot"
st_remote_op="b1374d19-458b-4520-9cbf-e2e5812e6639" st_callid="0"
st_callopt="0" st_rc="0" st_output="Performing: stonith -t
external/libvirt -T reset l3#012success: l3 0#012" src="l2" seq="11"
state="2" st_target="l3" />
May 19 23:00:19 l2 crmd: [1559]: info: tengine_stonith_callback: Stonith
operation 4/82:118:0:b92bcccd-5765-469c-b56e-392cc065b65c: OK (0)
May 19 23:00:19 l2 crmd: [1559]: info: tengine_stonith_callback: Stonith
of l3 passed
May 19 23:00:19 l2 crmd: [1559]: info: send_stonith_update: Sending
fencing update 358 for l3
May 19 23:00:19 l2 stonith-ng: [1554]: info: stonith_notify_client:
Sending st_fence-notification to client
1559/b09a62f6-b077-4181-98da-91f43f40bc9a
May 19 23:00:19 l2 crmd: [1559]: info: tengine_stonith_notify: Peer l3
was terminated (reboot) by l2 for l2
(ref=b1374d19-458b-4520-9cbf-e2e5812e6639): OK
May 19 23:00:19 l2 crmd: [1559]: notice: tengine_stonith_notify:
Notified CMAN that 'l3' is now fenced
May 19 23:00:19 l2 crmd: [1559]: notice: tengine_stonith_notify:
Confirmed CMAN fencing event for 'l3'


And here is a log snippet from after the DRBD configuration was updated:

May 21 14:36:02 l2 stonith-ng: [1618]: info: initiate_remote_stonith_op:
Initiating remote operation reboot for l3:
9c19ba05-363c-48b4-ade3-d9dac5087866
May 21 14:36:02 l2 stonith-ng: [1618]: info: can_fence_host_with_device:
p_fence-l3 can fence l3: none
May 21 14:36:02 l2 stonith-ng: [1618]: info: call_remote_stonith:
Requesting that l2 perform op reboot l3
May 21 14:36:02 l2 stonith-ng: [1618]: info: stonith_fence: Exec
<stonith_command t="stonith-ng"
st_async_id="9c19ba05-363c-48b4-ade3-d9dac5087866" st_op="st_fence"
st_callid="0" st_callopt="0"
st_remote_op="9c19ba05-363c-48b4-ade3-d9dac5087866" st_target="l3"
st_device_action="reboot" st_timeout="54000" src="l2" seq="20" />
May 21 14:36:02 l2 stonith-ng: [1618]: info: can_fence_host_with_device:
p_fence-l3 can fence l3: none
May 21 14:36:02 l2 stonith-ng: [1618]: info: stonith_fence: Found 1
matching devices for 'l3'
May 21 14:36:02 l2 stonith-ng: [1618]: info: stonith_command: Processed
st_fence from l2: rc=-1
May 21 14:36:02 l2 stonith-ng: [1618]: info: make_args: reboot-ing node
'l3' as 'port=l3'
May 21 14:36:08 l2 stonith-ng: [1618]: info: log_operation: Operation
'reboot' [341] (call 0 from (null)) for host 'l3' with device
'p_fence-l3' returned: 0
May 21 14:36:08 l2 stonith-ng: [1618]: info: log_operation: p_fence-l3:
Performing: stonith -t external/libvirt -T reset l3
May 21 14:36:08 l2 stonith-ng: [1618]: info: log_operation: p_fence-l3:
success: l3 0
May 21 14:36:08 l2 stonith-ng: [1618]: info:
process_remote_stonith_exec: ExecResult <st-reply
st_origin="stonith_construct_async_reply" t="stonith-ng"
st_op="st_notify" st_remote_op="9c19ba05-363c-48b4-ade3-d9dac5087866"
st_callid="0" st_callopt="0" st_rc="0" st_output="Performing: stonith -t
external/libvirt -T reset l3#012success: l3 0#012" src="l2" seq="21" />
May 21 14:36:08 l2 stonith-ng: [1618]: info: remote_op_done: Notifing
clients of 9c19ba05-363c-48b4-ade3-d9dac5087866 (reboot of l3 from
f782c9f8-71e1-4ec2-8f45-93a4b2f7f795 by l2): 2, rc=0
May 21 14:36:08 l2 crmd: [1623]: info: tengine_stonith_callback:
StonithOp <st-reply st_origin="stonith_construct_async_reply"
t="stonith-ng" st_op="reboot"
st_remote_op="9c19ba05-363c-48b4-ade3-d9dac5087866" st_callid="0"
st_callopt="0" st_rc="0" st_output="Performing: stonith -t
external/libvirt -T reset l3#012success: l3 0#012" src="l2" seq="21"
state="2" st_target="l3" />
May 21 14:36:08 l2 crmd: [1623]: info: tengine_stonith_callback: Stonith
operation 5/81:56:0:e647e4db-cb29-4db4-a0bc-b631fc35f5ec: OK (0)
May 21 14:36:08 l2 crmd: [1623]: info: tengine_stonith_callback: Stonith
of l3 passed
May 21 14:36:08 l2 crmd: [1623]: info: send_stonith_update: Sending
fencing update 276 for l3
May 21 14:36:08 l2 stonith-ng: [1618]: info: stonith_notify_client:
Sending st_fence-notification to client
1623/ffe204e9-3d5d-4a11-b605-084d3f61980d
May 21 14:36:08 l2 crmd: [1623]: info: tengine_stonith_notify: Peer l3
was terminated (reboot) by l2 for l2
(ref=9c19ba05-363c-48b4-ade3-d9dac5087866): OK
May 21 14:36:08 l2 stonith-ng: [1618]: info: stonith_device_execute:
Nothing to do for p_fence-l3
May 21 14:36:08 l2 crmd: [1623]: notice: tengine_stonith_notify:
Notified CMAN that 'l3' is now fenced
May 21 14:36:08 l2 crmd: [1623]: notice: tengine_stonith_notify:
Confirmed CMAN fencing event for 'l3'

I am not sure this reveals much, but chances are you will see something
I don't! ;-)

>> For sake of testing, I used the "crm-fence-peer.sh" script - it seemed
>> to do the trick, although I strongly suspect this is the wrong script
>> for the job.
> It is. No good for dual-Primary, really, as it doesn't prevent split
> brain in that sort of configuration.
Yes, that is perfectly sensible.

Perhaps my (still-in-testing) production cluster's problem will be a bit
simpler, then? The DRBD resource there is actually operated in
single-primary mode on a two-node cluster, because it is served up over
iSCSI to another cluster of machines. DLM/OCFS2 do not operate on the
DRBD/iSCSI host cluster, only on the iSCSI client cluster. So, in this
case, would the crm-fence-peer.sh then be sufficient for the DRBD
cluster nodes?


>
>> Do I need to write my own script to call stonith_admin?
> No, stonith_admin-fence-peer.sh ships with recent DRBD releases.
Sadness...not found on Ubuntu 12.04. They are providing v8.3.11. I
will check with them...

Thanks!!
-- Matthew

>
> Cheers,
> Florian
>
