Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Linux-HA: Pacemaker

STONITH request failed

 

 

Linux-HA pacemaker RSS feed   Index | Next | Previous | View Threaded


timkom at gmail

Nov 12, 2009, 2:07 AM

Post #1 of 2 (263 views)
Permalink
STONITH request failed

Hi all,

has anybody experienced following.
We were simulating the failure of a resource stop with STONITH enabled. The
node that failed to stop the resource requested to stonith itself. However,
the request was not processed by peer node. It has not logged anything into
log file at that time. Interestingly, this scenario works most of the time.
This was justthe first one but we are not aware of anything that was
modified on the system.
Could anyone try to explain how such situation could occur? Excerpt of logs:
SP1:
pengine[1186]: 2009/11/10_10:09:07 notice: native_print: SP2-stonith
(stonith:external/vsp-ipmi): Started sp2
pengine[1186]: 2009/11/10_10:09:07 notice: native_print: SP1-stonith
(stonith:external/vsp-ipmi): Started sp1
pengine[1186]: 2009/11/10_10:09:07 notice: group_print: Resource Group: udom
pengine[1186]: 2009/11/10_10:09:07 notice: native_print:
udom-drbd-udom0 (heartbeat:drbddisk): Started sp1
pengine[1186]: 2009/11/10_10:09:07 notice: native_print:
udom-drbd-udom1 (heartbeat:drbddisk): Started sp1
pengine[1186]: 2009/11/10_10:09:07 notice: native_print:
udom-drbd-udom2 (heartbeat:drbddisk): Started sp1
pengine[1186]: 2009/11/10_10:09:07 notice: native_print: udom-delay
(ocf::heartbeat:Delay): Started sp1
pengine[1186]: 2009/11/10_10:09:07 notice: native_print: udom-vm
(ocf::heartbeat:Xen): Started sp1
Initiated resource stop:
...
Xen[23346][24689]: 2009/11/10_10:09:38 INFO: Xen domain udom stopped.
crmd[966]: 2009/11/10_10:09:38 info: process_lrm_event: LRM operation
udom-vm_stop_0 (call=27, rc=0, cib-update=121, confirmed=true) complete ok
...
crmd[966]: 2009/11/10_10:09:48 info: process_lrm_event: LRM operation
udom-delay_stop_0 (call=30, rc=0, cib-update=124, confirmed=true) complete
ok
...
lrmd[963]: 2009/11/10_10:09:48 info: RA output:
(udom-drbd-udom2:stop:stderr) 3: State change failed: (-12) Device is held
open by someone
lrmd[963]: 2009/11/10_10:09:48 info: RA output:
(udom-drbd-udom2:stop:stderr) Command '
lrmd[963]: 2009/11/10_10:09:48 info: RA output:
(udom-drbd-udom2:stop:stderr) /sbin/drbdsetup
lrmd[963]: 2009/11/10_10:09:48 info: RA output:
(udom-drbd-udom2:stop:stderr)
lrmd[963]: 2009/11/10_10:09:48 info: RA output:
(udom-drbd-udom2:stop:stderr) 3
lrmd[963]: 2009/11/10_10:09:48 info: RA output:
(udom-drbd-udom2:stop:stderr)
lrmd[963]: 2009/11/10_10:09:48 info: RA output:
(udom-drbd-udom2:stop:stderr) secondary
lrmd[963]: 2009/11/10_10:09:48 info: RA output:
(udom-drbd-udom2:stop:stderr) ' terminated with exit code 11
lrmd[963]: 2009/11/10_10:09:48 info: RA output:
(udom-drbd-udom2:stop:stderr) /sbin/drbdadm secondary udom2: exit code 11,
mapping to 1
crmd[966]: 2009/11/10_10:09:48 info: process_lrm_event: LRM operation
udom-drbd-udom2_stop_0 (call=32, rc=1, cib-update=125, confirmed=true)
complete unknown error
cib[962]: 2009/11/10_10:09:48 info: cib_process_request: Operation complete:
op cib_modify for section 'all' (origin=local/crmd/125): ok (rc=0)
crmd[966]: 2009/11/10_10:09:48 WARN: status_from_rc: Action 22
(udom-drbd-udom2_stop_0) on sp1 failed (target: 0 vs. rc: 1): Error
crmd[966]: 2009/11/10_10:09:48 WARN: update_failcount: Updating failcount
for udom-drbd-udom2 on d3463b4e-6f4a-4ced-8b29-533c3f5c6744 after failed
stop: rc=1 (update=INFINITY, time=1257865788)

Decision to STONITH itself:
pengine[1186]: 2009/11/10_10:10:48 WARN: native_color: Resource SP1-stonith
cannot run anywhere
pengine[1186]: 2009/11/10_10:10:48 WARN: stage6: Scheduling Node sp1 for
STONITH
crmd[966]: 2009/11/10_10:10:48 info: te_fence_node: Executing reboot fencing
operation (36) on sp1 (timeout=60000)
stonithd[964]: 2009/11/10_10:10:48 info: client tengine [pid: 966] requests
a STONITH operation RESET on node sp1
stonithd[964]: 2009/11/10_10:10:48 info: we can't manage sp1, broadcast
request to other nodes
stonithd[964]: 2009/11/10_10:10:48 info: Broadcasting the message succeeded:
require others to stonith node sp1.
stonithd[964]: 2009/11/10_10:11:48 ERROR: Failed to STONITH the node sp1:
optype=RESET, op_result=TIMEOUT
...
crmd[966]: 2009/11/10_10:11:48 info: tengine_stonith_callback: call=-3,
optype=1, node_name=sp1, result=2, node_list=,
action=36:36:0:3f220573-b709-4243-bf70-5cc5b17c7a57
crmd[966]: 2009/11/10_10:11:48 ERROR: tengine_stonith_callback: Stonith of
sp1 failed (2)... aborting transition.
and again
pengine[1186]: 2009/11/10_10:11:48 WARN: stage6: Scheduling Node sp1 for
STONITH
stonithd[964]: 2009/11/10_10:11:48 info: client tengine [pid: 966] requests
a STONITH operation RESET on node sp1
crmd[966]: 2009/11/10_10:11:48 info: te_fence_node: Executing reboot fencing
operation (35) on sp1 (timeout=60000)
stonithd[964]: 2009/11/10_10:12:48 ERROR: Failed to STONITH the node sp1:
optype=RESET, op_result=TIMEOUT
crmd[966]: 2009/11/10_10:12:48 info: tengine_stonith_callback: call=-4,
optype=1, node_name=sp1, result=2, node_list=,
action=35:37:0:3f220573-b709-4243-bf70-5cc5b17c7a57
crmd[966]: 2009/11/10_10:12:48 ERROR: tengine_stonith_callback: Stonith of
sp1 failed (2)... aborting transition.
crmd[966]: 2009/11/10_10:12:48 info: abort_transition_graph:
tengine_stonith_callback:391 - Triggered transition abort (complete=0) :
Stonith failed


While logs of the peer node:
cib[4608]: 2009/11/10_10:08:59 info: retrieveCib: Reading cluster
configuration from: /var/lib/heartbeat/crm/cib.xml (digest:
/var/lib/heartbeat/crm/cib.xml.sig)
cib[4777]: 2009/11/10_10:11:21 info: write_cib_contents: Wrote version
0.67.0 of the CIB to disk (digest: 93bfe20b495b39ace698b42ce04b8e2c)
cib[4777]: 2009/11/10_10:11:21 info: retrieveCib: Reading cluster
configuration from: /var/lib/heartbeat/crm/cib.xml (digest:
/var/lib/heartbeat/crm/cib.xml.sig)
cib[4796]: 2009/11/10_10:13:42 info: write_cib_contents: Wrote version
0.68.0 of the CIB to disk (digest: 9162eeefc80fe9f83876e6324e58c2c7)
cib[4796]: 2009/11/10_10:13:42 info: retrieveCib: Reading cluster
configuration from: /var/lib/heartbeat/crm/cib.xml (digest:
/var/lib/heartbeat/crm/cib.xml.sig)
cib[3110]: 2009/11/10_10:15:35 info: cib_stats: Processed 11 operations
(0.00us average, 0% utilization) in the last 10min

Simply no message to stonithd. Is stonith dependent on something on the
system? Connection between heartbeats worked fine.

# rpm -qa|egrep 'heart|pacem'
libheartbeat2-2.99.2-6.1
heartbeat-common-2.99.2-6.1
heartbeat-resources-2.99.2-6.1
heartbeat-2.99.2-6.1
pacemaker-1.0.2-11.1
libpacemaker3-1.0.2-11.1

# uname -a
Linux sp1 2.6.18-128.7.1.el5xen #1 SMP Mon Aug 24 09:14:33 EDT 2009 x86_64
x86_64 x86_64 GNU/Linux
CentOS5.3

Thanks,

Tino


lmb at suse

Nov 12, 2009, 2:59 AM

Post #2 of 2 (250 views)
Permalink
Re: STONITH request failed [In reply to]

On 2009-11-12T10:07:00, Maros Timko <timkom [at] gmail> wrote:

> # rpm -qa|egrep 'heart|pacem'
> libheartbeat2-2.99.2-6.1
> heartbeat-common-2.99.2-6.1
> heartbeat-resources-2.99.2-6.1
> heartbeat-2.99.2-6.1
> pacemaker-1.0.2-11.1
> libpacemaker3-1.0.2-11.1

Please update to something resembling current versions and try to
reproduce.

I'd also very much recommend to enable logd and write to syslog.
Personally, I find that log format unreadable.


Regards,
Lars

--
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde


_______________________________________________
Pacemaker mailing list
Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Linux-HA pacemaker RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.