
Mailing List Archive: Linux-HA: Users

stonith failed to start

 

 



tinzauro at ha-solutions

Aug 19, 2009, 5:26 AM

Post #1 of 14 (1632 views)
stonith failed to start

List,

I've got a dev cluster up and running with Xen/DRBD/heartbeat working. After a day or so of running, I saw that stonith had
failed to start on node2 (it initially started just fine). I have seen this behavior before with this cluster.

What would cause the stonith 'start' operation to fail after it initially had succeeded?


crm_mon output:
---------------------------
Refresh in 10s...

============
Last updated: Wed Aug 19 06:33:12 2009
Current DC: node1 (47d563cc-f8ec-4b6d-8092-d80ceb64dbbd)
2 Nodes configured.
4 Resources configured.
============

Node: node2 (c95ba6f0-5dcf-41d3-abb0-25e55ae313eb): online
Node: node1 (47d563cc-f8ec-4b6d-8092-d80ceb64dbbd): online

xen1 (heartbeat::ocf:Xen): Started node2
xen2 (heartbeat::ocf:Xen): Started node1
xen3 (heartbeat::ocf:Xen): Started node2
Clone Set: Stonith_Clone_Group
stonithclone:0 (stonith:external/ssh): Started node1
stonithclone:1 (stonith:external/ssh): Stopped

Failed actions:
stonithclone:1_start_0 (node=node2, call=14, rc=1): complete
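
For readers who want to collect these entries automatically, here is a minimal sketch that unpacks a "Failed actions" line into its fields. The field layout is inferred only from the single line shown above, and the names `FailedAction` and `parse_failed_action` are hypothetical helpers, not part of heartbeat or crm_mon.

```python
import re
from typing import NamedTuple


class FailedAction(NamedTuple):
    resource: str
    operation: str
    interval_ms: int
    node: str
    call_id: int
    rc: int
    status: str


# Layout assumed from the crm_mon line above:
#   <resource>_<operation>_<interval> (node=<n>, call=<id>, rc=<code>): <status>
_FAILED = re.compile(
    r"^(?P<resource>.+)_(?P<operation>[a-z]+)_(?P<interval>\d+)\s+"
    r"\(node=(?P<node>[^,]+), call=(?P<call>\d+), rc=(?P<rc>\d+)\):\s*(?P<status>\w+)$"
)


def parse_failed_action(line: str) -> FailedAction:
    """Split one failed-action line into named fields."""
    m = _FAILED.match(line.strip())
    if m is None:
        raise ValueError(f"not a failed-action line: {line!r}")
    return FailedAction(
        m["resource"], m["operation"], int(m["interval"]),
        m["node"], int(m["call"]), int(m["rc"]), m["status"],
    )


action = parse_failed_action(
    "stonithclone:1_start_0 (node=node2, call=14, rc=1): complete"
)
print(action)
```

Here the interval of 0 marks a one-off operation (the start), as opposed to the recurring `monitor_5000` seen in the logs, and the non-zero `rc` is what marks the start as failed.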


At first look, it appears that the monitor operation fails. Heartbeat then tries to start stonith on the failed node and then
the 'start' operation fails as well.

Aug 18 11:02:37 node1 tengine: [3950]: WARN: update_failcount: Updating failcount for stonithclone:1 on c95ba6f0-5dcf-41d3-abb0-25e55ae313eb after failed monitor: rc=14
Aug 18 11:02:37 node1 crmd: [3859]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
Aug 18 11:02:37 node1 crmd: [3859]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Aug 18 11:02:37 node1 pengine: [3951]: info: determine_online_status: Node node1 is online
Aug 18 11:02:37 node1 pengine: [3951]: info: determine_online_status: Node node2 is online
Aug 18 11:02:37 node1 pengine: [3951]: info: unpack_find_resource: Internally renamed stonithclone:0 on node2 to stonithclone:1
Aug 18 11:02:37 node1 pengine: [3951]: WARN: unpack_rsc_op: Processing failed op stonithclone:1_monitor_5000 on node2: Error
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: romulus#011(heartbeat::ocf:Xen):#011Started node2
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: remus#011(heartbeat::ocf:Xen):#011Started node1
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: fortuna#011(heartbeat::ocf:Xen):#011Started node2
Aug 18 11:02:37 node1 pengine: [3951]: notice: clone_print: Clone Set: Stonith_Clone_Group
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: stonithclone:0#011(stonith:external/ssh):#011Started node1
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: stonithclone:1#011(stonith:external/ssh):#011Started node2 FAILED
Aug 18 11:02:37 node1 pengine: [3951]: notice: NoRoleChange: Leave resource xen2#011(node2)
Aug 18 11:02:37 node1 pengine: [3951]: notice: NoRoleChange: Leave resource xen1#011(node1)
Aug 18 11:02:37 node1 pengine: [3951]: notice: NoRoleChange: Leave resource xen3#011(node2)
Aug 18 11:02:37 node1 pengine: [3951]: notice: NoRoleChange: Leave resource stonithclone:0#011(node1)
Aug 18 11:02:37 node1 pengine: [3951]: notice: NoRoleChange: Recover resource stonithclone:1#011(node2)
Aug 18 11:02:37 node1 pengine: [3951]: notice: StopRsc: node2#011Stop stonithclone:1
Aug 18 11:02:37 node1 pengine: [3951]: notice: StartRsc: node2#011Start stonithclone:1
Aug 18 11:02:37 node1 pengine: [3951]: notice: RecurringOp: node2#011 stonithclone:1_monitor_5000
Aug 18 11:02:37 node1 tengine: [3950]: info: extract_event: Aborting on transient_attributes changes for c95ba6f0-5dcf-41d3-abb0-25e55ae313eb
Aug 18 11:02:37 node1 pengine: [3951]: info: process_pe_message: Transition 3: PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-31.bz2
Aug 18 11:02:37 node1 pengine: [3951]: info: determine_online_status: Node node1 is online
Aug 18 11:02:37 node1 pengine: [3951]: info: determine_online_status: Node node2 is online
Aug 18 11:02:37 node1 pengine: [3951]: info: unpack_find_resource: Internally renamed stonithclone:0 on node2 to stonithclone:1
Aug 18 11:02:37 node1 pengine: [3951]: WARN: unpack_rsc_op: Processing failed op stonithclone:1_monitor_5000 on node2: Error
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: xen2#011(heartbeat::ocf:Xen):#011Started node2
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: xen1#011(heartbeat::ocf:Xen):#011Started node1
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: xen3#011(heartbeat::ocf:Xen):#011Started node2
Aug 18 11:02:37 node1 pengine: [3951]: notice: clone_print: Clone Set: Stonith_Clone_Group
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: stonithclone:0#011(stonith:external/ssh):#011Started node1
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: stonithclone:1#011(stonith:external/ssh):#011Started node2 FAILED
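
Most of the excerpt above is info/notice noise. As a rough sketch, the lines that matter can be filtered out with a few lines of Python. The line layout is an assumption inferred from this excerpt (with wrapped lines joined back together), and `warnings_only` is a hypothetical helper name.

```python
import re

# Assumed layout, inferred from the excerpt above:
#   <Mon DD HH:MM:SS> <host> <daemon>: [<pid>]: <level>: <function>: <message>
_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\s(?P<daemon>\w+):\s"
    r"\[(?P<pid>\d+)\]:\s(?P<level>\w+):\s(?P<func>\w+):\s(?P<msg>.*)$"
)


def warnings_only(lines):
    """Yield (stamp, daemon, func, msg) for every WARN or ERROR line.

    Only WARN appears in the excerpt; ERROR is included as an assumption.
    """
    for line in lines:
        m = _LINE.match(line)
        if m and m["level"] in ("WARN", "ERROR"):
            yield m["stamp"], m["daemon"], m["func"], m["msg"]


sample = [
    "Aug 18 11:02:37 node1 pengine: [3951]: info: determine_online_status: Node node2 is online",
    "Aug 18 11:02:37 node1 pengine: [3951]: WARN: unpack_rsc_op: Processing failed op stonithclone:1_monitor_5000 on node2: Error",
]
for hit in warnings_only(sample):
    print(hit)
```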


If the node gets rebooted, it comes back with everything working as expected, for a while then it happens again.


Any insight would be greatly appreciated.



regards,


_Terry


_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


thomas at glanzmann

Aug 20, 2009, 2:01 AM

Post #2 of 14 (1563 views)
Re: stonith failed to start

Hello Terry,

> What would cause the stonith 'start' operation to fail after it
> initially had succeeded?

If my understanding is correct (I wrote a stonith agent for vsphere
yesterday), then it runs the status command of the stonith agent and
looks at the exit status, like this:

(ha-01) [~] VI_SERVER=esx-03.glanzmann.de VI_USERNAME=root /usr/lib/stonith/plugins/external/vsphere status; echo $?
Enter password:
0

Thomas


dejanmm at fastmail

Aug 20, 2009, 4:59 AM

Post #3 of 14 (1573 views)
Re: stonith failed to start

Hi,

On Thu, Aug 20, 2009 at 11:01:43AM +0200, Thomas Glanzmann wrote:
> Hello Terry,
>
> > What would cause the stonith 'start' operation to fail after it
> > initially had succeeded?
>
> If my understanding is correct (I wrote a stonith agent for vsphere
> yesterday), then it runs the status command of the stonith agent and
> looks at the exit status, like this:
>
> (ha-01) [~] VI_SERVER=esx-03.glanzmann.de VI_USERNAME=root /usr/lib/stonith/plugins/external/vsphere status; echo $?
> Enter password:
> 0

Right. The start operation includes a status. If the status
operation fails, the start obviously fails too.

Thanks,

Dejan
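
Since a failing status makes the start fail, one way to look for an intermittent problem is to run the plugin's status command repeatedly by hand and record the exit codes. The following is a minimal sketch under stated assumptions: `status_exit_codes` is a hypothetical helper, and the real invocation for external/ssh (plugin path and the `hostlist` environment variable) is guessed by analogy with the vsphere example above, so check it against your installation.

```python
import os
import subprocess
import sys


def status_exit_codes(command, extra_env, attempts=3):
    """Run `command` `attempts` times and return the list of exit codes."""
    env = dict(os.environ, **extra_env)
    codes = []
    for _ in range(attempts):
        # stdin comes from /dev/null and output is discarded:
        # only the exit code matters for this check.
        proc = subprocess.run(
            command,
            env=env,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        codes.append(proc.returncode)
    return codes


# Stand-in command so the sketch runs anywhere. On a cluster node the command
# might instead be ["/usr/lib/stonith/plugins/external/ssh", "status"]
# (assumed path, by analogy with the vsphere plugin path quoted above).
demo = [sys.executable, "-c", "import sys; sys.exit(0)"]
print(status_exit_codes(demo, {"hostlist": "node1 node2"}))
```

Any non-zero entry in the returned list is a status failure of the kind that would also make a start fail.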



tinzauro at ha-solutions

Aug 20, 2009, 5:55 AM

Post #4 of 14 (1564 views)
Re: stonith failed to start

Dejan Muhamedagic wrote:
> Right. The start operation includes a status. If the status
> operation fails, the start obviously fails too.


Ok, I understand that, but why would it intermittently fail if it initially succeeds? These machines are not heavily loaded
and are by no means slow.

I guess the better question would be: How do I track down the culprit? Obviously, if a monitor or start command fails on the
stonith agent, then it will cause a "stonith reboot" operation of one or both of the nodes, which shouldn't happen unless
there's a definite reason to do so.






andrew at beekhof

Aug 20, 2009, 6:16 AM

Post #5 of 14 (1566 views)
Re: stonith failed to start

On Thu, Aug 20, 2009 at 2:55 PM, Terry L.
Inzauro<tinzauro [at] ha-solutions> wrote:
> Ok, I understand that, but why would it intermittently fail if it initially succeeds? These machines are not heavily loaded
> and are by no means slow.

Could be a timing issue. Some boxes only allow 1 simultaneous connection.



dejanmm at fastmail

Aug 20, 2009, 6:21 AM

Post #6 of 14 (1559 views)
Re: stonith failed to start

Hi,

On Thu, Aug 20, 2009 at 07:55:07AM -0500, Terry L. Inzauro wrote:
> Ok, I understand that, but why would it intermittently fail if
> it initially succeeds? These machines are not heavily loaded
> and are by no means slow.

That obviously depends on your stonith device. Unless you think
it works without problems, in which case you must've found a
well-hidden bug. If so, please file a bugzilla with an hb_report
attached. BTW, external/ssh should be used _only_ for testing
purposes.

> I guess the better question would be: How do I track down the
> culprit? Obviously, if a monitor or start command fails on the
> stonith agent, then it will cause a "stonith reboot" operation
> of one or both of the nodes, which shouldn't happen unless
> there's a definite reason to do so.

No, that shouldn't cause a reboot of any kind, unless you
specifically ask for it.

Thanks,

Dejan


dejanmm at fastmail

Aug 20, 2009, 6:24 AM

Post #7 of 14 (1555 views)
Re: stonith failed to start

On Thu, Aug 20, 2009 at 03:16:56PM +0200, Andrew Beekhof wrote:
> On Thu, Aug 20, 2009 at 2:55 PM, Terry L.
> Inzauro<tinzauro [at] ha-solutions> wrote:
> > Ok, I understand that, but why would it intermittently fail if it initially succeeds? These machines are not heavily loaded
> > and are by no means slow.
>
> Could be a timing issue. Some boxes only allow 1 simultaneous connection.

Oh, right, of course, forgot about that, though in this case
(external/ssh) the device allows multiple simultaneous
connections. But it could also be due to timeouts.

Thanks,

Dejan



tinzauro at ha-solutions

Aug 20, 2009, 6:56 AM

Post #8 of 14 (1564 views)
Re: stonith failed to start

Dejan Muhamedagic wrote:
> On Thu, Aug 20, 2009 at 03:16:56PM +0200, Andrew Beekhof wrote:
>> On Thu, Aug 20, 2009 at 2:55 PM, Terry L.
>> Inzauro<tinzauro [at] ha-solutions> wrote:
>>> Ok, I understand that, but why would it intermittently fail if it initially succeeds? These machines are not heavily loaded
>>> and are by no means slow.
>> Could be a timing issue. Some boxes only allow 1 simultaneous connection.
>
> Oh, right, of course, forgot about that, though in this case
> (external/ssh) the device allows multiple simultaneous
> connections. But it could also be due to timeouts.



Ok. I am indeed using 'external/ssh' as the stonith device. I figure it was better than nothing as I do not have access to
a hardware stonith device. In your opinion, is using the 'external/ssh' plugin 'better' than NOT using a stonith plugin at all?


In the meantime, I'll bump up the timeouts from 20s to 40s and see how it goes.



thanks for the help.


_Terry
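
Before settling on 20s or 40s, it may help to measure how long the status call actually takes. This is a minimal sketch, not the cluster's own timeout mechanism: `timed_run` is a hypothetical helper that runs a command under a wall-clock limit, and the commands below are stand-ins for the real plugin invocation.

```python
import subprocess
import sys
import time


def timed_run(command, timeout_s):
    """Return (elapsed_seconds, exit_code); exit_code is None on timeout."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            command,
            timeout=timeout_s,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        code = proc.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        code = None
    return time.monotonic() - start, code


# A fast stand-in command checked against the 20s budget mentioned above.
print(timed_run([sys.executable, "-c", "pass"], 20))

# A deliberately slow stand-in checked against a very short budget.
print(timed_run([sys.executable, "-c", "import time; time.sleep(5)"], 0.5))
```

If the measured times sit close to the configured timeout, raising it is a plausible fix; if they are far below it, the timeout is probably not the culprit.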




andrew at beekhof

Aug 20, 2009, 6:58 AM

Post #9 of 14 (1557 views)
Re: stonith failed to start

On Thu, Aug 20, 2009 at 3:56 PM, Terry L.
Inzauro<tinzauro [at] ha-solutions> wrote:

> Ok. I am indeed using 'external/ssh' as the stonith device. I figure it was better than nothing as I do not have access to
> a hardware stonith device. In your opinion, is using the 'external/ssh' plugin 'better' than NOT using a stonith plugin at all?

Personally, I think so.
But there are plenty who disagree.


dejanmm at fastmail

Aug 20, 2009, 7:20 AM

Post #10 of 14 (1571 views)
Re: stonith failed to start

Hi,

On Thu, Aug 20, 2009 at 03:58:19PM +0200, Andrew Beekhof wrote:
> On Thu, Aug 20, 2009 at 3:56 PM, Terry L.
> Inzauro<tinzauro [at] ha-solutions> wrote:
>
> > Ok. I am indeed using 'external/ssh' as the stonith device. I figure it was better than nothing as I do not have access to
> > a hardware stonith device. In your opinion, is using the 'external/ssh' plugin 'better' than NOT using a stonith plugin at all?
>
> Personally, I think so.
> But there are plenty who disagree.

Ah, that would include me :)

If the stonith device fails to fence the failing node then there
is no failover and you get zero availability. The probability
that that happens is much higher when using a device such as
external/ssh since it depends on both the network availability
and the OS health. I'll leave it to you to figure out in how many
ways these two dependencies can hinder a fencing operation.

Thanks,

Dejan
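
A rough way to put numbers on that argument, as a sketch only: if fencing over ssh needs both the network path and the target's OS to be working, and the two are treated as independent, the success probability is their product. The figures below are hypothetical, chosen only to show the arithmetic. Independence is also an assumption, and a node that needs fencing is likely to score poorly on both.

```python
def fencing_success(p_network_up, p_os_healthy):
    """Probability that an ssh-style fence works, assuming both
    dependencies must hold and that they fail independently."""
    return p_network_up * p_os_healthy


# Hypothetical inputs: network path usable 99% of the time, and the
# failing node's OS still able to accept an ssh command 95% of the time.
p = fencing_success(0.99, 0.95)
print(f"success: {p:.4f}, failure: {1 - p:.4f}")
```

Each extra dependency can only lower the product, which is the point being made about external/ssh.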




tinzauro at ha-solutions

Aug 20, 2009, 7:56 AM

Post #11 of 14 (1562 views)
Re: stonith failed to start

Dejan Muhamedagic wrote:
> If the stonith device fails to fence the failing node then there
> is no failover and you get zero availability. The probability
> that that happens is much higher when using a device such as
> external/ssh since it depends on both the network availability
> and the OS health. I'll leave it to you to figure out in how many
> ways these two dependencies can hinder a fencing operation.


Ahem.

How many ways to hinder, let me count the ways. Glad I got that out of my system. Now on to the business at hand.

--------------------

There may be many different failures, but I guess I would have to split them into two groups: probable and improbable.

Probable list:
1. Physical network link failure
2. Ethernet switch fabric failure
3. Administrator error (accidentally breaking network configurations, including sshd breakage)

Improbable list:
1. IP stack failure
2. Unexpected OS errors (Linux is pretty stable these days)
3. Ethernet adapter failure (I can't remember the last time I saw an Ethernet card fail)


Having said all that, one can assume that 99% of the failures related to the 'external/ssh' stonith device
are network related. So, my last question is:

Can the 'external/ssh' stonith plugin be configured to be "network fault tolerant"? For instance:

<nvpair id="stonithclone-attr-1" name="hostlist" value="node1 node1-c node2 node2-c"/>

where:
node1 = communications over eth0 and switch0
node1-c = communications over eth1 via xover
node2 = communications over eth0 and switch0
node2-c = communications over eth1 via xover

The desired logic is this:

if node1 communication to node2 fails
then
    use node1-c communications to node2-c
else
    stonith thyself


I would say the probability of both links failing is slim. This setup would then alleviate the "probable" list, right?
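
The fallback idea can be sketched in a few lines. This reads the pseudo-logic above as "try the switched link first, then the crossover, and give up only when both fail", which is an interpretation. It is not a feature of the external/ssh plugin, and the thread does not say whether the plugin can do this. `first_reachable` and the `probe` callback are hypothetical names; a real probe might attempt an ssh connection.

```python
from typing import Callable, Iterable, Optional


def first_reachable(addresses: Iterable[str],
                    probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first address for which probe() succeeds, else None."""
    for address in addresses:
        if probe(address):
            return address
    return None


# Hypothetical scenario: the switched link to node2 is down,
# but the crossover link (node2-c) still works.
link_state = {"node2": False, "node2-c": True}
target = first_reachable(["node2", "node2-c"], lambda a: link_state[a])
print(target or "no path left")
```

Note that this only addresses the network side of the "probable" list; it does nothing for the case where the target's OS is too unhealthy to act on the command.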








andrew at beekhof

Aug 20, 2009, 10:33 AM

Post #12 of 14 (1551 views)
Re: stonith failed to start

On Thu, Aug 20, 2009 at 4:20 PM, Dejan Muhamedagic<dejanmm [at] fastmail> wrote:
> If the stonith device fails to fence the failing node then there
> is no failover and you get zero availability.

But at least your data isn't corrupt.
What's the point of being up if you're serving up garbage?



dejanmm at fastmail

Aug 21, 2009, 8:33 AM

Post #13 of 14 (1529 views)
Re: stonith failed to start

On Thu, Aug 20, 2009 at 07:33:19PM +0200, Andrew Beekhof wrote:
> On Thu, Aug 20, 2009 at 4:20 PM, Dejan Muhamedagic<dejanmm [at] fastmail> wrote:
> > If the stonith device fails to fence the failing node then there
> > is no failover and you get zero availability.
>
> But at least your data isn't corrupt.
> What's the point of being up if you're serving up garbage?

Well, the idea was not to drop fencing, but to get a proper
device. It would be madness to run shared storage and non-cluster
filesystems without stonith. Besides, I really can't imagine a
good argument not to have a fencing device.

Cheers,

Dejan



andrew at beekhof

Aug 24, 2009, 1:20 AM

Post #14 of 14 (1457 views)
Re: stonith failed to start

On Fri, Aug 21, 2009 at 5:33 PM, Dejan Muhamedagic<dejanmm [at] fastmail> wrote:
> Well, the idea was not to drop fencing, but to get a proper
> device.

But that was exactly what he was asking... "Is ssh better than nothing?"
Without question ssh is the worst possible stonith option, but on
balance I still maintain it's better than turning off stonith
completely.