
Mailing List Archive: Linux-HA: Users

stonith riloe - nodes kill each other

 

 



jandot at googlemail

Jun 26, 2009, 1:55 AM

Post #1 of 9 (1046 views)
stonith riloe - nodes kill each other

Hi,

I have a very annoying issue with stonith using the external/riloe plugin
(which I have never used before). Whenever I try to simulate a split-brain
condition (using iptables) in order to test stonith, both nodes kill each
other. Not exactly what I expected.

Below is the relevant part of my CIB.

Thanks,
Jan

<crm_config>
  <cluster_property_set id="cib-bootstrap-options">
    <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
            value="1.0.3-0080ec086ae9c20ad5c4c3562000c0ad68374f0a"/>
    <nvpair id="cib-bootstrap-options-expected-quorum-votes"
            name="expected-quorum-votes" value="2"/>
    <nvpair id="nvpair-56c027e0-80c8-49a7-9cf1-1af593a9391f"
            name="no-quorum-policy" value="ignore"/>
    <nvpair id="cib-bootstrap-options-last-lrm-refresh"
            name="last-lrm-refresh" value="1245941613"/>
    <nvpair id="nvpair-1ec8a168-450b-497a-9016-7f21b9cc2fb4"
            name="default-action-timeout" value="60s"/>
    <nvpair id="nvpair-eaeaf6ed-8da1-4cbb-826f-bfb0d6831866"
            name="stonith-action" value="poweroff"/>
    <nvpair id="nvpair-6360fc9e-a158-4fb1-849d-bfa43ca4d728"
            name="startup-fencing" value="false"/>
  </cluster_property_set>
</crm_config>
[...]
<primitive class="stonith" id="stonith-on-xen01" type="external/riloe">
  <meta_attributes id="stonith-on-xen01-meta_attributes">
    <nvpair id="nvpair-073c49cb-8919-43a0-bf26-614e0d56fc98"
            name="target-role" value="started"/>
  </meta_attributes>
  <operations id="stonith-on-xen01-operations">
    <op id="stonith-on-xen01-op-monitor-15" interval="3600"
        name="monitor" start-delay="15" timeout="15"/>
  </operations>
  <instance_attributes id="stonith-on-xen01-instance_attributes">
    <nvpair id="nvpair-5c4be676-42b2-45c7-8508-7ee5fb4cac5c"
            name="hostlist" value="xen02"/>
    <nvpair id="nvpair-3c514c1a-f9af-4421-944d-dc5c5ebf44ab"
            name="ilo_hostname" value="192.168.1.66"/>
    <nvpair id="nvpair-19395c9d-6f56-4d69-b0a1-1694c1039a3b"
            name="ilo_user" value="ha"/>
    <nvpair id="nvpair-eb38a657-c4d3-4801-9457-dee918493c0a"
            name="ilo_password" value="12345678"/>
    <nvpair id="nvpair-992ed6ba-c658-412f-8b64-bc18b4e8c418"
            name="ilo_protocol" value="2.0"/>
    <nvpair id="nvpair-07f3dea5-7e66-484e-a075-2517c92d97f9"
            name="ilo_powerdown_method" value="power"/>
    <nvpair id="nvpair-30623255-ce90-454b-8805-9c630fc051b0"
            name="ilo_can_reset" value="1"/>
  </instance_attributes>
</primitive>
[...]
<primitive class="stonith" id="stonith-on-xen02" type="external/riloe">
  <meta_attributes id="stonith-on-xen02-meta_attributes">
    <nvpair id="nvpair-3c3e2616-5602-4b90-8d46-feb59269320a"
            name="target-role" value="started"/>
  </meta_attributes>
  <operations id="stonith-on-xen02-operations">
    <op id="stonith-on-xen02-op-monitor-15" interval="3600"
        name="monitor" start-delay="15" timeout="15"/>
  </operations>
  <instance_attributes id="stonith-on-xen02-instance_attributes">
    <nvpair id="nvpair-27638270-c354-4568-87f4-d561955c66d9"
            name="hostlist" value="xen01"/>
    <nvpair id="nvpair-f6a13848-c329-4c8e-958f-cab8e2bc5d1f"
            name="ilo_hostname" value="192.168.1.62"/>
    <nvpair id="nvpair-78d03fba-3263-459f-862f-e2ceb3fce758"
            name="ilo_user" value="ha"/>
    <nvpair id="nvpair-7171ff49-ada7-4b63-a1dd-3829fe1d09f6"
            name="ilo_password" value="12345678"/>
    <nvpair id="nvpair-f80def4b-34aa-4299-a07c-7dc7c0b7860d"
            name="ilo_protocol" value="2.0"/>
    <nvpair id="nvpair-c682a750-6798-424e-aa47-7f800c334597"
            name="ilo_powerdown_method" value="power"/>
    <nvpair id="nvpair-057ca3eb-f280-435c-a89b-3921f942a191"
            name="ilo_can_reset" value="1"/>
  </instance_attributes>
</primitive>
[...]
<rsc_location id="local-stonith-on-xen01" node="xen02"
              rsc="stonith-on-xen01" score="-INFINITY"/>
<rsc_location id="local-stonith-on-xen02" node="xen01"
              rsc="stonith-on-xen02" score="-INFINITY"/>
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


andrew at beekhof

Jun 26, 2009, 1:59 AM

Post #2 of 9 (1008 views)
Re: stonith riloe - nodes kill each other [In reply to]

On Fri, Jun 26, 2009 at 10:55 AM, Jan<jandot [at] googlemail> wrote:
> Hi,
>
> a very boring issue with stonith using the plugin external/riloe (never used
> it). Whenever I try to simulate a split-brain condition (using iptables) in
> order to test stonith, both nodes kill each other. Not exactly what
> expected.

Sure it is

[snip]

>        <nvpair id="nvpair-56c027e0-80c8-49a7-9cf1-1af593a9391f"
> name="no-quorum-policy"
> value="ignore"/>

With that option, this is exactly what I'd expect.

Have a read of:
http://ourobengr.com/ha


jandot at googlemail

Jun 26, 2009, 6:07 AM

Post #3 of 9 (1014 views)
Re: stonith riloe - nodes kill each other [In reply to]

Andrew Beekhof wrote:
> On Fri, Jun 26, 2009 at 10:55 AM, Jan<jandot [at] googlemail> wrote:
>
>> Hi,
>>
>> a very boring issue with stonith using the plugin external/riloe (never used
>> it). Whenever I try to simulate a split-brain condition (using iptables) in
>> order to test stonith, both nodes kill each other. Not exactly what
>> expected.
>>
>
> Sure it is
>
> [snip]
>
>
>> <nvpair id="nvpair-56c027e0-80c8-49a7-9cf1-1af593a9391f"
>> name="no-quorum-policy"
>> value="ignore"/>
>>
>
> With that option, this is exactly what I'd expect.
>
> Have a read of:
> http://ourobengr.com/ha
>
From what I understood, possibly wrongly, that should be the right option
for a two-node cluster, where a single node can't have quorum; that's
why it should be "ignore". Is this wrong?

I had already taken a quick look at that document (I love that picture
btw) but not as deeply as now. I am going to review my timeouts for sure.
Anyway, I don't get any hint about the quorum setting. Should it be
different from "ignore"?

My issue isn't exactly the deathmatch described there, first of all
because the openais daemon is disabled at boot and secondly because the
stonith policy is poweroff. Rather, it is a strange situation where both
nodes kill themselves and both shut down.

I wonder if it is a timeout issue. My timeout here for the stonith
resource is 15s. Does it mean that when a stonith is sent by the first
node to the second one and this node can't shut itself down in 15s, it
stoniths the first node?

Thanks,
Jan


andrew at beekhof

Jun 26, 2009, 6:17 AM

Post #4 of 9 (1016 views)
Re: stonith riloe - nodes kill each other [In reply to]

On Fri, Jun 26, 2009 at 3:07 PM, Jan Kalcic<jandot [at] googlemail> wrote:
> For what I understood, probably wrongly, that should be the right option
> for a two nodes cluster, where only one node can't have quorum, that's
> why should be "ignore". Is this wrong?
>
> I had already taken a quick look at that document (I love that picture
> btw) but not as deeply as now. I am going to review my timeout for sure.
> Anyway, I don't get any hint about the quorum setting. Should it be
> different that "ignore"?

No, that's the right value for a two-node cluster.
But that value can also lead to the behavior you described.

Though normally one side shoots the other before it can shoot back.

> My issue isn't exactly the deathmatch described there, first of all
> because the openais daemon is disable at boot and secondly because the
> stonith policy is poweroff. Rather, is a strange situation where both
> nodes kill themselves and they both shutdown.

They'd both be killing each other.

> I wonder if it is a timeout issue. My timeout here for the stonith
> resource is 15s. Does it mean that when a stonith is sent by the first
> node to the second one and this node can't shutdown itself in 15s, it
> stonith the first node?

No. This is unrelated.


jandot at googlemail

Jun 26, 2009, 7:33 AM

Post #5 of 9 (1007 views)
Re: stonith riloe - nodes kill each other [In reply to]

Andrew Beekhof wrote:
> No, thats the right value for a two node cluster.
> But that value can also leads to the behavior you described.
>
> Though normally one side shoots the other before it can shoot back.
>
This does not happen. The reason could be that, using iLO, the node is not
actually shot but gracefully shut down. For this reason the shot node has
all the time it needs to shoot the other side back. Does that make sense?

In this case I would need to stonith the other side not gracefully but
forcefully, like unplugging the cable, but it seems this is not available
with the riloe plugin, is it?

Thanks,
Jan


dejanmm at fastmail

Jun 29, 2009, 6:22 AM

Post #6 of 9 (978 views)
Re: stonith riloe - nodes kill each other [In reply to]

Hi,

On Fri, Jun 26, 2009 at 04:33:30PM +0200, Jan Kalcic wrote:
> This does not happen. The reason could be that usin iLO the node is not
> actually shot but gracefully shutdown. For this reason the shot node has
> all the time to shoot the other side back. Make sense?

Yes, it does.

> In this case I would need to stonith the other side not gracefully but
> strongly like unplugging the cable but it seems this is not available
> with the riloe plugin, is it?

Yes, it is. You should use the latest version of the plugin.

ilo_powerdown_method should be set to power, AFAIK. I think that
does a "cable pull" operation. If you still find a problem
with nodes shooting each other at the same time, please file a
bugzilla. I'm not sure whether that can be fixed; it depends on the
timings when talking to the device.

Thanks,

Dejan





jandot at googlemail

Jul 1, 2009, 9:31 AM

Post #7 of 9 (942 views)
Re: stonith riloe - nodes kill each other [In reply to]

Dejan Muhamedagic wrote:
>> In this case I would need to stonith the other side not gracefully but
>> strongly like unplugging the cable but it seems this is not available
>> with the riloe plugin, is it?
>>
>
> Yes, it is. You should use the latest version of the plugin.
>

I checked the plugin's version and it seems to be the very latest one. It
is the one installed with SLES11-HA. A diff against the plugin available on
the openSUSE Build Service for openSUSE 11.1 reports they are the same.
> ilo_powerdown_method should be set to power, AFAIK. I think that
> that does a "cable pull" operation. If you still find a problem
> with nodes shooting each other at the same time, please file a
> bugzilla. I'm not sure if that can be fixed, depends on the
> timings when talking to the device.
>

I will try with the power option in the next few days. What confuses me
is the description below, which I extracted from the plugin: "power"
takes longer than "button". I would expect it to shoot the node
immediately so that it cannot be stonithed back.

<shortdesc lang="en">Power down method</shortdesc>
<longdesc lang="en">
The method to powerdown the host in question.
* button - Emulate holding down the power button
* power - Emulate turning off the machines power

NB: A button request takes around 20 seconds. The power method
about half a minute.

Thanks,
Jan


jandot at googlemail

Jul 3, 2009, 2:04 AM

Post #8 of 9 (924 views)
Re: stonith riloe - nodes kill each other [In reply to]

Jan Kalcic wrote:
> I will try with the power option in the next few days. What let me
> confused is the description below I extracted from the plugin. "power"
> takes longer than button. I would expect it is shoot the node
> immediately in order to not be stonith back.
>
> <shortdesc lang="en">Power down method</shortdesc>
> <longdesc lang="en">
> The method to powerdown the host in question.
> * button - Emulate holding down the power button
> * power - Emulate turning off the machines power
>
> NB: A button request takes around 20 seconds. The power method
> about half a minute.
>
>
OK, actually the power method was the one I was already using. What I
changed is the stonith action, from poweroff, which shuts the node down
gracefully, to reboot, which does reboot the server but also resets
it within a few seconds. The deathmatch no longer occurs. From the command
line I managed to stonith the node just the way I want, a reset with no
reboot (-T reset), but I could not "move" this command into Pacemaker.
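
For reference, a minimal sketch of that change against the crm_config
section from the first post. It assumes the existing nvpair id is kept and
only the value is switched from "poweroff" to "reboot"; the other
properties in the set are omitted here and stay as they were.

```xml
<crm_config>
  <cluster_property_set id="cib-bootstrap-options">
    <!-- Was value="poweroff" in the CIB from the first post.
         Assumption: the nvpair id is reused unchanged. -->
    <nvpair id="nvpair-eaeaf6ed-8da1-4cbb-826f-bfb0d6831866"
            name="stonith-action" value="reboot"/>
  </cluster_property_set>
</crm_config>
```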

Thanks,
Jan



dejanmm at fastmail

Jul 3, 2009, 3:03 AM

Post #9 of 9 (909 views)
Re: stonith riloe - nodes kill each other [In reply to]

Hi,

On Fri, Jul 03, 2009 at 11:04:11AM +0200, Jan Kalcic wrote:
> Ok, actually the power method was the one I was already using. What I
> changed is the stonith action from poweroff, which shutdown gracefully
> the node, to reboot which actually reboot the server but it also resets
> it in few seconds.

I'm not sure I understand this. poweroff results in a
SET_HOST_POWER request, which should just remove power from
the host. But if ilo_can_reset is '0', then reset is also a
poweroff followed by a poweron. Perhaps you can also set
ilo_can_reset. That would make the plugin use the actual iLO
reset command. Some iLOs don't support it, though.

> Deadthmatch no longer occur. From command line I
> managed to stonith the node just like I want. Reset and with no reboot,
> (-T reset) but I could not "move" this command into pacemaker.

Strange, since the stonith program uses the same plugin.

Thanks,

Dejan

