
Mailing List Archive: Linux-HA: Dev

Problems when DC node is STONITH'ed.

 

 


taniguchis at intellilink

Oct 14, 2008, 3:07 AM

Post #1 of 4 (1145 views)
Problems when DC node is STONITH'ed.

Hi,

I found two problems that occur when the DC node is STONITH'ed.
(1) The STONITH operation is executed twice.
(2) The timeout that stonithd on the DC node uses while waiting for
the result of a STONITH op from another node is always set to
"stonith-timeout" in <cluster_property_set>.

Case (1):
(i) Stonithd on the DC sends a request to stonithd on another node.
(ii) The DC node is STONITH'ed!
(iii) Another node becomes DC.
(iv) Nobody can notify the new tengine that the STONITH succeeded.
(v) A transition timeout occurs on the new tengine.
(vi) The new tengine tries to STONITH again.
(vii) The rebooted node (the ex-DC node) is STONITH'ed again!
Is this the expected behavior?
(I think it probably is, because the target node must be STONITH'ed
immediately and cannot wait for the DC to change. I just want to make sure.)
For reference, I have attached logs.
The node named "stkdump2" is the ex-DC and was STONITH'ed.
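
The sequence in case (1) can be pictured with a toy two-node model. This is
only a sketch: the function names and the node names "node1" and "node2" are
hypothetical, and nothing here is taken from the stonithd or tengine sources.
It only shows why losing the requester leads to a second fencing operation.

```python
def simulate_fencing(dc, target, other):
    """Return the log of one fencing round in a toy two-node model.

    All names are hypothetical; this is not stonithd or tengine code.
    """
    log = []
    # (i) The DC asks stonithd on the other node to fence the target.
    log.append(f"{dc}: request {other} to fence {target}")
    # (ii) The other node carries out the operation.
    log.append(f"{other}: fenced {target}")
    if target == dc:
        # (iii)-(iv) The requester is gone, so the result is lost and the
        # other node takes over as DC without knowing the op succeeded.
        log.append(f"{other}: becomes DC, result of the op unknown")
        # (v)-(vii) After the transition timeout, the new DC fences again.
        log.append(f"{other}: transition timeout, fenced {target}")
    else:
        # The requester is still alive and receives the result.
        log.append(f"{dc}: received result for {target}")
    return log


def count_fences(log):
    """Count how many times a node was fenced in the log."""
    return sum("fenced" in entry for entry in log)


for line in simulate_fencing(dc="node2", target="node2", other="node1"):
    print(line)
```

In this model the target is fenced twice only when it is also the DC.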

Case (2):
If this timeout expires in stonithd on the DC
while a non-DC node's stonithd is still trying to reset the DC,
the DC's stonithd sends the request to the other node again,
and two or more STONITH plugins are executed in parallel.
This is a troublesome problem.
I think the most suitable value for this timeout would be
the sum of the "stonith-timeout" values of the STONITH plugins on the node
that is going to receive the STONITH request from the DC node.
But the DC node cannot know that...
I would like to hear your opinions.
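
As a rough illustration of the value proposed in case (2), the arithmetic
would look like the sketch below. It assumes, hypothetically, that the
per-plugin timeouts on the receiving node were somehow available, which is
exactly what the DC cannot know today. The function name, the optional
margin, and the example values are made up for illustration.

```python
def suggested_wait_timeout(plugin_timeouts_s, margin_s=0):
    """Sum the per-plugin timeouts (seconds) of the receiving node.

    plugin_timeouts_s: hypothetical list of "stonith-timeout" values,
    one per STONITH plugin configured on the receiving node.
    margin_s: optional extra slack, an assumption not in the proposal.
    """
    return sum(plugin_timeouts_s) + margin_s


# Hypothetical example: three plugins on the receiving node.
print(suggested_wait_timeout([30, 60, 20]))
```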


Best Regards,
Satomi TANIGUCHI
Attachments: hb_report.tar.gz (88.2 KB)


dejanmm at fastmail

Oct 14, 2008, 6:13 AM

Post #2 of 4 (1054 views)
Re: Problems when DC node is STONITH'ed.

Hi Satomi-san,

On Tue, Oct 14, 2008 at 07:07:00PM +0900, Satomi TANIGUCHI wrote:
> Hi,
>
> I found that there are 2 problems when DC node is STONITH'ed.
> (1) STONITH operation is executed two times.

This has been discussed at length in bugzilla, see

http://developerbugs.linux-foundation.org/show_bug.cgi?id=1904

which was resolved with WONTFIX. In short, it was deemed too risky
to implement a remedy for this problem. Of course, if you think
you can add more to the discussion, please go ahead.

> (2) Timeout-value which stonithd on DC node waits to reply
> the result of STONITH op from other node is
> always set to "stonith-timeout" in <cluster_property_set>.
> [...]
> The case (2):
> When this timeout occurs on stonithd on DC
> during non-DC node's stonithd tries to reset DC,
> DC-stonithd will send a request to other node,
> and two or more STONITH plugins are executed in parallel.
> This is a troublesome problem.
> The most suitable value as this timeout might be
> the sum total of "stonith-timeout" of STONITH plugins on the node
> which is going to receive the STONITH request from DC node, I think.

This would probably be very difficult for the CRM to get.

> But DC node can't know that...
> I would like to hear your opinions.

Sorry, but I couldn't exactly follow. Could you please describe
it in terms of actions?

Thanks,

Dejan

> Best Regards,
> Satomi TANIGUCHI


> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev [at] lists
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/


lmb at suse

Oct 14, 2008, 8:20 AM

Post #3 of 4 (1068 views)
Re: Problems when DC node is STONITH'ed.

On 2008-10-14T15:13:07, Dejan Muhamedagic <dejanmm [at] fastmail> wrote:

> > I found that there are 2 problems when DC node is STONITH'ed.
> > (1) STONITH operation is executed two times.
>
> This has been discussed at length in bugzilla, see
>
> http://developerbugs.linux-foundation.org/show_bug.cgi?id=1904
>
> which was resolved with WONTFIX. In short, it was deemed too risky
> to implement a remedy for this problem. Of course, if you think
> you can add more to the discussion, please go ahead.

As part of shutting down before fencing, Pacemaker could consider
electing a new DC first (if a not-shutting-down node remains). Removing
nodes which are shutting down or "dirty" from the election should not be
too difficult?
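
A minimal sketch of that idea, assuming a hypothetical Node record with
"shutting_down" and "dirty" flags. The real election criteria are not
described in this thread, so picking the lowest name is only a placeholder
tie-break, and none of these names come from Pacemaker itself.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical node record; field names are illustrative only.
    name: str
    shutting_down: bool = False
    dirty: bool = False  # e.g. a node that is about to be fenced


def election_candidates(nodes):
    """Keep only nodes that are neither shutting down nor dirty."""
    return [n for n in nodes if not (n.shutting_down or n.dirty)]


def elect_dc(nodes):
    """Pick a DC among the eligible nodes, or None if nobody is eligible.

    Choosing the lowest name is a placeholder, not the real criterion.
    """
    candidates = election_candidates(nodes)
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.name)


cluster = [Node("node1", dirty=True), Node("node2"),
           Node("node3", shutting_down=True)]
print(elect_dc(cluster).name)
```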

This discussion belongs with pacemaker though, not the linux-ha-dev
list.


Regards,
Lars

--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



taniguchis at intellilink

Oct 15, 2008, 11:43 PM

Post #4 of 4 (1059 views)
Re: [Pacemaker] Re: Problems when DC node is STONITH'ed.

Hi Dejan,


Dejan Muhamedagic wrote:
> Hi Satomi-san,
>
> On Tue, Oct 14, 2008 at 07:07:00PM +0900, Satomi TANIGUCHI wrote:
>> Hi,
>>
>> I found that there are 2 problems when DC node is STONITH'ed.
>> (1) STONITH operation is executed two times.
>
> This has been discussed at length in bugzilla, see
>
> http://developerbugs.linux-foundation.org/show_bug.cgi?id=1904
>
> which was resolved with WONTFIX. In short, it was deemed too risky
> to implement a remedy for this problem. Of course, if you think
> you can add more to the discussion, please go ahead.
Sorry, I missed it.
Thank you for pointing it out!
I understand how it came about.

Ideally, when the DC node is going to be STONITH'ed,
a new DC node would be elected first and would then STONITH the ex-DC;
then these problems would not occur.
But that may not be a good approach in an emergency,
because the ex-DC should be STONITH'ed as soon as possible.

Anyway, I understand that this is the expected behavior, thanks!
But then it seems that tengine still has to keep a timeout while waiting
for stonithd's result, and a long cluster-delay is still required,
because the second STONITH is requested when that transition timeout expires.
I'm afraid I may have misunderstood what Andrew really meant.

>
>> (2) Timeout-value which stonithd on DC node waits to reply
>> the result of STONITH op from other node is
>> always set to "stonith-timeout" in <cluster_property_set>.
>> [...]
>> The case (2):
>> When this timeout occurs on stonithd on DC
>> during non-DC node's stonithd tries to reset DC,
>> DC-stonithd will send a request to other node,
>> and two or more STONITH plugins are executed in parallel.
>> This is a troublesome problem.
>> The most suitable value as this timeout might be
>> the sum total of "stonith-timeout" of STONITH plugins on the node
>> which is going to receive the STONITH request from DC node, I think.
>
> This would probably be very difficult for the CRM to get.
Right, I agree with you.
By the sentence "But DC node can't know that..." I meant that
it is difficult because stonithd on the DC cannot know the
stonith-timeout values on the other node.

>
>> But DC node can't know that...
>> I would like to hear your opinions.
>
> Sorry, but I couldn't exactly follow. Could you please describe
> it in terms of actions.
Sorry, let me restate what I meant.
The timeout that stonithd on the DC uses while waiting for the other node's
stonithd to return should, properly, be longer than the sum of the
"stonith-timeout" values of the STONITH plugins on that node.
But it is very difficult for the DC's stonithd to get those values.
So I would like to hear your opinion on what a suitable and practical
value would be for this timeout, which is set in insert_into_executing_queue().
I hope this conveys what I wanted to say.

For reference, I have attached logs from when this timeout occurred.
The cluster has 3 nodes.
When the DC was going to be STONITH'ed, the DC sent a request to all of the
non-DC nodes, and all of them tried to shut down the DC.
Then the timeout in the DC's stonithd occurred, the DC's stonithd sent the
same request again, and two or more STONITH plugins ran in parallel on every
non-DC node.
(Please see sysstats.txt.)
I want to clarify whether the current behavior is expected or a bug.
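
To make the overlap concrete, here is a small timing sketch. It assumes a
simplified model in which the DC re-sends the request every time its wait
timeout expires and each request starts a plugin run that lasts a fixed time
on the remote node. Both the model and the numbers are hypothetical; the
logs above only show that at least one re-send happened.

```python
def request_send_times(wait_timeout_s, remote_duration_s):
    """Return the times (seconds) at which the DC sends a request.

    The DC sends at t=0 and again at each expiry of its wait timeout,
    until the first remote run would have finished. Every returned time
    is the start of a plugin run that is still active at that point.
    """
    if wait_timeout_s <= 0:
        raise ValueError("wait_timeout_s must be positive")
    times = []
    t = 0
    while t < remote_duration_s:
        times.append(t)
        t += wait_timeout_s
    return times


# Hypothetical numbers: the DC waits 30 s, the remote run needs 70 s.
print(request_send_times(30, 70))
# With a wait longer than the remote run, only one request is sent.
print(request_send_times(90, 70))
```

In this model a single plugin run per node is seen only when the wait timeout
is at least as long as the remote run.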

But I think the root of all these problems is that the node which sends the
STONITH request and waits for the op to complete is itself killed.


Regards,
Satomi TANIGUCHI


>
> Thanks,
>
> Dejan
>
>> Best Regards,
>> Satomi TANIGUCHI
>
Attachments: hb_report.tar.gz (48.3 KB)
