
taniguchis at intellilink
Oct 15, 2008, 11:43 PM
Post #4 of 4
Re: [Pacemaker] Re: Problems when DC node is STONITH'ed.
Hi Dejan,

Dejan Muhamedagic wrote:
> Hi Satomi-san,
>
> On Tue, Oct 14, 2008 at 07:07:00PM +0900, Satomi TANIGUCHI wrote:
>> Hi,
>>
>> I found that there are 2 problems when DC node is STONITH'ed.
>> (1) STONITH operation is executed two times.
>
> This has been discussed at length in bugzilla, see
>
> http://developerbugs.linux-foundation.org/show_bug.cgi?id=1904
>
> which was resolved with WONTFIX. In short, it was deemed too risky
> to implement a remedy for this problem. Of course, if you think
> you can add more to the discussion, please go ahead.

Sorry, I missed it.
Thank you for pointing it out!
I understand how it came about.

Ideally, when the DC node is going to be STONITH'ed, a new DC node would be
elected and it would STONITH the ex-DC; then these problems would not occur.
But maybe that is not a good approach from the viewpoint of urgency, because
the ex-DC should be STONITH'ed as soon as possible.

Anyway, I understand this is expected behavior, thanks!

But then, it seems that tengine still has to keep a timeout while waiting for
stonithd's result, and a long cluster-delay is still required, because the
second STONITH is requested when that transition timeout expires.
I'm afraid I may have misunderstood what Andrew really meant.

>
>> (2) Timeout-value which stonithd on DC node waits to reply
>> the result of STONITH op from other node is
>> always set to "stonith-timeout" in <cluster_property_set>.
>> [...]
>> The case (2):
>> When this timeout occurs on stonithd on DC
>> during non-DC node's stonithd tries to reset DC,
>> DC-stonithd will send a request to other node,
>> and two or more STONITH plugins are executed in parallel.
>> This is a troublesome problem.
>> The most suitable value as this timeout might be
>> the sum total of "stonith-timeout" of STONITH plugins on the node
>> which is going to receive the STONITH request from DC node, I think.
>
> This would probably be very difficult for the CRM to get.

Right, I agree with you.
That is what I meant by the following sentence, "But DC node can't know
that...": it is difficult because stonithd on the DC can't know the values
of stonith-timeout on the other nodes.

>
>> But DC node can't know that...
>> I would like to hear your opinions.
>
> Sorry, but I couldn't exactly follow. Could you please describe
> it in terms of actions.

Sorry, let me restate what I meant.
Strictly speaking, the timeout that stonithd on the DC uses while waiting
for a reply from another node's stonithd needs to be longer than the sum
total of the "stonith-timeout" values of the STONITH plugins on that node.
But it is very difficult for DC-stonithd to get those values.
So I would like to hear your opinion about what a suitable and practical
value is for this timeout, which is set in insert_into_executing_queue().
I hope this conveys what I want to say.

For reference, I attached logs from when the aforesaid timeout occurs.
The cluster has 3 nodes.
When the DC was going to be STONITH'ed, the DC sent a request to all of the
non-DC nodes, and all of them tried to shut down the DC.
Then the timeout on DC-stonithd occurred, DC-stonithd sent the same request
again, and two or more STONITH plugins ran in parallel on every non-DC node.
(Please see sysstats.txt.)

I want to make clear whether the current behavior is expected or a bug.
But I consider the root of every problem to be that the node which sends
the STONITH request and waits for completion of the op is the one being
killed.

Regards,
Satomi TANIGUCHI

>
> Thanks,
>
> Dejan
>
>> Best Regards,
>> Satomi TANIGUCHI
>
>
>> _______________________________________________________
>> Linux-HA-Dev: Linux-HA-Dev [at] lists
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
>> Home Page: http://linux-ha.org/
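
To make the timeout relationship in the message above concrete, here is a minimal sketch in Python. It is not stonithd code: the function names and the numbers are hypothetical, and it assumes (as the message implies) that a node receiving a STONITH request may, in the worst case, run each of its plugins in turn up to that plugin's own "stonith-timeout".

```python
def worst_case_duration(plugin_timeouts_s):
    """Worst-case seconds for a remote node to work through all of its
    STONITH plugins one after another (assumed sequential attempts)."""
    return sum(plugin_timeouts_s)


def may_resend(dc_wait_s, plugin_timeouts_s):
    """True if the waiting side could time out, and so send a duplicate
    request, while the remote node is still legitimately working.
    The wait must be strictly longer than the sum to be safe."""
    return dc_wait_s <= worst_case_duration(plugin_timeouts_s)


# Hypothetical values: the remote node has two plugins (30 s and 60 s),
# while the DC waits only the single cluster-wide value of 60 s.
remote_plugins = [30, 60]
print(worst_case_duration(remote_plugins))   # worst case on the remote node
print(may_resend(60, remote_plugins))        # DC wait shorter than the sum
print(may_resend(91, remote_plugins))        # DC wait longer than the sum
```

The difficulty raised in the thread is exactly that the waiting side does not have `remote_plugins` available, so it cannot compute this sum itself.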