
Mailing List Archive: Linux-HA: Users

New problem(s) with heartbeat 2.0.3 and STONITH

 

 



hasjd at cn

Nov 9, 2005, 7:17 PM

Post #51 of 55 (704 views)
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

Stefan Peinkofer wrote:
> Hello Sun Jiang Dong and Guochun Shi,
>
> On Wed, 2005-11-16 at 17:07 +0800, Sun Jiang Dong wrote:
>
>>Hi Stefan Peinkofer,
>>
>>I removed ZAPCHAN according to gshi's suggestion. Can you have a try of CVS HEAD
>>again. Thanks a lots in advance.
>>
>
> I tried the CVS HEAD (with Sun's patch applied) but nothing has changed.
> I have attached the logfile.
> Question: If I look at the end of the logfile I see that some time after
> the segfault, the messages from Sun's patch disappear. Is this normal?
> (It was not so in the prior CVS HEAD version)
No, as far as I can see, the messages are still there. The timestamps are a
little confusing, though. You can follow the sequence by using the PID as the
thread.

gshi, this is the result after removing all uses of ZAPCHAN in stonithd.
Have you found anything in it?
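
For following one daemon through an interleaved log by its PID, a small shell helper is probably enough. This is only a sketch: `by_pid` is a made-up name, and it assumes the `name: [pid]:` tag format seen in the log excerpts in this thread.

```shell
# Print only the syslog lines written by one process, identified by the
# "name: [pid]:" tag that follows the hostname in the excerpts in this thread.
# Usage: by_pid <daemon> <pid> [logfile...]   (reads stdin when no file is given)
# by_pid is a hypothetical helper, not part of heartbeat.
by_pid() {
    daemon=$1
    pid=$2
    shift 2
    # -F treats the pattern as a fixed string, so the brackets are literal.
    grep -F -- "$daemon: [$pid]:" "$@"
}
```

For example, `by_pid stonithd 4038 /var/log/messages` would show only that stonithd instance. The log path here is an assumption; use whatever file heartbeat logs to on your system.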


>
>
>>>>>>>>Im not 100 percent sure yet, but it seems, that if stonithd
>>>>>>>>segfaulted one time, and therefore no monitor operations are
>>>>>>>>carried out anymore it will not segfault anymore. So maybe the
>>>>>>>>monitor operation causes the segfault somehow???
>>>>>>>>(Just wanted to mention that, perhaps it's helpful)
>
> Note, I let heartbeat run overnight.
> After the segfault on sarek at 19:03, stonithd did not segfault again.
> (I watched until 11:40 the next day, when I stopped it and started
> the new CVS HEAD version.)
>
> Many thanks in advance.
> Stefan Peinkofer
>
>
>>Guochun Shi wrote:
>>
>>>Nov 8 19:03:35 sarek stonithd: [4038]: info: msg2ipcchan:1971: Will audit the ha_msg.
>>>Nov 8 19:03:35 sarek stonithd: [4038]: info: msg2ipcchan:1975: Will detect the status of the channel as an indirect checking
>>>Nov 8 19:03:35 sarek heartbeat: [4018]: WARN: Exiting /usr/lib/heartbeat/stonithd process 4038 killed by signal 11.
>>>Nov 8 19:03:35 sarek heartbeat: [4018]: ERROR: Exiting /usr/lib/heartbeat/stonithd process 4038 dumped core
>>>
>>>
>>>So the message is fine; the channel is messed up.
>>>
>>>I suspect the channel had already been destroyed when the core dump
>>>happened.
>>>
>>>-Guochun
>>>
>>>
>>>
>>>
>>>Stefan Peinkofer wrote:
>>>
>>>
>>>>Hello Sun Jiang Dong,
>>>>On Tue, 2005-11-08 at 18:23 +0800, Sun Jiang Dong wrote:
>>>>
>>>>
>>>>
>>>>
>>>>>>>>>>>>>Anyway I think the problem you met has been fixed in CVS.
>>>>>>>>>>>>>Please have a try.
>>>>>>>>>>>>>If you still meet it, please tell me. Thanks.
>>>>>>>>>>>>>
>>>>>>>>>>>That was Problem 2 (cannot add field to ha_msg Error) which was
>>>>>>>>>>>fixed one or two weeks ago. What I mean is Problem 1 the
>>>>>>>>>>>stonithd coredump + not properly handled restart of the
>>>>>>>>>>>stonithd resources, after the core dump.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>And, I put some more safeguards into the code which was
>>>>>>>>>>>>implicated. And, gshi fixed a somewhat-related problem.
>>>>>>>>>>>>
>>>>>>>>>>>>Could you try again from CVS(HEAD)?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>I tried one from 2005-11-2 but it had still the problem 2. I
>>>>>>>>>>>will make a new try tomorrow and report the results.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>I tryed the recent CVS HEAD, and it shows still the same
>>>>>>>>>>behavior. After some time heartbeat was running:
>>>>>>>>>>Nov 5 11:47:58 sarek lrmd: [9297]: WARN: on_op_timeout_expired: TIMEOUT: operation monitor[22] on stonith::wti_nps::kill_spock for client 9298, its parameters: timeout=5000 ipaddr=192.168.1.204 te-target-rc=7 lrm-is-probe=true password=XXXXX crm_feature_set=1.0.3 interval=10000 ...
>>>>>>>>>>Nov 5 11:48:01 sarek crmd: [9298]: ERROR: mask(lrm.c:do_lrm_event): LRM operation (22) monitor_10000 on kill_spock Timed Out
>>>>>>>>>>...
>>>>>>>>>>Nov 5 11:48:02 sarek crmd: [9298]: info: mask(lrm.c:do_lrm_rsc_op): Performing op stop on kill_spock
>>>>>>>>>>Nov 5 11:48:02 sarek crmd: [9298]: WARN: mask(lrm.c:do_lrm_event): LRM operation (22) monitor_10000 on kill_spock Cancelled
>>>>>>>>>>...
>>>>>>>>>>Nov 5 11:48:04 sarek crmd: [9298]: info: mask(lrm.c:do_lrm_rsc_op): Performing op start on kill_spoc
>>>>>>>>>>...
>>>>>>>>>>Nov 5 11:48:20 sarek crmd: [9298]: ERROR: mask(lrm.c:do_lrm_event): LRM operation (26) start_0 on kill_spock Error: unknown error
>>>>>>>>>>..
>>>>>>>>>>Nov 5 11:48:21 sarek crmd: [9298]: info: mask(lrm.c:do_lrm_rsc_op): Performing op stop on kill_spock
>>>>>>>>>>Nov 5 11:48:21 sarek stonithd: [9296]: notice: try to stop a resource kill_spock who is not in started resource queue.
>>>>>>>>>>Nov 5 11:48:22 sarek crmd: [9298]: info: mask(lrm.c:do_update_resource): Updating kill_spock resource definitions after stop op
>>>>>>>>>>...
>>>>>>>>>>Nov 5 11:48:24 sarek heartbeat: [9261]: WARN: Exiting /usr/lib/heartbeat/stonithd process 9296 killed by signal 11.
>>>>>>>>>>Nov 5 11:48:24 sarek heartbeat: [9261]: ERROR: Exiting /usr/lib/heartbeat/stonithd process 9296 dumped core
>>>>>>>>>>Nov 5 11:48:24 sarek heartbeat: [9261]: ERROR: Client /usr/lib/heartbeat/stonithd killed by signal 11.
>>>>>>>>>>Nov 5 11:48:24 sarek heartbeat: [9261]: ERROR: Respawning client "/usr/lib/heartbeat/stonithd":
>>>>>>>>>>Nov 5 11:48:24 sarek heartbeat: [9261]: info: Starting child client "/usr/lib/heartbeat/stonithd" (0,0)
>>>>>>>>>>Nov 5 11:48:24 sarek heartbeat: [17057]: info: Starting "/usr/lib/heartbeat/stonithd" as uid 0 gid 0 (pid 17057)
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>I have been puzzled by this issue (stonithd killed by signal 11) for a
>>>>>>>>>long time, because it cannot be reproduced on my machine.
>>>>>>>>>I'm fortunate that you can reproduce it reliably. ;-)
>>>>>>>>>
>>>>>>>>
>>>>>>>>In fact it is killed every time I start heartbeat. Sometimes it is
>>>>>>>>killed after 4 or 5 minutes, sometimes it takes a little bit longer
>>>>>>>>(1 hour). (My subjective impression is that it takes longer if the
>>>>>>>>machine is freshly rebooted.)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>I made another small patch against the current HEAD file
>>>>>>>>>lib/clplumbing/cl_msg.c. Can you please apply it and try again?
>>>>>>>>>This should help me locate the issue more precisely.
>>>>>>>>>Thanks a lot in advance.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>OK, used the current CVS HEAD from today. I have attached the
>
> logs
>
>>>>>>>>of both nodes.
>>>>>>>>Im not 100 percent sure yet, but it seems, that if stonithd
>>>>>>>>segfaulted one time, and therefore no monitor operations are
>>>>>>>>carried out anymore it will not segfault anymore. So maybe the
>>>>>>>>monitor operation causes the segfault somehow???
>>>>>>>>(Just wanted to mention that, perhaps it's helpful)
>>>>>>>>
>>>>>>>
>>>>>>>Thanks so much for your help.
>>>>>>>Besides, do you apply my small patch as the attachment? I
>
> cannot
>
>>>>>>>see the output
>>>>>>>from the small patch.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>And, from the log you attached, it seems the issue of this time
>
> has
>
>>>>>>>a different cause comparing to the last one. I added several
>
> memory
>
>>>>>>>initializing statements in CVS. Could you please have a try
>
> again.
>
>>>>>>>Thanks and waiting for your result.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>Ups, I misunderstood your mail, I though the patch were in the
>
> CVS
>
>>>>>>HEAD,
>>>>>>sorry. I think I will be able to apply the patch in a few hours
>
> and
>
>>>>>>then
>>>>>>mail you the logs.
>>>>>>
>>>>>>
>>>>>
>>>>>No problem. Look forward to your result.
>>>>>
>>>>
>>>>OK, I applied the patch some hours ago and started heartbeat. Somehow,
>>>>it took much longer until stonithd segfaulted (3 hours versus a few
>>>>minutes).
>>>>Since the log file is pretty huge (6.6 MB unzipped and 176 KB bzipped), I
>>>>attached only a small part of it. If you want me to mail the full logs
>>>>directly, let me know.
>>>>
>>>>Many thanks in advance.
>>>>Stefan Peinkofer
>>>>
>>>>
>>>>
>>>>>>>>BTW: I would much appreciate it, if someone could get LRM (or
>
> CRM)
>
>>>>>>>>to restart the stonith resources reliably, in such a case.
>
> It's
>
>>>>>>>>maybe sufficient if the stonith resources get restarted until
>
> the
>
>>>>>>>>start operation succeeds. Is there somewhere a trigger in
>
> cib.xml
>
>>>>>>>>where I can specify, try to restart infinitely? (or at least
>
> try
>
>>>>>>>>it 100 times or so :)
>>>>>>>>
>>>>>>>
>>>>>>>I'll file a bug for this, but currently only for tracking the
>>>>>>>requirement.
>>>>>>>http://www.osdl.org/developer_bugzilla/show_bug.cgi?id=950
>>>>>>>
>>>>>>
>>>>>>Many thanks for that.
>>>>>>
>>>>>
>>>>>Welcome.
>>>>>
>>>>>
>>>>>
>>>>>>Stefan Peinkofer
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>>Many thanks in advance.
>>>>>>>>Stefan Peinkofer
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>>Note, I haven't attached full logs + core backtrace since
>
> the
>
>>>>>>>>>>look like the onesI have provided in the former mail. If you
>>>>>>>>>>want them regardles of that, let me know.
>>>>>>>>>>BTW. At least the OCF resource script IPAddr in the recent
>
> CVS
>
>>>>>>>>>>HEAD is "broken" (at least for my system). To get heartbeat
>>>>>>>>>>working for testing Problem 2 status, I used the ones from a
>
> CVS
>
>>>>>>>>>>version from 2005-11-02. I have no time today to investigate
>>>>>>>>>>further, but I think I will look at it closer towmorrow
>
> evening.
>
>>>>>>>>>>Many thanks in advance.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>BTW, I fixed the broken issue of the OCF IPAddr.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>Stefan Peinkofer
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>I have to Thank.
>>>>>>>>>>>Stefan Peinkofer
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>--
>>>>>>>>>>>> Alan Robertson <alanr [at] unix>
>>>>>>>>>>>>
>>>>>>>>>>>>"Openness is the foundation and preservative of
>
> friendship...
>
>>>>>>>>>>>>Let me claim from you at all times your undisguised
>
> opinions."
>
>>>>>>>>>>>>- William Wilberforce
>>>>>>>>>>>>_______________________________________________
>>>>>>>>>>>>Linux-HA mailing list
>>>>>>>>>>>>Linux-HA [at] lists
>>>>>>>>>>>>http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>>>>>>>>>>>See also: http://linux-ha.org/ReportingProblems
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>--
>>>>>>>>>BRs,
>>>>>>>>>
>>>>>>>>>Sun Jiang Dong
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>Index: cl_msg.c
>>>>>>>>>===================================================================
>>>>>>>>>RCS file: /home/cvs/linux-ha/linux-ha/lib/clplumbing/cl_msg.c,v
>>>>>>>>>retrieving revision 1.101
>>>>>>>>>diff -u -r1.101 cl_msg.c
>>>>>>>>>--- cl_msg.c 3 Nov 2005 22:28:32 -0000 1.101
>>>>>>>>>+++ cl_msg.c 7 Nov 2005 07:35:43 -0000
>>>>>>>>>@@ -1964,11 +1964,24 @@
>>>>>>>>> return HA_FAIL;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>+ /*
>>>>>>>>>+ * Just for debugging bug 730, will remove it after the bug is fixed.
>>>>>>>>>+ * http://www.osdl.org/developer_bugzilla/show_bug.cgi?id=730
>>>>>>>>>+ */
>>>>>>>>>+ cl_log(LOG_INFO, "%s:%d: Will audit the ha_msg.", __FUNCTION__, __LINE__);
>>>>>>>>>+ AUDITMSG(m);
>>>>>>>>>+
>>>>>>>>>+ cl_log(LOG_INFO, "%s:%d: Will detect the status of the channel as an "
>>>>>>>>>+ " indirect checking", __FUNCTION__, __LINE__);
>>>>>>>>>+ cl_log(LOG_INFO, "Channel staus: %d", ch->ops->get_chan_status(ch));
>>>>>>>>>+
>>>>>>>>> if ((imsg = hamsg2ipcmsg(m, ch)) == NULL) {
>>>>>>>>> cl_log(LOG_ERR, "hamsg2ipcmsg() failure");
>>>>>>>>> return HA_FAIL;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>+ cl_log(LOG_INFO, "%s:%d: hamsg2ipcmsg() ok.", __FUNCTION__, __LINE__);
>>>>>>>>>+
>>>>>>>>> if (ch->ops->send(ch, imsg) != IPC_OK) {
>>>>>>>>> if (ch->ch_status == IPC_CONNECT) {
>>>>>>>>> snprintf(ch->failreason,MAXFAILREASON,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>

--
BRs,

Sun Jiang Dong

_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


hasjd at cn

Nov 9, 2005, 7:27 PM

Post #52 of 55 (685 views)
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

Stefan Peinkofer wrote:
> On Wed, 2005-11-09 at 19:56 +0100, Stefan Peinkofer wrote:
>
>>On Wed, 2005-11-09 at 10:59 -0700, Alan Robertson wrote:
>>
>>>Stefan Peinkofer wrote:
>>>
>>>>Hello Sun Jiang Dong and Guochun Shi,
>>>>
>>>>I connected to the stonith devices via telnet today and they hung in
>>>>the middle of displaying the plug state. (I couldn't even ping them
>>>>anymore.) That was weird, since I had to restart the PWSWs in order to
>>>>log in again. (Maybe waiting until the network connection
>>>>timeout of the power switches ran out would have done the job too.) I hope
>>>>there was no problem with the PWSWs that caused the segfault.
>>>
> Ahhhhhh, from wti_nps.c
> * 2. We observed that on busy networks where there may be high occurances
> * of broadcasts, the NPS became unresponsive. In some
> * configurations this necessitated placing the power switch onto a
> * private subnet.
> In fact it is on a private subnet, but it may experience too many
> connections because of the 10-second interval!?
>
It looks like an issue in the stonith plugin. I don't have this type of
hardware, though, so I can only read the source code.

>>
>>OK, it's getting clearer and more weird.
>>Note: In my heartbeat config I use a monitoring interval of 10 seconds
>>and a timeout of 5 seconds for the stonith resources.
>>
>>After doing a:
>>while [ 1 ]; date; do stonith -t wti_nps ipaddr=192.168.1.204
>>password=XXXXX -S; done;
>>('Log' is attached)
>>At the beginning, the call returns within a second. After some minutes,
>>it abruptly takes about 3 to 4 seconds. If I cancel the call at this
>>stage and try to log on manually, the connection freezes as shown below.
>>(How does what the stonith plugin does differ from a manual telnet
>>login???)
>>
>>[root [at] sare log]# telnet kill-spock
>>Trying 192.168.1.204...
>>Connected to kill-spock (192.168.1.204).
>>Escape character is '^]'.
>>
>>Enter Password: *****
>>
>>Network Power Switch v3.02 Site: STONITH FOR SPOCK
>>
>>Plug | Name             | Status  | Boot Delay | Password         | Default |
>>-----+------------------+---------+------------+------------------+---------+
>> 1   | spock            | ON      | 5 sec      | (undefined)      | ON      |
>> 2   | (undefined)      | ON      | 5 sec      | (undefined)      | ON      |
>> 3   | (undefined)      | ON      | 5 sec      | (undefined)      | ON      |
>> 4   | (undefined)      | ON      | 5 sec      | (un
>>
>>(Note: sometimes it froze right at the password prompt, too.)
>>At the time it freezes, the device no longer responds to pings.
>>Note that this is reliably reproducible, but only if I abort the stonith -S
>>loop and make a manual telnet connection. The stonith -S loop seems to run
>>'forever', even though slowly.
>>If I wait until the specified network connection timeout, the stonith
>>device becomes accessible again. Unfortunately the timeout cannot be set to
>>less than 2 mins.
>>After this has occurred, the connections are fast again (for some time :)
>>
>>
>>Many thanks in advance.
>>Stefan Peinkofer
>>
>>
>>>No matter what else is true, it's a bug.
>>>
>>>
>>>
>>

--
BRs,

Sun Jiang Dong



peinkofe at fhm

Nov 11, 2005, 3:18 AM

Post #53 of 55 (692 views)
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

Hello everybody,
On Fri, Nov 11, 2005 at 06:08:31PM +0800, Sun Jiang Dong wrote:
> Stefan,
>
> The fix is already in CVS. Please have a try, and tell me the result. Thanks a
> lots in advance!
>
The stonithd segfault problem seems to be gone. But unfortunately my problem is not solved :). The stonith resources still become inactive. First the monitor op times out and then the start op returns Unknown Error. But I think that is my personal problem, because I'm pretty sure that it is caused by the weird behavior of the wti nps (I wrote some emails about this yesterday).
Currently I'm trying to reproduce the behaviour I explained yesterday by issuing stonith -S every ten seconds.
JFMY: Could it be that the stonithd segfault was caused by a chain of events which was triggered by the monitoring timeout or the start error?
I will try to increase the monitoring interval and/or the timeout for the stonith resources. If this doesn't help I may be forced to look for another stonith device :(
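
To see whether a single status query can exceed a given limit outside of heartbeat, something like the following could be used. It is a sketch, not what lrmd does internally: `check_probe` is a hypothetical helper, and it assumes GNU coreutils `timeout` is installed.

```shell
# Classify one status probe as ok, failed, or timed out.
# Usage: check_probe <timeout_seconds> <command> [args...]
# Assumption: GNU coreutils `timeout` is available; it exits with 124
# when the time limit is reached. check_probe is a made-up name.
check_probe() {
    limit=$1
    shift
    rc=0
    timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
    case $rc in
        0)   echo "ok" ;;
        124) echo "timed out after ${limit}s" ;;
        *)   echo "failed (rc=$rc)" ;;
    esac
}
```

For example, `check_probe 5 stonith -t wti_nps ipaddr=192.168.1.204 password=XXXXX -S` mirrors the 5-second monitor timeout mentioned in this thread, and raising the first argument shows whether a longer timeout would have let the query finish.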

Anyway, I'd like to thank everyone who helped to fix the stonithd segfault problem. I will let you know whether increasing the interval/timeout values helps.

Best Regards.
Stefan Peinkofer
> Guochun Shi wrote:
> > Stefan,
> >
> > We have found the cause of the problem. Sunjd will soon commit a fix for
> > that.
> >
> > By the way: ping node are useless right now if you configure "crm on",
> > so you probably want to remove those in ha.cf
> >
> > You can close the account for me if you find sunjd's soon-in-CVS patch
> > fixes your problem. Thanks for letting me debug
> > in your machines.
> >
> > have a nice day
> > -Guochun
> >
> >
> >
> >
> >
>
> --
> BRs,
>
> Sun Jiang Dong
>


gshi at ncsa

Nov 11, 2005, 12:27 PM

Post #54 of 55 (682 views)
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

peinkofe [at] fhm wrote:

>Hello everybody,
>On Fri, Nov 11, 2005 at 06:08:31PM +0800, Sun Jiang Dong wrote:
>
>
>>Stefan,
>>
>>The fix is already in CVS. Please have a try, and tell me the result. Thanks a
>>lots in advance!
>>
>>
>>
>The stonithd segfault problem seems to be gone. But unfortunately my problem is not solved :). The stonith resources still become inactive. First the monitor op times out and then the start op returns Unknown Error. But I think that is my personal problem, because I'm pretty sure that it is caused by the weird behavior of the wti nps (I wrote some emails about this yesterday).
>Currently I'm trying to reproduce the behaviour I explained yesterday by issuing stonith -S every ten seconds.
>JFMY: Could it be that the stonithd segfault was caused by a chain of events which was triggered by the monitoring timeout or the start error?
>I will try to increase the monitoring interval and/or the timeout for the stonith resources. If this doesn't help I may be forced to look for another stonith device :(
>
>
I saw some error messages in the log. They are resource-related.
Unfortunately I am not familiar with that area and cannot fix them.
Don't ignore those error messages. They probably indicate that something
is wrong. You can post them, and someone on the list may be able to help
you.

-Guochun


peinkofe at fhm

Nov 14, 2005, 10:07 PM

Post #55 of 55 (702 views)
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

Hello everybody,
On Thu, 2005-11-10 at 11:27 +0800, Sun Jiang Dong wrote:
>
> Stefan Peinkofer wrote:
> > On Wed, 2005-11-09 at 19:56 +0100, Stefan Peinkofer wrote:
> >
> >>On Wed, 2005-11-09 at 10:59 -0700, Alan Robertson wrote:
> >>
> >>>Stefan Peinkofer wrote:
> >>>
> >>>>Hello Sun Jiang Dong and Guochun Shi,
> >>>>
> >>>>I connected to the stonith devices via telnet today and they got hung in
> >>>>the middle of displaying the plug state. (I could even ping them
> >>>>anymore) That was weird since I had to restart the PWSW's in order to
> >>>>login again. (Maybe waiting some time until the network connection
> >>>>timeout of the power switches runs out had done the job too) I hope
> >>>>there was no problem with the PWSW's that caused the segfault.
> >>>
> > Ahhhhhh, from wti_nps.c
> > * 2. We observed that on busy networks where there may be high occurances
> > * of broadcasts, the NPS became unresponsive. In some
> > * configurations this necessitated placing the power switch onto a
> > * private subnet.
> > In fact it is on a private subnet, but it may experience too many
> > connections because of the 10-second interval!?
> >
> It looks like a issue from stonith plugin. Anyway, I have no this type of
> hardware. I can only read the source code then.
>
Hmm, but what could the stonith plugin do wrong? I have done some
testing with the
while [ 1 ]; date; do stonith -t wti_nps ipaddr=192.168.1.204
password=XXXXX -S; done;
statement. It turned out that after some time, the following occurred:
Mon Nov 14 11:47:46 CET 2005
** INFO: Successful login to WTI Network Power Switch.
stonith: wti_nps device OK.
Mon Nov 14 11:47:50 CET 2005
Mon Nov 14 11:50:50 CET 2005

** (process:25456): CRITICAL **: Did not find string Power Switch from
WTI Network Power Switch.
connect() failed: Connection refused
connect() failed: Connection refused
connect() failed: Connection refused
connect() failed: Connection refused
connect() failed: Connection refused
connect() failed: Connection refused
connect() failed: Connection refused
connect() failed: Connection refused
connect() failed: Connection refused
connect() failed: Connection refused
connect() failed: Connection refused
connect() failed: Connection refused
connect() failed: Connection refused
connect() failed: Connection refused
connect() failed: Connection refused
connect() failed: Connection refused
connect() failed: Connection refused
connect() failed: Connection refused
connect() failed: Connection refused

** (process:25456): CRITICAL **: Cannot log into WTI Network Power
Switch.
stonith: wti_nps device not accessible.
Mon Nov 14 11:51:17 CET 2005

And this occurred even when I ran stonith -S only every 4 minutes (the
configured network connection timeout of the stonith device is 2 minutes).
So I think the wti nps device is not ready for heartbeat 2 monitoring.
I don't know if increasing the interval even more would help. I
personally think that it would then just take longer until the error occurs.
So I will probably look for another stonith device.
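
A variant of the test loop that also records the exit status and duration of each call would make the slowdown and the failures easier to spot in the output. This is a minimal sketch: `probe_loop` is a hypothetical name, it assumes `date +%s` is supported (GNU and BSD date both support it), and the timing is only accurate to whole seconds.

```shell
# Run a status command repeatedly and report exit status and duration.
# Usage: probe_loop <count> <pause_seconds> <command> [args...]
# probe_loop is a made-up helper; `date +%s` (epoch seconds) is assumed.
probe_loop() {
    count=$1
    pause=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        start=$(date +%s)
        rc=0
        "$@" >/dev/null 2>&1 || rc=$?
        end=$(date +%s)
        echo "$(date) rc=$rc elapsed=$((end - start))s"
        i=$((i + 1))
        # Pause between probes, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$pause"
        fi
    done
}
```

For example, `probe_loop 100 240 stonith -t wti_nps ipaddr=192.168.1.204 password=XXXXX -S` would run the status query 100 times, four minutes apart, printing one line per call.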

Best Regards.
Stefan Peinkofer
Attachments: signature.asc (0.18 KB)
