
hasjd at cn
Nov 9, 2005, 7:17 PM
Post #51 of 55
Re: New problem(s) with heartbeat 2.0.3 and STONITH
Stefan Peinkofer wrote:
> Hello Sun Jiang Dong and Guochun Shi,
>
> On Wed, 2005-11-16 at 17:07 +0800, Sun Jiang Dong wrote:
>
>>Hi Stefan Peinkofer,
>>
>>I removed ZAPCHAN according to gshi's suggestion. Can you have a try of CVS HEAD
>>again. Thanks a lot in advance.
>>
>
> I tried the CVS HEAD (with Sun's patch applied) but nothing has changed.
> I have attached the logfile.
> Question: If I look at the end of the logfile I see that some time after
> the segfault, the messages from Sun's patch disappear. Is this normal?
> (It was not so in the prior CVS HEAD version)

No, as far as I can see, the messages are still there. The time tags are a little confusing, though. You can follow one process through the log by using its pid as the thread.

gshi, this is the result after removing all ZAPCHAN in stonithd. Have you found anything on it?

>
>>>>>>>>Im not 100 percent sure yet, but it seems, that if stonithd
>>>>>>>>segfaulted one time, and therefore no monitor operations are
>>>>>>>>carried out anymore it will not segfault anymore. So maybe the
>>>>>>>>monitor operation causes the segfault somehow???
>>>>>>>>(Just wanted to mention that, perhaps it's helpful)
>
> Note, I let heartbeat run over the night, tonight.
> After the segfault on sarek at 19:03 stonithd segfaulted not again.
> (Watched until 11:40 on the next day because then I stopped and started
> the new CVS HEAD version)
>
> Many thanks in advance.
> Stefan Peinkofer
>
>>Guochun Shi wrote:
>>
>>>Nov 8 19:03:35 sarek stonithd: [4038]: info: msg2ipcchan:1971: Will audit the ha_msg.
>>>Nov 8 19:03:35 sarek stonithd: [4038]: info: msg2ipcchan:1975: Will detect the status of the channel as an indirect checking
>>>Nov 8 19:03:35 sarek heartbeat: [4018]: WARN: Exiting /usr/lib/heartbeat/stonithd process 4038 killed by signal 11.
>>>Nov 8 19:03:35 sarek heartbeat: [4018]: ERROR: Exiting /usr/lib/heartbeat/stonithd process 4038 dumped core
>>>
>>>So the message is fine, the channel is messed up.
>>>
>>>I suspect the channel had already been destroyed when the core dump
>>>happened.
>>>
>>>-Guochun
>>>
>>>Stefan Peinkofer wrote:
>>>
>>>>Hello Sun Jiang Dong,
>>>>On Tue, 2005-11-08 at 18:23 +0800, Sun Jiang Dong wrote:
>>>>
>>>>>>>>>>>>>Anyway I think the problem you met has been fixed in CVS.
>>>>>>>>>>>>>Please have a try.
>>>>>>>>>>>>>If you still meet it, please tell me. Thanks.
>>>>>>>>>>>
>>>>>>>>>>>That was Problem 2 (cannot add field to ha_msg Error) which was
>>>>>>>>>>>fixed one or two weeks ago. What I mean is Problem 1 the
>>>>>>>>>>>stonithd coredump + not properly handled restart of the
>>>>>>>>>>>stonithd resources, after the core dump.
>>>>>>>>>>>
>>>>>>>>>>>>And, I put some more safeguards into the code which was
>>>>>>>>>>>>implicated. And, gshi fixed a somewhat-related problem.
>>>>>>>>>>>>
>>>>>>>>>>>>Could you try again from CVS(HEAD)?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>I tried one from 2005-11-2 but it still had Problem 2. I
>>>>>>>>>>>will make a new try tomorrow and report the results.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>I tried the recent CVS HEAD, and it still shows the same
>>>>>>>>>>behavior. After heartbeat had been running for some time:
>>>>>>>>>>Nov 5 11:47:58 sarek lrmd: [9297]: WARN: on_op_timeout_expired: TIMEOUT: operation monitor[22] on stonith::wti_nps::kill_spock for client 9298, its parameters: timeout=5000 ipaddr=192.168.1.204 te-target-rc=7 lrm-is-probe=true password=XXXXX crm_feature_set=1.0.3 interval=10000 ...
>>>>>>>>>>Nov 5 11:48:01 sarek crmd: [9298]: ERROR: mask(lrm.c:do_lrm_event): LRM operation (22) monitor_10000 on kill_spock Timed Out
>>>>>>>>>>...
>>>>>>>>>>Nov 5 11:48:02 sarek crmd: [9298]: info: mask(lrm.c:do_lrm_rsc_op): Performing op stop on kill_spock
>>>>>>>>>>Nov 5 11:48:02 sarek crmd: [9298]: WARN: mask(lrm.c:do_lrm_event): LRM operation (22) monitor_10000 on kill_spock Cancelled
>>>>>>>>>>...
>>>>>>>>>>Nov 5 11:48:04 sarek crmd: [9298]: info: mask(lrm.c:do_lrm_rsc_op): Performing op start on kill_spoc
>>>>>>>>>>...
>>>>>>>>>>Nov 5 11:48:20 sarek crmd: [9298]: ERROR: mask(lrm.c:do_lrm_event): LRM operation (26) start_0 on kill_spock Error: unknown error
>>>>>>>>>>..
>>>>>>>>>>Nov 5 11:48:21 sarek crmd: [9298]: info: mask(lrm.c:do_lrm_rsc_op): Performing op stop on kill_spock
>>>>>>>>>>Nov 5 11:48:21 sarek stonithd: [9296]: notice: try to stop a resource kill_spock who is not in started resource queue.
>>>>>>>>>>Nov 5 11:48:22 sarek crmd: [9298]: info: mask(lrm.c:do_update_resource): Updating kill_spock resource definitions after stop op
>>>>>>>>>>...
>>>>>>>>>>Nov 5 11:48:24 sarek heartbeat: [9261]: WARN: Exiting /usr/lib/heartbeat/stonithd process 9296 killed by signal 11.
>>>>>>>>>>Nov 5 11:48:24 sarek heartbeat: [9261]: ERROR: Exiting /usr/lib/heartbeat/stonithd process 9296 dumped core
>>>>>>>>>>Nov 5 11:48:24 sarek heartbeat: [9261]: ERROR: Client /usr/lib/heartbeat/stonithd killed by signal 11.
>>>>>>>>>>Nov 5 11:48:24 sarek heartbeat: [9261]: ERROR: Respawning client "/usr/lib/heartbeat/stonithd":
>>>>>>>>>>Nov 5 11:48:24 sarek heartbeat: [9261]: info: Starting child client "/usr/lib/heartbeat/stonithd" (0,0)
>>>>>>>>>>Nov 5 11:48:24 sarek heartbeat: [17057]: info: Starting "/usr/lib/heartbeat/stonithd" as uid 0 gid 0 (pid 17057)
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>I'm puzzled by this issue ( stonithd killed by signal 11 ) for a
>>>>>>>>>long time, because it's not reproduced on my machine.
>>>>>>>>>It's so fortunate for me that you can reproduce it stably. ;-)
>>>>>>>>>
>>>>>>>>
>>>>>>>>In fact it is killed every time I start heartbeat. Sometimes it is
>>>>>>>>killed after 4 or 5 minutes, sometimes it takes a little bit longer (1 hour)
>>>>>>>>(subjective impression is that it takes longer if the machine is
>>>>>>>>freshly rebooted)
>>>>>>>>
>>>>>>>>>I made a small patch again to the current HEAD file
>>>>>>>>>lib/clplumbing/cl_msg.c. Can you please apply it and try again?
>>>>>>>>>This should help me to locate the issue further.
>>>>>>>>>Thanks a lot in advance.
>>>>>>>>>
>>>>>>>>
>>>>>>>>OK, used the current CVS HEAD from today. I have attached the logs
>>>>>>>>of both nodes.
>>>>>>>>Im not 100 percent sure yet, but it seems, that if stonithd
>>>>>>>>segfaulted one time, and therefore no monitor operations are
>>>>>>>>carried out anymore it will not segfault anymore. So maybe the
>>>>>>>>monitor operation causes the segfault somehow???
>>>>>>>>(Just wanted to mention that, perhaps it's helpful)
>>>>>>>>
>>>>>>>
>>>>>>>Thanks so much for your help.
>>>>>>>Besides, did you apply my small patch from the attachment? I cannot
>>>>>>>see the output from the small patch.
>>>>>>>
>>>>>>>And, from the log you attached, it seems the issue this time has
>>>>>>>a different cause compared to the last one. I added several memory
>>>>>>>initializing statements in CVS. Could you please have a try again.
>>>>>>>Thanks and waiting for your result.
>>>>>>>
>>>>>>
>>>>>>Ups, I misunderstood your mail, I thought the patch was in the CVS
>>>>>>HEAD, sorry. I think I will be able to apply the patch in a few hours and
>>>>>>then mail you the logs.
>>>>>>
>>>>>
>>>>>No problem. Look forward to your result.
>>>>>
>>>>
>>>>OK, I applied the patch some hours ago and started heartbeat. Somehow,
>>>>it took much longer until stonithd segfaulted (3 hours against few
>>>>minutes).
>>>>Since the log file is pretty huge (6.6mb unzipped and 176kb bzipped) I
>>>>attached only a little part of it. If you want me to mail the full logs
>>>>directly, let me know.
>>>>
>>>>Many thanks in advance.
>>>>Stefan Peinkofer
>>>>
>>>>>>>>BTW: I would much appreciate it, if someone could get LRM (or CRM)
>>>>>>>>to restart the stonith resources reliably, in such a case. It's
>>>>>>>>maybe sufficient if the stonith resources get restarted until the
>>>>>>>>start operation succeeds. Is there somewhere a trigger in cib.xml
>>>>>>>>where I can specify, try to restart infinitely? (or at least try
>>>>>>>>it 100 times or so :)
>>>>>>>>
>>>>>>>
>>>>>>>I'll file a bug for this, but currently only for tracking the
>>>>>>>requirement.
>>>>>>>http://www.osdl.org/developer_bugzilla/show_bug.cgi?id=950
>>>>>>>
>>>>>>
>>>>>>Many thanks for that.
>>>>>>
>>>>>
>>>>>Welcome.
>>>>>
>>>>>>Stefan Peinkofer
>>>>>>
>>>>>>>>Many thanks in advance.
>>>>>>>>Stefan Peinkofer
>>>>>>>>
>>>>>>>>>>Note, I haven't attached full logs + core backtrace since they
>>>>>>>>>>look like the ones I have provided in the former mail. If you
>>>>>>>>>>want them regardless of that, let me know.
>>>>>>>>>>BTW. At least the OCF resource script IPAddr in the recent CVS
>>>>>>>>>>HEAD is "broken" (at least for my system). To get heartbeat
>>>>>>>>>>working for testing Problem 2 status, I used the ones from a CVS
>>>>>>>>>>version from 2005-11-02. I have no time today to investigate
>>>>>>>>>>further, but I think I will look at it closer tomorrow evening.
>>>>>>>>>>Many thanks in advance.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>BTW, I fixed the broken issue of the OCF IPAddr.
>>>>>>>>>
>>>>>>>>>>Stefan Peinkofer
>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>I have to Thank.
>>>>>>>>>>>Stefan Peinkofer
>>>>>>>>>>>
>>>>>>>>>>>>--
>>>>>>>>>>>> Alan Robertson <alanr [at] unix>
>>>>>>>>>>>>
>>>>>>>>>>>>"Openness is the foundation and preservative of friendship...
>>>>>>>>>>>>Let me claim from you at all times your undisguised opinions."
>>>>>>>>>>>>- William Wilberforce
>>>>>>>>>>>>_______________________________________________
>>>>>>>>>>>>Linux-HA mailing list
>>>>>>>>>>>>Linux-HA [at] lists
>>>>>>>>>>>>http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>>>>>>>>>>>See also: http://linux-ha.org/ReportingProblems
>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>--
>>>>>>>>>BRs,
>>>>>>>>>
>>>>>>>>>Sun Jiang Dong
>>>>>>>>>
>>>>>>>>>Index: cl_msg.c
>>>>>>>>>===================================================================
>>>>>>>>>RCS file: /home/cvs/linux-ha/linux-ha/lib/clplumbing/cl_msg.c,v
>>>>>>>>>retrieving revision 1.101
>>>>>>>>>diff -u -r1.101 cl_msg.c
>>>>>>>>>--- cl_msg.c 3 Nov 2005 22:28:32 -0000 1.101
>>>>>>>>>+++ cl_msg.c 7 Nov 2005 07:35:43 -0000
>>>>>>>>>@@ -1964,11 +1964,24 @@
>>>>>>>>>         return HA_FAIL;
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>+    /*
>>>>>>>>>+     * Just for debugging bug 730, will remove it after the bug is fixed.
>>>>>>>>>+     * http://www.osdl.org/developer_bugzilla/show_bug.cgi?id=730
>>>>>>>>>+     */
>>>>>>>>>+    cl_log(LOG_INFO, "%s:%d: Will audit the ha_msg.", __FUNCTION__, __LINE__);
>>>>>>>>>+    AUDITMSG(m);
>>>>>>>>>+
>>>>>>>>>+    cl_log(LOG_INFO, "%s:%d: Will detect the status of the channel as an "
>>>>>>>>>+        " indirect checking", __FUNCTION__, __LINE__);
>>>>>>>>>+    cl_log(LOG_INFO, "Channel staus: %d", ch->ops->get_chan_status(ch));
>>>>>>>>>+
>>>>>>>>>     if ((imsg = hamsg2ipcmsg(m, ch)) == NULL) {
>>>>>>>>>         cl_log(LOG_ERR, "hamsg2ipcmsg() failure");
>>>>>>>>>         return HA_FAIL;
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>+    cl_log(LOG_INFO, "%s:%d: hamsg2ipcmsg() ok.", __FUNCTION__, __LINE__);
>>>>>>>>>+
>>>>>>>>>     if (ch->ops->send(ch, imsg) != IPC_OK) {
>>>>>>>>>         if (ch->ch_status == IPC_CONNECT) {
>>>>>>>>>             snprintf(ch->failreason,MAXFAILREASON,
>>>>>>>>
>>>>>>>>------------------------------------------------------------------------

--
BRs,
Sun Jiang Dong
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
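
On the suggestion in this message to follow one process through the log by its pid: a minimal sketch of how that filtering could be done in C, assuming the log lines have the shape seen in this thread (`... stonithd: [4038]: info: ...`). The names `log_line_pid` and `filter_by_pid` are hypothetical helpers written for this illustration; they are not part of heartbeat.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the pid from the first "[<digits>]:" tag in a log line,
 * or -1 if the line has no such tag.
 * Hypothetical helper for illustration only.
 */
long log_line_pid(const char *line)
{
    const char *p = line;

    assert(line != NULL);
    while ((p = strchr(p, '[')) != NULL) {
        if (isdigit((unsigned char)p[1])) {
            char *end;
            long pid = strtol(p + 1, &end, 10);

            /* Require "]:" after the digits so "monitor[22] on" is skipped. */
            if (end[0] == ']' && end[1] == ':') {
                return pid;
            }
        }
        p++;
    }
    return -1;
}

/*
 * Copy every line of `in` whose pid tag equals `pid` to `out` and
 * return how many lines matched. Lines longer than the buffer are
 * read in pieces, so only their first piece can match.
 */
size_t filter_by_pid(FILE *in, FILE *out, long pid)
{
    char buf[4096];
    size_t matched = 0;

    while (fgets(buf, sizeof buf, in) != NULL) {
        if (log_line_pid(buf) == pid) {
            fputs(buf, out);
            matched++;
        }
    }
    return matched;
}
```

Calling `filter_by_pid(stdin, stdout, 4038)` from a small `main` and feeding it the logfile would print only the lines written by process 4038, which sidesteps the confusing timestamps when several daemons interleave their output.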