
peinkofe at fhm
Nov 9, 2005, 7:44 AM
Post #45 of 55
Re: New problem(s) with heartbeat 2.0.3 and STONITH
Hello Sun Jiang Dong and Guochun Shi,

On Wed, 2005-11-16 at 17:07 +0800, Sun Jiang Dong wrote:
> Hi Stefan Peinkofer,
>
> I removed ZAPCHAN according to gshi's suggestion. Can you have a try of CVS HEAD
> again. Thanks a lot in advance.
>

I tried the CVS HEAD (with Sun's patch applied), but nothing has changed. I have attached the logfile.

Question: If I look at the end of the logfile, I see that some time after the segfault the messages from Sun's patch disappear. Is this normal? (It was not so in the prior CVS HEAD version.)

> >>>>>> I'm not 100 percent sure yet, but it seems that if stonithd
> >>>>>> segfaulted one time, and therefore no monitor operations are
> >>>>>> carried out anymore, it will not segfault anymore. So maybe the
> >>>>>> monitor operation causes the segfault somehow???
> >>>>>> (Just wanted to mention that, perhaps it's helpful)

Note: I let heartbeat run overnight. After the segfault on sarek at 19:03, stonithd did not segfault again. (I watched until 11:40 the next day, when I stopped heartbeat and started the new CVS HEAD version.)

Many thanks in advance.
Stefan Peinkofer

> Guochun Shi wrote:
> > Nov 8 19:03:35 sarek stonithd: [4038]: info: msg2ipcchan:1971: Will
> > audit the ha_msg.
> > Nov 8 19:03:35 sarek stonithd: [4038]: info: msg2ipcchan:1975: Will
> > detect the status of the channel as an indirect checking
> > Nov 8 19:03:35 sarek heartbeat: [4018]: WARN: Exiting
> > /usr/lib/heartbeat/stonithd process 4038 killed by signal 11.
> > Nov 8 19:03:35 sarek heartbeat: [4018]: ERROR: Exiting
> > /usr/lib/heartbeat/stonithd process 4038 dumped core
> >
> > So the message is fine, the channel is messed up.
> >
> > I suspect the channel had already been destroyed when the core dump
> > happened.
> >
> > -Guochun
> >
> > Stefan Peinkofer wrote:
> >
> >> Hello Sun Jiang Dong,
> >> On Tue, 2005-11-08 at 18:23 +0800, Sun Jiang Dong wrote:
> >>
> >>>>>>>>>>> Anyway I think the problem you met has been fixed in CVS.
> >>>>>>>>>>> Please have a try.
> >>>>>>>>>>> If you still meet it, please tell me. Thanks.
> >>>>>>>>>
> >>>>>>>>> That was Problem 2 (cannot add field to ha_msg Error), which was
> >>>>>>>>> fixed one or two weeks ago. What I mean is Problem 1: the
> >>>>>>>>> stonithd coredump + not properly handled restart of the
> >>>>>>>>> stonithd resources after the core dump.
> >>>>>>>>>
> >>>>>>>>>> And, I put some more safeguards into the code which was
> >>>>>>>>>> implicated. And, gshi fixed a somewhat-related problem.
> >>>>>>>>>>
> >>>>>>>>>> Could you try again from CVS(HEAD)?
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I tried one from 2005-11-2 but it still had Problem 2. I
> >>>>>>>>> will make a new try tomorrow and report the results.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> I tried the recent CVS HEAD, and it still shows the same
> >>>>>>>> behavior. After heartbeat had been running for some time:
> >>>>>>>> Nov 5 11:47:58 sarek lrmd: [9297]: WARN: on_op_timeout_expired:
> >>>>>>>> TIMEOUT: operation monitor[22] on stonith::wti_nps::kill_spock
> >>>>>>>> for client 9298, its parameters: timeout=5000
> >>>>>>>> ipaddr=192.168.1.204 te-target-rc=7 lrm-is-probe=true
> >>>>>>>> password=XXXXX crm_feature_set=1.0.3 interval=10000 ...
> >>>>>>>> Nov 5 11:48:01 sarek crmd: [9298]: ERROR:
> >>>>>>>> mask(lrm.c:do_lrm_event): LRM operation (22) monitor_10000 on
> >>>>>>>> kill_spock Timed Out
> >>>>>>>> ...
> >>>>>>>> Nov 5 11:48:02 sarek crmd: [9298]: info:
> >>>>>>>> mask(lrm.c:do_lrm_rsc_op): Performing op stop on kill_spock
> >>>>>>>> Nov 5 11:48:02 sarek crmd: [9298]: WARN:
> >>>>>>>> mask(lrm.c:do_lrm_event): LRM operation (22) monitor_10000 on
> >>>>>>>> kill_spock Cancelled
> >>>>>>>> ...
> >>>>>>>> Nov 5 11:48:04 sarek crmd: [9298]: info:
> >>>>>>>> mask(lrm.c:do_lrm_rsc_op): Performing op start on kill_spoc
> >>>>>>>> ...
> >>>>>>>> Nov 5 11:48:20 sarek crmd: [9298]: ERROR:
> >>>>>>>> mask(lrm.c:do_lrm_event): LRM operation (26) start_0 on
> >>>>>>>> kill_spock Error: unknown error
> >>>>>>>> ..
> >>>>>>>> Nov 5 11:48:21 sarek crmd: [9298]: info:
> >>>>>>>> mask(lrm.c:do_lrm_rsc_op): Performing op stop on kill_spock
> >>>>>>>> Nov 5 11:48:21 sarek stonithd: [9296]: notice: try to stop a
> >>>>>>>> resource kill_spock who is not in started resource queue.
> >>>>>>>> Nov 5 11:48:22 sarek crmd: [9298]: info:
> >>>>>>>> mask(lrm.c:do_update_resource): Updating kill_spock resource
> >>>>>>>> definitions after stop op
> >>>>>>>> ...
> >>>>>>>> Nov 5 11:48:24 sarek heartbeat: [9261]: WARN: Exiting
> >>>>>>>> /usr/lib/heartbeat/stonithd process 9296 killed by signal 11.
> >>>>>>>> Nov 5 11:48:24 sarek heartbeat: [9261]: ERROR: Exiting
> >>>>>>>> /usr/lib/heartbeat/stonithd process 9296 dumped core
> >>>>>>>> Nov 5 11:48:24 sarek heartbeat: [9261]: ERROR: Client
> >>>>>>>> /usr/lib/heartbeat/stonithd killed by signal 11.
> >>>>>>>> Nov 5 11:48:24 sarek heartbeat: [9261]: ERROR: Respawning
> >>>>>>>> client "/usr/lib/heartbeat/stonithd":
> >>>>>>>> Nov 5 11:48:24 sarek heartbeat: [9261]: info: Starting child
> >>>>>>>> client "/usr/lib/heartbeat/stonithd" (0,0)
> >>>>>>>> Nov 5 11:48:24 sarek heartbeat: [17057]: info: Starting
> >>>>>>>> "/usr/lib/heartbeat/stonithd" as uid 0 gid 0 (pid 17057)
> >>>>>>>>
> >>>>>>>
> >>>>>>> I've been puzzled by this issue (stonithd killed by signal 11) for a
> >>>>>>> long time, because it cannot be reproduced on my machine.
> >>>>>>> It's very fortunate for me that you can reproduce it reliably. ;-)
> >>>>>>>
> >>>>>>
> >>>>>> In fact it is killed every time I start heartbeat.
> >>>>>> Sometimes it is killed after 4 or 5 minutes, sometimes it takes
> >>>>>> a little bit longer (1 hour).
> >>>>>> (My subjective impression is that it takes longer if the machine
> >>>>>> is freshly rebooted.)
> >>>>>>
> >>>>>>> I made a small patch again to the current HEAD file
> >>>>>>> lib/clplumbing/cl_msg.c. Can you please apply it and try again?
> >>>>>>> This should help me locate the issue further.
> >>>>>>> Thanks a lot in advance.
> >>>>>>>
> >>>>>>
> >>>>>> OK, used the current CVS HEAD from today. I have attached the logs
> >>>>>> of both nodes.
> >>>>>> I'm not 100 percent sure yet, but it seems that if stonithd
> >>>>>> segfaulted one time, and therefore no monitor operations are
> >>>>>> carried out anymore, it will not segfault anymore. So maybe the
> >>>>>> monitor operation causes the segfault somehow???
> >>>>>> (Just wanted to mention that, perhaps it's helpful)
> >>>>>>
> >>>>>
> >>>>> Thanks so much for your help.
> >>>>> Besides, did you apply my small patch from the attachment? I cannot
> >>>>> see the output
> >>>>> from the small patch.
> >>>>
> >>>>> And, from the log you attached, it seems the issue this time has
> >>>>> a different cause compared to the last one. I added several memory
> >>>>> initializing statements in CVS. Could you please have a try again.
> >>>>> Thanks and waiting for your result.
> >>>>>
> >>>>
> >>>> Oops, I misunderstood your mail, I thought the patch was in the CVS
> >>>> HEAD,
> >>>> sorry. I think I will be able to apply the patch in a few hours and
> >>>> then
> >>>> mail you the logs.
> >>>>
> >>>
> >>> No problem. Look forward to your result.
> >>>
> >>
> >> OK, I applied the patch some hours ago and started heartbeat. Somehow,
> >> it took much longer until stonithd segfaulted (3 hours versus a few
> >> minutes).
> >> Since the log file is pretty huge (6.6 MB unzipped and 176 KB bzipped) I
> >> attached only a little part of it.
> >> If you want me to mail the full logs
> >> directly, let me know.
> >>
> >> Many thanks in advance.
> >> Stefan Peinkofer
> >>
> >>>>>> BTW: I would much appreciate it if someone could get LRM (or CRM)
> >>>>>> to restart the stonith resources reliably in such a case. It's
> >>>>>> maybe sufficient if the stonith resources get restarted until the
> >>>>>> start operation succeeds. Is there a trigger somewhere in cib.xml
> >>>>>> where I can specify: try to restart infinitely? (or at least try
> >>>>>> it 100 times or so :)
> >>>>>>
> >>>>>
> >>>>> I'll file a bug for this, but currently only for tracking the
> >>>>> requirement.
> >>>>> http://www.osdl.org/developer_bugzilla/show_bug.cgi?id=950
> >>>>>
> >>>>
> >>>> Many thanks for that.
> >>>>
> >>>
> >>> Welcome.
> >>>
> >>>> Stefan Peinkofer
> >>>>
> >>>>>> Many thanks in advance.
> >>>>>> Stefan Peinkofer
> >>>>>>
> >>>>>>>> Note, I haven't attached full logs + core backtrace since they
> >>>>>>>> look like the ones I provided in the former mail. If you
> >>>>>>>> want them regardless of that, let me know.
> >>>>>>>> BTW. At least the OCF resource script IPAddr in the recent CVS
> >>>>>>>> HEAD is "broken" (at least for my system). To get heartbeat
> >>>>>>>> working for testing Problem 2 status, I used the ones from a CVS
> >>>>>>>> version from 2005-11-02. I have no time today to investigate
> >>>>>>>> further, but I think I will look at it closer tomorrow evening.
> >>>>>>>> Many thanks in advance.
> >>>>>>>>
> >>>>>>>
> >>>>>>> BTW, I fixed the broken issue of the OCF IPAddr.
> >>>>>>>
> >>>>>>>> Stefan Peinkofer
> >>>>>>>>
> >>>>>>>>>> Thanks!
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I have to Thank.
> >>>>>>>>> Stefan Peinkofer
> >>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Alan Robertson <alanr [at] unix>
> >>>>>>>>>>
> >>>>>>>>>> "Openness is the foundation and preservative of friendship...
> >>>>>>>>>> Let me claim from you at all times your undisguised opinions."
> >>>>>>>>>> - William Wilberforce
> >>>>>>>>>> _______________________________________________
> >>>>>>>>>> Linux-HA mailing list
> >>>>>>>>>> Linux-HA [at] lists
> >>>>>>>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> >>>>>>>>>> See also: http://linux-ha.org/ReportingProblems
> >>>>>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> BRs,
> >>>>>>>
> >>>>>>> Sun Jiang Dong
> >>>>>>>
> >>>>>>
> >>>>>>> Index: cl_msg.c
> >>>>>>> ===================================================================
> >>>>>>> RCS file: /home/cvs/linux-ha/linux-ha/lib/clplumbing/cl_msg.c,v
> >>>>>>> retrieving revision 1.101
> >>>>>>> diff -u -r1.101 cl_msg.c
> >>>>>>> --- cl_msg.c 3 Nov 2005 22:28:32 -0000 1.101
> >>>>>>> +++ cl_msg.c 7 Nov 2005 07:35:43 -0000
> >>>>>>> @@ -1964,11 +1964,24 @@
> >>>>>>> return HA_FAIL;
> >>>>>>> }
> >>>>>>>
> >>>>>>> + /*
> >>>>>>> + * Just for debugging bug 730, will remove it after
> >>>>>>> the bug is fixed.
> >>>>>>> + * http://www.osdl.org/developer_bugzilla/show_bug.cgi?id=730
> >>>>>>> + */
> >>>>>>> + cl_log(LOG_INFO, "%s:%d: Will audit the ha_msg.",
> >>>>>>> __FUNCTION__, __LINE__);
> >>>>>>> + AUDITMSG(m);
> >>>>>>> +
> >>>>>>> + cl_log(LOG_INFO, "%s:%d: Will detect the status of the
> >>>>>>> channel as an "
> >>>>>>> + " indirect checking", __FUNCTION__, __LINE__);
> >>>>>>> + cl_log(LOG_INFO, "Channel staus: %d",
> >>>>>>> ch->ops->get_chan_status(ch));
> >>>>>>> +
> >>>>>>> if ((imsg = hamsg2ipcmsg(m, ch)) == NULL) {
> >>>>>>> cl_log(LOG_ERR, "hamsg2ipcmsg() failure");
> >>>>>>> return HA_FAIL;
> >>>>>>> }
> >>>>>>>
> >>>>>>> + cl_log(LOG_INFO, "%s:%d: hamsg2ipcmsg() ok.", __FUNCTION__,
> >>>>>>> __LINE__);
> >>>>>>> +
> >>>>>>> if (ch->ops->send(ch, imsg) != IPC_OK) {
> >>>>>>> if (ch->ch_status == IPC_CONNECT) {
> >>>>>>> snprintf(ch->failreason,MAXFAILREASON,
> >>>>>>