
peinkofe at fhm
Oct 31, 2005, 1:58 AM
Post #18 of 55
Re: New problem(s) with heartbeat 2.0.3 and STONITH
Hello everybody,

On Sun, Oct 30, 2005 at 07:29:31PM +0100, peinkofe [at] fhm wrote:
> Hello everybody,
>
> On Fri, Oct 28, 2005 at 11:16:28PM -0600, Alan Robertson wrote:
> > Peter Kruse wrote:
> > > Hello,
> > >
> > > Alan Robertson wrote:
> > >
> > >> My guess is that op->node_name or op->optype is NULL. The code should
> > >> have validated those. Since they're critical, and they come from
> > >> who-knows-where (meaning some doofus user process), they should
> > >> definitely have been error checked, and there should be a clear
> > >> message about their errors.
> > >>
> > > I'm sorry, but I don't understand any of this. Does that mean you know the
> > > cause of this error, or just that the error message has no meaning?
> >
> > It means I was reading the code, and got a clue from it, and was in
> > effect hinting to the author of that code to look at it in more detail.
> >
> > From emails that were sent, it appears that he got the hint and looked
> > at it. From looking at the CVS logs, it looks like a patch was checked
> > in for this problem.
> >
> Yes, I just tried the current CVS version and it works. (Problem 2, the
> "cannot add field to ha_msg" error, is gone, and Problem 1 seems to be
> solved as well.)
>

Seems that I was a little bit too optimistic. Problem 1 isn't solved yet.
In fact, it worked once and failed many times.

In the case that worked, a timeout of the monitor op was detected:

Oct 30 19:01:46 spock lrmd: [4468]: WARN: on_op_timeout_expired: TIMEOUT: operation monitor[15] on stonith::wti_nps::kill_sarek for client 4469, its parameters: timeout=5000 ipaddr=192.168.1.205 te-target-rc=7 lrm-is-probe=true password=XXXXXXX crm_feature_set=1.0.3 interval=10000 .
Oct 30 19:01:51 spock crmd: [4469]: ERROR: mask(lrm.c:do_lrm_event): LRM operation (15) monitor_10000 on kill_sarek Timed Out

Then it said that stonithd was killed by signal 11 and respawned it:

Oct 30 19:01:55 spock heartbeat: [4447]: ERROR: Exiting /usr/lib/heartbeat/stonithd process 4467 killed by signal 11.
Oct 30 19:01:55 spock heartbeat: [4447]: ERROR: Exiting /usr/lib/heartbeat/stonithd process 4467 dumped core
Oct 30 19:01:55 spock heartbeat: [4447]: ERROR: Client /usr/lib/heartbeat/stonithd killed by signal 11.
Oct 30 19:01:55 spock heartbeat: [4447]: ERROR: Respawning client "/usr/lib/heartbeat/stonithd":
Oct 30 19:01:55 spock heartbeat: [4447]: info: Starting child client "/usr/lib/heartbeat/stonithd" (0,0)
Oct 30 19:01:55 spock heartbeat: [11922]: info: Starting "/usr/lib/heartbeat/stonithd" as uid 0 gid 0 (pid 11922)

Then it said that it wanted to start the stonith resource again:

Oct 30 19:01:59 spock crmd: [4469]: info: mask(lrm.c:do_lrm_rsc_op): Performing op start on kill_sarek

And the resource was active again.

In the next case it didn't work. Again it noticed the monitor op timeout:

Oct 30 19:31:50 spock lrmd: [4468]: WARN: on_op_timeout_expired: TIMEOUT: operation monitor[27] on stonith::wti_nps::kill_sarek for client 4469, its parameters: timeout=5000 ipaddr=192.168.1.205 te-target-rc=7 lrm-is-probe=true password=XXXXXXX crm_feature_set=1.0.3 interval=10000 .
Oct 30 19:31:58 spock crmd: [4469]: ERROR: mask(lrm.c:do_lrm_event): LRM operation (27) monitor_10000 on kill_sarek Timed Out

Then it tried to perform an op start on the stonith resource:

Oct 30 19:32:01 spock crmd: [4469]: info: mask(lrm.c:do_lrm_rsc_op): Performing op start on kill_sarek

which failed:

Oct 30 19:32:12 spock crmd: [4469]: ERROR: mask(lrm.c:do_lrm_event): LRM operation (30) start_0 on kill_sarek Error: unknown error

Only after that did it notice the death of stonithd and respawn it:

Oct 30 19:32:16 spock heartbeat: [4447]: ERROR: Exiting /usr/lib/heartbeat/stonithd process 11922 killed by signal 11.
Oct 30 19:32:16 spock heartbeat: [4447]: ERROR: Exiting /usr/lib/heartbeat/stonithd process 11922 dumped core
Oct 30 19:32:16 spock heartbeat: [4447]: ERROR: Client /usr/lib/heartbeat/stonithd killed by signal 11.
Oct 30 19:32:16 spock heartbeat: [4447]: ERROR: Respawning client "/usr/lib/heartbeat/stonithd":

I have attached a file which contains all the "stonithd killed by signal 11"
cases that occurred from yesterday to today. The last case is especially
interesting: there it says something about STONITH_RA_EXEC: cannot sign on
the stonithd, which occurred only in this case.

By the way, pengine reports something about a memory leak.

Many thanks in advance.
Stefan Peinkofer

> Many thanks to all who fixed the problems.
>
> Best regards,
> Stefan Peinkofer
> > Exactly what the cause was (from your perspective), I'm not sure.
> >
> > --
> > Alan Robertson <alanr [at] unix>
> >
> > "Openness is the foundation and preservative of friendship... Let me
> > claim from you at all times your undisguised opinions." - William
> > Wilberforce
>
> > X-Account-Key: account1
> > Return-Path: <linux-ha-cvs-bounces [at] lists>
> > Delivered-To: spamcop-net-alanr [at] spamcop
> > Received: (qmail 23876 invoked from network); 29 Oct 2005 03:31:33 -0000
> > X-Spam-Checker-Version: SpamAssassin 3.1.0 (2005-09-13) on blade4
> > X-Spam-Level:
> > X-Spam-Status: hits=0.6 tests=AWL,NO_REAL_NAME version=3.1.0
> > Received: from unknown (192.168.1.103)
> >     by blade4.cesmail.net with QMQP; 29 Oct 2005 03:31:33 -0000
> > Received: from mail.maclawran.ca (HELO demo.bb4.com) (65.39.147.83)
> >     by mx53.cesmail.net with SMTP; 29 Oct 2005 03:31:33 -0000
> > Received: from new.community.tummy.com (postfix [at] newcommunity
> >     [198.49.126.209])
> >     by demo.bb4.com (8.13.0/8.13.0) with ESMTP id j9T3QTte074372;
> >     Fri, 28 Oct 2005 23:26:29 -0400 (EDT)
> > Received: from newcommunity.tummy.com (localhost [127.0.0.1])
> >     by new.community.tummy.com (Postfix) with ESMTP id D5B1F20347C2;
> >     Fri, 28 Oct 2005 21:31:31 -0600 (MDT)
> > X-Original-To: linux-ha-cvs [at] lists
> > Delivered-To: mailman+post-linux-ha-cvs [at] newcommunity
> > Received: by new.community.tummy.com (Postfix, from userid 547)
> >     id 5DD442034025; Fri, 28 Oct 2005 21:31:30 -0600 (MDT)
> > To: linux-ha-cvs [at] lists
> > Message-Id: <20051029033130.5DD442034025 [at] new>
> > Date: Fri, 28 Oct 2005 21:31:30 -0600 (MDT)
> > From: linux-ha-cvs [at] lists
> > Subject: [Linux-ha-cvs] Linux-HA CVS: lib by sunjd from
> > X-BeenThere: linux-ha-cvs [at] lists
> > X-Mailman-Version: 2.1.5
> > Precedence: list
> > Reply-To: linux-ha-dev [at] lists
> > List-Id: Linux-HA CVS commits <linux-ha-cvs.lists.linux-ha.org>
> > List-Unsubscribe: <http://lists.community.tummy.com/mailman/listinfo/linux-ha-cvs>,
> >     <mailto:linux-ha-cvs-request [at] lists?subject=unsubscribe>
> > List-Post: <mailto:linux-ha-cvs [at] lists>
> > List-Help: <mailto:linux-ha-cvs-request [at] lists?subject=help>
> > List-Subscribe: <http://lists.community.tummy.com/mailman/listinfo/linux-ha-cvs>,
> >     <mailto:linux-ha-cvs-request [at] lists?subject=subscribe>
> > Sender: linux-ha-cvs-bounces [at] lists
> > Errors-To: linux-ha-cvs-bounces [at] lists
> > X-DCC-Misty-Metrics: demo.bb4.com 1170; Body=2 Fuz1=2 Fuz2=2
> > X-SpamCop-Checked: 192.168.1.103 65.39.147.83 198.49.126.209 127.0.0.1
> >
> > linux-ha CVS committal
> >
> > Author  : sunjd
> > Host    :
> > Project : linux-ha
> > Module  : lib
> >
> > Dir     : linux-ha/lib/fencing
> >
> >
> > Modified Files:
> >     stonithd_lib.c
> >
> >
> > Log Message:
> > permit private_data be null
> > ===================================================================
> > RCS file: /home/cvs/linux-ha/linux-ha/lib/fencing/stonithd_lib.c,v
> > retrieving revision 1.18
> > retrieving revision 1.19
> > diff -u -3 -r1.18 -r1.19
> > --- stonithd_lib.c 24 Oct 2005 14:57:44 -0000 1.18
> > +++ stonithd_lib.c 29 Oct 2005 03:31:29 -0000 1.19
> > @@ -283,8 +283,6 @@
> >       ||(ha_msg_add(request, F_STONITHD_NODE, op->node_name ) != HA_OK)
> >       ||(op->node_uuid == NULL
> >           || ha_msg_add(request, F_STONITHD_NODE_UUID, op->node_uuid) != HA_OK)
> > -     ||(op->private_data == NULL
> > -         || ha_msg_add(request, F_STONITHD_PDATA, op->private_data) != HA_OK)
> >       ||(ha_msg_add_int(request, F_STONITHD_TIMEOUT, op->timeout)
> >           != HA_OK) ) {
> >           stdlib_log(LOG_ERR, "stonithd_node_fence: "
> > @@ -292,6 +290,14 @@
> >           ZAPMSG(request);
> >           return ST_FAIL;
> >       }
> > +     if (op->private_data != NULL) {
> > +         if ( ha_msg_add(request, F_STONITHD_PDATA, op->private_data) != HA_OK) {
> > +             stdlib_log(LOG_ERR, "stonithd_node_fence: "
> > +                 "Failed to add F_STONITHD_PDATA field to ha_msg.");
> > +             ZAPMSG(request);
> > +             return ST_FAIL;
> > +         }
> > +     }
> >
> >       /* Send the stonith request message */
> >       if (msg2ipcchan(request, chan) != HA_OK) {
> >
> >
> > _______________________________________________
> > Linux-ha-cvs mailing list
> > Linux-ha-cvs [at] lists
> > http://lists.community.tummy.com/mailman/listinfo/linux-ha-cvs
> >
> >
> > _______________________________________________
> > Linux-HA mailing list
> > Linux-HA [at] lists
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> _______________________________________________
> Linux-HA mailing list
> Linux-HA [at] lists
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
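
For anyone reading the quoted patch without the heartbeat sources at hand, here is a minimal sketch in C of the pattern it seems to apply: required fields must be present and are added together, while an optional field is added only when it is set. Everything in it (toy_msg, toy_op, toy_msg_add, toy_msg_get, build_fence_request, the field names) is a hypothetical stand-in and not the real ha_msg or stonithd API, and it says nothing about the cause of the signal 11 crashes described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; NOT the heartbeat ha_msg API. */
#define MAX_FIELDS 8
#define MSG_OK     0
#define MSG_FAIL   (-1)

struct toy_msg {
    const char *names[MAX_FIELDS];
    const char *values[MAX_FIELDS];
    int nfields;
};

struct toy_op {
    const char *node_name;    /* required */
    const char *node_uuid;    /* required */
    const char *private_data; /* optional, may be NULL */
};

/* Append one name/value pair; reject NULL inputs and overflow. */
static int toy_msg_add(struct toy_msg *m, const char *name, const char *value)
{
    assert(m != NULL);
    if (name == NULL || value == NULL || m->nfields >= MAX_FIELDS)
        return MSG_FAIL;
    m->names[m->nfields] = name;
    m->values[m->nfields] = value;
    m->nfields++;
    return MSG_OK;
}

/* Look up a field by name; returns NULL when the field is absent. */
static const char *toy_msg_get(const struct toy_msg *m, const char *name)
{
    for (int i = 0; i < m->nfields; i++)
        if (strcmp(m->names[i], name) == 0)
            return m->values[i];
    return NULL;
}

/* Required fields must be present; the optional one is added only if set. */
static int build_fence_request(const struct toy_op *op, struct toy_msg *m)
{
    if (op == NULL || m == NULL)
        return MSG_FAIL;
    m->nfields = 0;
    if (toy_msg_add(m, "node", op->node_name) != MSG_OK
        || toy_msg_add(m, "node_uuid", op->node_uuid) != MSG_OK)
        return MSG_FAIL;
    if (op->private_data != NULL
        && toy_msg_add(m, "pdata", op->private_data) != MSG_OK)
        return MSG_FAIL;
    return MSG_OK;
}
```

With this shape, a request whose private_data is NULL is still built (it just carries no "pdata" field), whereas a NULL node_name or node_uuid makes the whole request fail.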