
Mailing List Archive: Linux-HA: Dev

Heartbeat process failure and log message

 

 



okada.satoshi at oss

Sep 9, 2008, 12:31 AM

Post #1 of 4 (1130 views)
Heartbeat process failure and log message

Hi,

I got an unexpected ERROR message when I tested a Heartbeat process failure.

ha.cf:
-----
crm on
use_logd on
keepalive 1
deadtime 10
initdead 40
warntime 5
udpport 694
bcast eth0
node node01
node node02
watchdog /dev/watchdog
-----

heartbeat version: 2.1.4
OS version: RHEL 5.1

The test procedure:
1. start heartbeat
# /etc/init.d/heartbeat start

2. kill heartbeat process
# kill -9 <"heartbeat: write" or "heartbeat: read" process>
These processes are restarted.

3. stop heartbeat
# /etc/init.d/heartbeat stop

The following ERROR messages are logged during this stop:
---- ha-log -----
heartbeat[4632]: 2008/09/09_14:43:41 ERROR: Watchdog write magic character failure: closing /dev/watchdog!: Bad file descriptor
heartbeat[4632]: 2008/09/09_14:43:41 ERROR: Watchdog close(2) failed.: Bad file descriptor
-----------------

I think this has the same cause as Bugzilla No. 1702, so I made a patch.
http://developerbugs.linux-foundation.org/show_bug.cgi?id=1702

Please check attached patch.

Best Regards,
---
OKADA Satoshi
NTT Open Source Software Center
Attachments: heartbeat_close_watchdogfd.patch (0.82 KB)
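
For readers following the report, the symptom in the log can be reproduced in miniature. The C sketch below is not the Heartbeat source: `wd_fd`, `wd_close()` and both demo functions are hypothetical names, and /dev/null stands in for /dev/watchdog. It assumes the failure mode is a descriptor that was closed with a bare close(2) while the variable holding it kept its old value, so that a later close routine writes to a stale descriptor and gets EBADF ("Bad file descriptor").

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical stand-in for the watchdog descriptor variable. */
static int wd_fd = -1;

/*
 * Hypothetical close helper: write the magic character, close the
 * descriptor, then record that it is closed. Returns 0 on success or
 * when there is nothing to close, otherwise the errno of the first
 * call that failed.
 */
static int wd_close(void)
{
    int err = 0;

    if (wd_fd < 0)
        return 0;                   /* already closed: nothing to do */
    if (write(wd_fd, "V", 1) != 1)
        err = errno;
    if (close(wd_fd) < 0 && err == 0)
        err = errno;
    wd_fd = -1;                     /* never reuse the old number */
    return err;
}

/* A bare close(2) leaves wd_fd stale, so the later helper call fails. */
static int demo_stale_fd(void)
{
    wd_fd = open("/dev/null", O_WRONLY);
    if (wd_fd < 0)
        return -1;
    close(wd_fd);                   /* wd_fd still holds the old number */
    return wd_close();              /* write(2) fails on the closed fd */
}

/* Closing through the helper makes a second call a harmless no-op. */
static int demo_reset_fd(void)
{
    wd_fd = open("/dev/null", O_WRONLY);
    if (wd_fd < 0)
        return -1;
    if (wd_close() != 0)
        return -1;
    return wd_close();              /* nothing left to close */
}
```

Called from a small main(), demo_stale_fd() returns EBADF and demo_reset_fd() returns 0. The only difference between the two paths is whether the variable is reset when the descriptor is closed.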


dejanmm at fastmail

Sep 19, 2008, 7:54 AM

Post #2 of 4 (1004 views)
Re: Heartbeat process failure and log message

Hi Satoshi-san,


Sorry for the delay on this one.

Your patch looks fine to me. Did you test it?

Thanks,

Dejan

> Best Regards,
> ---
> OKADA Satoshi
> NTT Open Source Software Center
>

> --- heartbeat/heartbeat.c.orig 2008-09-09 15:08:30.000000000 +0900
> +++ heartbeat/heartbeat.c 2008-09-09 15:10:37.000000000 +0900
> @@ -679,7 +679,7 @@
>              break;
>
>      case 0: /* Child */
> -            close(watchdogfd);
> +            hb_close_watchdog();
>              curproc = &procinfo->info[fifoproc];
>              cl_malloc_setstats(&curproc->memstats);
>              cl_msg_setstats(&curproc->msgstats);
> @@ -798,7 +798,7 @@
>              break;
>
>      case 0: /* Child */
> -            close(watchdogfd);
> +            hb_close_watchdog();
>              curproc = &procinfo->info[ourproc];
>              cl_malloc_setstats(&curproc->memstats);
>              cl_msg_setstats(&curproc->msgstats);
> @@ -832,7 +832,7 @@
>              break;
>
>      case 0: /* Child */
> -            close(watchdogfd);
> +            hb_close_watchdog();
>              curproc = &procinfo->info[ourproc];
>              cl_malloc_setstats(&curproc->memstats);
>              cl_msg_setstats(&curproc->msgstats);
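
The patch touches three "case 0: /* Child */" branches, that is, code that runs in a freshly forked child. The following C sketch illustrates the general pattern only and makes no claim about how hb_close_watchdog() is implemented: `wdog_fd`, `wdog_release()` and `demo_fork_release()` are hypothetical names, and /dev/null stands in for /dev/watchdog. It shows a child releasing its inherited copy of a descriptor through a helper that also resets the variable, while the parent's copy stays usable.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the watchdog descriptor variable. */
static int wdog_fd = -1;

/* Hypothetical helper: close the descriptor and remember it is closed. */
static void wdog_release(void)
{
    if (wdog_fd >= 0) {
        close(wdog_fd);
        wdog_fd = -1;
    }
}

/*
 * Fork a child that releases its copy of the descriptor.
 * Returns 0 if the child ended with wdog_fd == -1 and the parent's
 * descriptor was still writable afterwards, otherwise -1.
 */
static int demo_fork_release(void)
{
    pid_t pid;
    int status;
    int ok;

    wdog_fd = open("/dev/null", O_WRONLY);
    if (wdog_fd < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        wdog_release();
        return -1;
    }
    if (pid == 0) {                 /* Child */
        wdog_release();
        /* A later release, e.g. on a shutdown path, is now a no-op. */
        wdog_release();
        _exit(wdog_fd == -1 ? 0 : 1);
    }

    /* Parent: its own copy of the descriptor is unaffected. */
    if (waitpid(pid, &status, 0) != pid) {
        wdog_release();
        return -1;
    }
    ok = WIFEXITED(status) && WEXITSTATUS(status) == 0
         && write(wdog_fd, "x", 1) == 1;
    wdog_release();
    return ok ? 0 : -1;
}
```

Because the child's variable is reset along with the close, any code the child runs later can test the variable instead of touching a descriptor number that is no longer valid.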


_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


okada.satoshi at oss

Sep 25, 2008, 3:39 AM

Post #3 of 4 (987 views)
Re: Heartbeat process failure and log message

Hi Dejan,


Thank you for your reply.

> Sorry for the delay on this one.
>
> Your patch looks fine to me. Did you test it?


Yes.

I tested several operations and checked the logs and resource
status using crm_mon. I did not find any problems.


---
Outline of the tests:
Two nodes (Active-Standby)
watchdog directive in ha.cf
resources: rscGroup(IPaddr, pgsq, Filesystem)

1. I tested the behavior of Heartbeat when the target processes did not go down.
The target processes are "FIFO reader", "write bcast", "read bcast",
"write ping" and "read ping".
1-1 A resource fails, and fail-over occurs.
1-2 Ping communication fails, and fail-over occurs.
1-3 The master control process is killed, and the node is rebooted by the watchdog.
1-4 Heartbeat runs continuously for about one hour.

2. I tested the behavior of Heartbeat when the target processes went down.
2-1 The target processes are killed and then restarted.
Afterwards, a resource fails, and fail-over occurs.
2-2 The "read ping" and "write ping" processes are killed.
Afterwards, ping communication fails, and fail-over occurs.
2-3 The target processes are killed and then restarted.
Afterwards, Heartbeat runs continuously for about one hour.



Best Regards,

OKADA Satoshi
NTT Open Source Software Center


dejanmm at fastmail

Sep 25, 2008, 4:29 AM

Post #4 of 4 (988 views)
Re: Heartbeat process failure and log message

Hi Satoshi-san,


Just applied your patch.

Cheers,

Dejan
