
casey at shobe
Nov 17, 2009, 6:45 AM
Heartbeat not taking over - ERROR: NV failure (msgfromsteam)
Hello,

I'm running Debian 5.0 (Lenny) with heartbeat 2.1.3 installed from Debian packages. I have configured heartbeat in a way that I believe is correct, and it works when I test it by manually stopping heartbeat on the primary node (the other node then takes over).

However, the hardware in these machines is old and somewhat unreliable; I think there may be a RAM issue. Every few days, the master locks up: it has no video output, does not respond to ssh, etc., and heartbeat and DRBD receive no response from it. However, it DOES still respond to pings. I've seen this with a number of servers in the past, so I'm assuming it's not an unusual condition and that others are dealing with it via heartbeat successfully.

When this happens, though, heartbeat on the still-running server does NOT take over. Here's all that I see in the log:

Nov 17 12:43:45 radha heartbeat: [28049]: WARN: No STONITH device configured.
Nov 17 12:43:45 radha heartbeat: [28049]: WARN: Shared disks are not protected.
Nov 17 12:43:45 radha heartbeat: [28049]: info: Resources being acquired from krishna.
Nov 17 12:43:45 radha heartbeat: [28049]: info: Link krishna:eth1 dead.
Nov 17 12:43:45 radha heartbeat: [5552]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
Nov 17 12:43:45 radha heartbeat: [28051]: WARN: ha_msg_add_nv_depth: line doesn't contain '='
Nov 17 12:43:45 radha heartbeat: [28051]: info: >>>
Nov 17 12:43:45 radha heartbeat: [28051]: ERROR: NV failure (msgfromsteam): [>>>#012]
Nov 17 12:43:45 radha heartbeat: [5553]: info: Local Resource acquisition completed.
Nov 17 12:43:45 radha heartbeat: [28049]: debug: StartNextRemoteRscReq(): child count 1

If I then leave it in this state and reboot the failed node, it does not become primary either, leaving both machines in standby state.
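
As an aside on the "line doesn't contain '='" warning in the excerpt above: the wording suggests the message was being read as name=value lines and that the line ">>>" has no "=" in it. The following is only a minimal sketch of that kind of parsing, not heartbeat's actual code; the function names and the sample field names (node, status) are made up for illustration. The "#012" in the logged payload is most likely the syslog daemon's octal escape for a newline, which is why the sketch strips a trailing newline first.

```python
def parse_nv_line(line):
    """Split one 'name=value' line; raise ValueError if there is no '='."""
    text = line.rstrip("\n")
    # partition() returns an empty separator when '=' is absent.
    name, sep, value = text.partition("=")
    if not sep:
        raise ValueError("line doesn't contain '=': %r" % text)
    return name, value


def parse_nv_message(raw):
    """Parse a newline-separated block of name=value lines into a dict."""
    fields = {}
    for line in raw.splitlines():
        name, value = parse_nv_line(line)
        fields[name] = value
    return fields
```

With this sketch, parse_nv_message("node=radha\nstatus=active\n") yields a dictionary with two entries, while parse_nv_line(">>>\n") raises ValueError, which is the analogue of the warning and error pair in the log. Whether the stray ">>>" came from the locked-up peer or from elsewhere cannot be determined from the excerpt alone.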

This is the log output from the rebooted machine:

Nov 17 14:38:08 krishna heartbeat: [2358]: info: Heartbeat generation: 1255459493
Nov 17 14:38:08 krishna heartbeat: [2358]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1
Nov 17 14:38:08 krishna heartbeat: [2358]: info: glib: ucast: bound send socket to device: eth1
Nov 17 14:38:08 krishna heartbeat: [2358]: info: glib: ucast: bound receive socket to device: eth1
Nov 17 14:38:08 krishna heartbeat: [2358]: info: glib: ucast: started on port 694 interface eth1 to 172.16.0.1
Nov 17 14:38:08 krishna heartbeat: [2358]: info: G_main_add_TriggerHandler: Added signal manual handler
Nov 17 14:38:08 krishna heartbeat: [2358]: info: G_main_add_TriggerHandler: Added signal manual handler
Nov 17 14:38:08 krishna heartbeat: [2358]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Nov 17 14:38:08 krishna heartbeat: [2358]: info: Local status now set to: 'up'
Nov 17 14:38:09 krishna heartbeat: [2358]: info: Link radha:eth1 up.
Nov 17 14:38:10 krishna heartbeat: [2358]: info: Status update for node radha: status active
Nov 17 14:38:10 krishna heartbeat: [2358]: WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 230 ms (> 50 ms) (GSource: 0x8272e40)
Nov 17 14:38:10 krishna heartbeat: [2429]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
Nov 17 14:38:10 krishna harc[2429]: [2436]: info: Running /etc/ha.d/rc.d/status status
Nov 17 14:38:10 krishna heartbeat: [2358]: info: Comm_now_up(): updating status to active
Nov 17 14:38:10 krishna heartbeat: [2358]: info: Local status now set to: 'active'
Nov 17 14:38:11 krishna heartbeat: [2358]: WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 480 ms (> 50 ms) (GSource: 0x8272e40)
Nov 17 14:38:11 krishna heartbeat: [2358]: info: remote resource transition completed.
Nov 17 14:38:11 krishna heartbeat: [2358]: info: remote resource transition completed.
Nov 17 14:38:11 krishna heartbeat: [2358]: info: Local Resource acquisition completed. (none)
Nov 17 14:38:11 krishna heartbeat: [2358]: info: Initial resource acquisition complete (T_RESOURCES(them))
Nov 17 14:38:11 krishna heartbeat: [2358]: WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 480 ms (> 100 ms) before being called (GSource: 0x8274ef0)
Nov 17 14:38:11 krishna heartbeat: [2358]: info: G_SIG_dispatch: started at 1718071206 should have started at 1718071158

The only error I see is the mysterious one referenced in the subject line. Any advice please?

Cheers,
-- 
Casey Allen Shobe
casey [at] shobe

_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
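
For anyone trying to pick the WARN and ERROR entries out of excerpts like the ones in this post, here is a small sketch that splits each syslog line into its parts and keeps only the chosen severities. The regular expression is an assumption based solely on the line layout shown above (timestamp, host, program, bracketed PID, level, message), and the names LOG_LINE and problems are hypothetical; other syslog configurations may need a different pattern.

```python
import re

# Layout assumed from the excerpts in this post:
#   "Nov 17 12:43:45 radha heartbeat: [28049]: WARN: message text"
LOG_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<prog>[^:]+): "
    r"\[(?P<pid>\d+)\]: (?P<level>\w+): (?P<msg>.*)$"
)


def problems(lines, levels=("WARN", "ERROR")):
    """Return (stamp, host, level, msg) for lines whose level is in levels."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        # Lines that do not fit the assumed layout are skipped silently.
        if match and match.group("level") in levels:
            hits.append((
                match.group("stamp"),
                match.group("host"),
                match.group("level"),
                match.group("msg"),
            ))
    return hits
```

Calling problems(open(path)) on a saved copy of the log, where path is wherever the excerpt was saved, lists the warnings and errors from both nodes together, which makes it easier to compare what each side reported around the failure.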