
Mailing List Archive: Linux-HA: Users

Heartbeat not taking over - ERROR: NV failure (msgfromsteam)

 

 



casey at shobe

Nov 17, 2009, 6:45 AM

Post #1 of 3 (1071 views)

Hello,

I'm running Debian 5.0 (Lenny) with heartbeat 2.1.3 installed from Debian
packages.

I have configured heartbeat in a manner that I feel is correct and it seems
to work when I test it by manually stopping heartbeat on the primary node
(the other node then takes over).

However, the hardware in these machines is old and somewhat unreliable; I
think there may be a RAM issue. Every few days the master locks up: it has
no video output, does not respond to ssh, etc., and heartbeat and DRBD
receive no response from it. It DOES still respond to pings, though. I've
seen this with a bunch of servers in the past, so I'm assuming it's not an
unusual condition and that others are dealing with it successfully via
heartbeat.

When this happens, heartbeat on the still-running server does NOT
take over. Here's all that I see in the log:

Nov 17 12:43:45 radha heartbeat: [28049]: WARN: No STONITH device configured.
Nov 17 12:43:45 radha heartbeat: [28049]: WARN: Shared disks are not protected.
Nov 17 12:43:45 radha heartbeat: [28049]: info: Resources being acquired from krishna.
Nov 17 12:43:45 radha heartbeat: [28049]: info: Link krishna:eth1 dead.
Nov 17 12:43:45 radha heartbeat: [5552]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
Nov 17 12:43:45 radha heartbeat: [28051]: WARN: ha_msg_add_nv_depth: line doesn't contain '='
Nov 17 12:43:45 radha heartbeat: [28051]: info: >>>
Nov 17 12:43:45 radha heartbeat: [28051]: ERROR: NV failure (msgfromsteam): [>>>#012]
Nov 17 12:43:45 radha heartbeat: [5553]: info: Local Resource acquisition completed.
Nov 17 12:43:45 radha heartbeat: [28049]: debug: StartNextRemoteRscReq(): child count 1
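
To pick the warnings and errors out of a heartbeat log like the one above, a minimal Python sketch follows. It assumes the syslog layout seen in this thread ("Mon DD HH:MM:SS host tag: [pid]: LEVEL: message") and treats any line that does not begin with a timestamp as a continuation of the previous entry. The function names `join_wrapped` and `problems` are made up for illustration and are not part of heartbeat.

```python
import re

# Assumed layout, taken from the log lines quoted in this thread:
#   "Nov 17 12:43:45 radha heartbeat: [28051]: ERROR: message"
ENTRY_START = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} ")
ENTRY = re.compile(
    r"^(?P<time>[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<tag>[^:]+): \[(?P<pid>\d+)\]: "
    r"(?P<level>[A-Za-z]+): (?P<msg>.*)$"
)


def join_wrapped(lines):
    """Merge lines that lack a leading timestamp into the previous entry."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line.strip()
    return entries


def problems(lines, levels=("WARN", "ERROR")):
    """Return (time, host, level, message) for entries at the given levels."""
    found = []
    for entry in join_wrapped(lines):
        m = ENTRY.match(entry)
        if m and m.group("level") in levels:
            found.append(
                (m.group("time"), m.group("host"), m.group("level"), m.group("msg"))
            )
    return found


if __name__ == "__main__":
    sample = [
        "Nov 17 12:43:45 radha heartbeat: [28051]: ERROR: NV failure (msgfromsteam):",
        "[>>>#012]",
        "Nov 17 12:43:45 radha heartbeat: [5553]: info: Local Resource acquisition",
        "completed.",
    ]
    for item in problems(sample):
        print(item)
```

Passing the lines of a real log file (for example `open(path).readlines()`, with `path` pointing wherever syslog writes heartbeat messages on the system in question) to `problems` should work the same way.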

If I then leave it in this state and reboot the failed node, it does not
become primary either, leaving both machines in standby state. This is the
log output from the rebooted machine:

Nov 17 14:38:08 krishna heartbeat: [2358]: info: Heartbeat generation: 1255459493
Nov 17 14:38:08 krishna heartbeat: [2358]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1
Nov 17 14:38:08 krishna heartbeat: [2358]: info: glib: ucast: bound send socket to device: eth1
Nov 17 14:38:08 krishna heartbeat: [2358]: info: glib: ucast: bound receive socket to device: eth1
Nov 17 14:38:08 krishna heartbeat: [2358]: info: glib: ucast: started on port 694 interface eth1 to 172.16.0.1
Nov 17 14:38:08 krishna heartbeat: [2358]: info: G_main_add_TriggerHandler: Added signal manual handler
Nov 17 14:38:08 krishna heartbeat: [2358]: info: G_main_add_TriggerHandler: Added signal manual handler
Nov 17 14:38:08 krishna heartbeat: [2358]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Nov 17 14:38:08 krishna heartbeat: [2358]: info: Local status now set to: 'up'
Nov 17 14:38:09 krishna heartbeat: [2358]: info: Link radha:eth1 up.
Nov 17 14:38:10 krishna heartbeat: [2358]: info: Status update for node radha: status active
Nov 17 14:38:10 krishna heartbeat: [2358]: WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 230 ms (> 50 ms) (GSource: 0x8272e40)
Nov 17 14:38:10 krishna heartbeat: [2429]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
Nov 17 14:38:10 krishna harc[2429]: [2436]: info: Running /etc/ha.d/rc.d/status status
Nov 17 14:38:10 krishna heartbeat: [2358]: info: Comm_now_up(): updating status to active
Nov 17 14:38:10 krishna heartbeat: [2358]: info: Local status now set to: 'active'
Nov 17 14:38:11 krishna heartbeat: [2358]: WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 480 ms (> 50 ms) (GSource: 0x8272e40)
Nov 17 14:38:11 krishna heartbeat: [2358]: info: remote resource transition completed.
Nov 17 14:38:11 krishna heartbeat: [2358]: info: remote resource transition completed.
Nov 17 14:38:11 krishna heartbeat: [2358]: info: Local Resource acquisition completed. (none)
Nov 17 14:38:11 krishna heartbeat: [2358]: info: Initial resource acquisition complete (T_RESOURCES(them))
Nov 17 14:38:11 krishna heartbeat: [2358]: WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 480 ms (> 100 ms) before being called (GSource: 0x8274ef0)
Nov 17 14:38:11 krishna heartbeat: [2358]: info: G_SIG_dispatch: started at 1718071206 should have started at 1718071158

The only error I see is the mysterious one referenced in the subject line.
Any advice please?

Cheers,
--
Casey Allen Shobe
casey [at] shobe
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Rolf.Schmidt at novell

Nov 18, 2009, 3:45 AM

Post #2 of 3 (965 views)
Re: Heartbeat not taking over - ERROR: NV failure (msgfromsteam)

Hi,

On Tue, 17 Nov 2009, Casey Allen Shobe wrote:

> Nov 17 12:43:45 radha heartbeat: [28049]: WARN: No STONITH device configured.

Do not use it without STONITH. I assume you have STONITH enabled for the cluster
in crm_config but no device configured.

And update to the latest version.


regards,

Rolf Schmidt




casey at shobe

Nov 18, 2009, 12:22 PM

Post #3 of 3 (942 views)
Re: Heartbeat not taking over - ERROR: NV failure (msgfromsteam)

On Wed, Nov 18, 2009 at 6:45 AM, Rolf Schmidt <Rolf.Schmidt [at] novell> wrote:

> Do not use it without STONITH. I assume you have STONITH enabled for the
> cluster in crm_config but no device configured.
>

I don't intentionally have any stonith configured - I don't have any
hardware to enable this. I cannot find any crm_config at all.


> And update to the latest version.
>

I need to stay on the version I have, since that is the one Debian supports.

--
Casey Allen Shobe
casey [at] shobe
