
linux-ha at thomas-alfeld
Aug 3, 2005, 4:17 AM
Post #6 of 7
AW: AW: ERROR: 100 NULL cf-read() running heartbeat 2.0.0
Okay, Alan.

My boxes have 6 interfaces and I've set up 3 bonds. Bond2 is for bcast heartbeat and DRBD with 192.168.1.x IPs. The nodes here are cross-connected. Nagios very frequently checks other computers via the bond0 and bond1 interfaces, and the other node via bond2 too.

I'll try what you said. Perhaps I didn't shut down heartbeat cleanly while changing the configuration.

Thanks a lot!

Ulrich H. Thomas

--
Ulrich H. Thomas
Ham-Call: DG8OBZ
Schützenweg 42
ICQ : 3188985
D-31061 Alfeld (Leine)
Germany

-----Original Message-----
From: linux-ha-bounces [at] lists [mailto:linux-ha-bounces [at] lists] On behalf of Alan Robertson
Sent: Wednesday, August 3, 2005 12:15
To: General Linux-HA mailing list
Subject: Re: AW: [Linux-HA] ERROR: 100 NULL cf-read() running heartbeat 2.0.0

Ulrich H. Thomas wrote:
> Okay, here are the log file and the configs.
>
> http://www.rz.unibw-muenchen.de/~j4tu0736/ha/
>
> Sorry, but my webmailer has a problem with the attachments. Hope
> that's okay now.

Thanks for the information. There are some inconsistencies in it below...

Here are the interfaces you show in your ha.cf file.

    serial /dev/ttyS1 # Linux
    bcast bond2 # Linux
    ping_group group1 10.102.101.113 10.102.101.241

You're doing broadcast heartbeats over a channel bonded interface, and you're pinging over a single ping group.

And here are the processes you show as running from your logs from the occurrence at 06:00:03.

    31180 master control process
    31182 FIFO reader
    31183 serial write
    31184 serial read
    31185 bcast write
    31186 bcast read
    31186 ping_group write
    31188 ping_group read
    31189 ??? write   WHERE DID THIS COME FROM?
    31190 ??? read    WHERE DID THIS COME FROM?

    Aug 2 06:00:03 bnhpsryy heartbeat: [31188]: ERROR: 100 NULL vf->read() returns in a row. Exiting.: Resource temporarily unavailable
    Aug 2 06:00:03 bnhpsryy heartbeat: [31190]: ERROR: 100 NULL vf->read() returns in a row. Exiting.: Resource temporarily unavailable

The last two processes should not exist.
Did you maybe delete a ping_group from ha.cf after you started heartbeat? If so, then the explanation below really makes lots of sense...

In any case, it looks like the errors are coming from the "ping" interface - assuming you deleted from the end and not from the middle of the ha.cf file.

PLEASE explain what you did to the ha.cf file after this run, and before attaching the ha.cf file.

If you just deleted another ping_group, and it wasn't above the bcast in the ha.cf, then this is the code which is failing 100 times in a row in ping_group.c:

    *lenp = 0;

    if ((numbytes=recvfrom(ei->sock, (void *) &buf.cbuf
    ,   sizeof(buf.cbuf)-1, 0, (struct sockaddr *)&their_addr
    ,   &addr_len)) < 0) {
        if (errno != EINTR) {
            PILCallLog(LOG, PIL_CRIT
            ,   "Error receiving from socket: %s"
            ,   strerror(errno));
        }
        return(NULL);
    }

OR maybe, it's failing later on, since we're not getting any other failure messages, and 100 EINTRs in a row seems unlikely...

    if(!node) {
        return(NULL);
    }

Or maybe this code...

    msg = wirefmt2msg(msgstart, bufmax - msgstart, MSG_NEEDAUTH);
    if(msg == NULL) {
        return(NULL);
    }

Now what either of these would mean is that some process on this machine is pinging something else on the machine and has gotten back 100 packets before we got back any of our own ping packets...

Is it possible that something on this machine (nagios?) is doing massive pings to other machines periodically?

Like maybe this?

Aug 1 15:57:03 bnhpsryy nagios: HOST ALERT: EDS_eds-db;UP;SOFT;3;PING OK - Packet loss = 0%, RTA = 1.44 ms

I think nagios is pinging the heck out of something, and it's making our code a little sick. Now, this isn't to say that this _should_ make us sick, but at least it's a start on diagnosing it.

Can you turn off - or slow down the nagios ping activity for a while and see if this helps?
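To make the hypothesis above concrete: a reader on an ICMP socket can be handed echo replies that answer somebody else's pings, and each of those is a packet it has to discard. The following is only a minimal sketch of that idea. The struct echo_hdr type and the functions is_our_reply() and foreign_run_length() are hypothetical names invented for illustration; they are not part of heartbeat, and the real checks in ping_group.c are the ones quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of an ICMP echo header (not heartbeat's type). */
struct echo_hdr {
    uint8_t  type;  /* 0 = echo reply, 8 = echo request (RFC 792) */
    uint16_t id;    /* identifier chosen by whoever sent the request */
};

#define ECHO_REPLY 0

/* Return 1 if the packet answers one of our own pings, 0 otherwise. */
static int is_our_reply(const struct echo_hdr *h, uint16_t our_id)
{
    return h != NULL && h->type == ECHO_REPLY && h->id == our_id;
}

/*
 * Count how many packets arrive before the first one that is ours.
 * In this sketch each of those is a packet the reader would discard,
 * i.e. one more NULL return in a row.
 */
static int foreign_run_length(const struct echo_hdr *pkts, size_t n,
                              uint16_t our_id)
{
    int run = 0;

    assert(pkts != NULL || n == 0);
    for (size_t i = 0; i < n && !is_our_reply(&pkts[i], our_id); i++) {
        run++;
    }
    return run;
}
```

If another program on the same host pings many targets at once, foreign_run_length() over the arriving packets can easily grow large before one of our own replies shows up, which is the situation the message describes.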
Aug 1 15:52:43 bnhpsryy nagios: SERVICE ALERT: WAP-GW-bypass-LH-sc-wap;PING bnsncla2;WARNING;SOFT;1;PING WARNING - Packet loss = 0%, RTA = 199.56 ms
Aug 1 15:53:43 bnhpsryy nagios: SERVICE ALERT: WAP-GW-bypass-LH-sc-wap;PING bnsncla2;OK;SOFT;2;PING OK - Packet loss = 0%, RTA = 0.94 ms
Aug 1 15:57:03 bnhpsryy nagios: HOST ALERT: EDS_eds-db;UP;SOFT;3;PING OK - Packet loss = 0%, RTA = 1.44 ms
Aug 1 15:57:03 bnhpsryy nagios: SERVICE ALERT: EDS_eds-db;PING;CRITICAL;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
Aug 1 15:57:13 bnhpsryy nagios: HOST ALERT: EDS_eds-omb;UP;SOFT;2;PING OK - Packet loss = 0%, RTA = 1.18 ms
Aug 1 15:57:13 bnhpsryy nagios: SERVICE ALERT: EDS_eds2b;PING;CRITICAL;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
Aug 1 16:00:13 bnhpsryy nagios: SERVICE ALERT: EDS_eds-db;PING;OK;SOFT;2;PING OK - Packet loss = 0%, RTA = 1.15 ms
Aug 1 16:00:23 bnhpsryy nagios: SERVICE ALERT: EDS_eds2b;PING;OK;SOFT;2;PING OK - Packet loss = 0%, RTA = 1.41 ms
Aug 1 16:05:33 bnhpsryy nagios: SERVICE ALERT: MDI-bnhpsrg7;PING;CRITICAL;HARD;1;CRITICAL - Plugin timed out after 10 seconds
Aug 1 16:07:13 bnhpsryy nagios: SERVICE ALERT: MDI-bnhpsrg7;PING Management;CRITICAL;HARD;1;CRITICAL - Plugin timed out after 10 seconds
Aug 1 16:10:33 bnhpsryy nagios: HOST ALERT: MDI-bnhpsrg7;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 1.05 ms
Aug 1 16:10:33 bnhpsryy nagios: SERVICE ALERT: MDI-bnhpsrg7;PING;OK;HARD;1;PING OK - Packet loss = 0%, RTA = 0.93 ms
Aug 1 16:12:13 bnhpsryy nagios: SERVICE ALERT: MDI-bnhpsrg7;PING Management;OK;HARD;1;PING OK - Packet loss = 0%, RTA = 0.29 ms
Aug 1 18:58:54 bnhpsryy nagios: HOST ALERT: EDS_eds2a;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 1.48 ms
Aug 1 19:11:14 bnhpsryy nagios: SERVICE ALERT: WAP-GW-SUT-bnsnsrgk;PING Management;WARNING;SOFT;1;PING WARNING - Packet loss = 0%, RTA = 199.38 ms
Aug 1 19:12:14 bnhpsryy nagios: SERVICE ALERT: WAP-GW-SUT-bnsnsrgk;PING Management;OK;SOFT;2;PING OK - Packet loss = 0%, RTA = 0.25 ms
Aug 1 19:13:04 bnhpsryy nagios: SERVICE ALERT: WAP-GW-bypass-LH-sc-wap;PING bnsncla2;WARNING;SOFT;1;PING WARNING - Packet loss = 0%, RTA = 199.32 ms
Aug 1 19:14:04 bnhpsryy nagios: SERVICE ALERT: WAP-GW-bypass-LH-sc-wap;PING bnsncla2;OK;SOFT;2;PING OK - Packet loss = 0%, RTA = 0.97 ms

A reasonable fix might be to change the value of "maxnullcount" in heartbeat.c to 10000 or something like that. It's really only intended to detect something which is badly broken. If it's badly broken, then it'll fail pretty soon anyway...

    /* Create a read child process (to read messages from hb medium) */
    static void
    read_child(struct hb_media* mp)
    {
        IPC_Channel* ourchan = mp->rchan[P_READFD];
        int nullcount=0;
        const int maxnullcount=10000;

--
Alan Robertson <alanr [at] unix>

"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce

_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
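The effect of raising "maxnullcount" can be sketched in isolation. This is not heartbeat's read_child(); first_trip() is a hypothetical helper that replays a recorded sequence of read results and reports where a consecutive-NULL limit would be hit. It assumes that any successful read resets the counter (as "in a row" in the error message suggests) and that the reader gives up as soon as the count reaches the limit; the exact comparison in heartbeat.c may differ.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not heartbeat code.
 * got_msg[i] is 1 if read number i returned a message, 0 if it returned NULL.
 * Returns the index of the read at which the consecutive-NULL limit is
 * reached, or -1 if the limit is never reached.
 */
static long first_trip(const int *got_msg, size_t n, int maxnullcount)
{
    int nullcount = 0;

    assert(maxnullcount > 0);
    for (size_t i = 0; i < n; i++) {
        if (got_msg[i]) {
            nullcount = 0;              /* a good read ends the run */
        } else if (++nullcount >= maxnullcount) {
            return (long)i;             /* the reader would exit here */
        }
    }
    return -1;
}
```

Fed a burst of 149 NULL reads followed by one good read, a limit of 100 trips at index 99, while a limit of 10000 lets the reader survive the burst. A reader that is truly broken never gets a good read and still reaches the larger limit, only later.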