
katkovsurvey at yahoo
Nov 9, 2010, 7:08 PM
Post #5 of 5
thank you Lars for the feedback

________________________________
From: Lars Ellenberg <lars.ellenberg [at] linbit>
To: linux-ha [at] lists
Sent: Tue, November 9, 2010 5:49:54 AM
Subject: Re: [Linux-HA] heartbeat takes all cpu

On Mon, Nov 08, 2010 at 05:51:08PM -0700, Alan Robertson wrote:
> Quoting Vsevolod Katkov <katkovsurvey [at] yahoo>:
>
> > Reporting an issue. Thank you very much for any feedback
> >
> > today heartbeat process took all the CPU (99%-100%) and load went up.
> > it put this message to log repeating 14 times a second:
> > heartbeat: [2553]: WARN: Gmain_timeout_dispatch: Dispatch function for
> > retransmit request took too long to execute: 20 ms (> 10 ms)
> > (GSource: 0x8ade118)
> >
> > then this message to log repeating:
> > heartbeat: [2553]: WARN: Gmain_timeout_dispatch: Dispatch function for
> > retransmit request was delayed 540 ms (> 500 ms) before being called
> > (GSource: 0x9152988)
> > heartbeat: [2553]: info: Gmain_timeout_dispatch: started at 788890528
> > should have started at 788890474
> >
> > I have very simple setup of two Nodes having only IP address as the
> > heartbeat resource.
> > Heartbeat communicates on the local network (not with crossover network
> > cable and not with serial cable). So first thing I thought it's related
> > to network noise.
> > Nodes could connect to each other on the local network fine and no
> > problems have been reported with local network.
> > Secondary node was fine without any errors and showing two nodes online
> > with primary node having resource IP.
> >
> > To fix i had to reboot primary node (the one with the problems):
> > Second node took control of IP as i rebooted and released it back when
> > main node came back online. So it's all fine working now.
> >
> > Please tell me any ideas why/what happened. If it can be a bug and if i
> > need to upgrade to latest version. Current version is heartbeat-2 2.1.3-2.
> >
> > I also have heartbeat 2.1.3-2.
> > OS is Ubuntu 8.04.4
> >
> > thank you very much!!!
> > -Sam
>
> If heartbeat had gone into an infinite loop, you would not be able to
> do anything at all. These messages indicate that something (most
> likely something else) was consuming a lot of CPU.

More likely it is because once you start to have packet loss,
rexmit packets are generated, causing more packets, more packet loss,
more rexmit requests...
a rexmit storm results in packet storm and high CPU consumption
of heartbeat processes.

This used to be mitigated by randomized delays of rexmit requests
and a max_rexmit_delay parameter.

Unfortunately the formula used to calculate those randomized delays
had been correct for RAND_MAX == SHRT_MAX,
but would always return 0 for RAND_MAX == INT_MAX.
The latter is the case for "modern" libc rand() (since several years
already), which is why this "rexmit causes packet storm" problem
reoccurred.

The fix is in mercurial, but we have not gotten around to actually
release it yet.

There have been other possible causes for heartbeat using up too much
CPU in its communication layers, some related to too big payload
packets, but I don't remember the details right now.

I'd recommend using a more recent heartbeat + pacemaker anyways.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
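
Editor's note on the RAND_MAX problem described in the reply above. The thread does not show heartbeat's actual delay formula, so what follows is only a minimal sketch of how a formula of that kind can be correct for RAND_MAX == SHRT_MAX and still return 0 every time for RAND_MAX == INT_MAX. The functions `buggy_delay` and `portable_delay` and the value 250 for the maximum delay are hypothetical, chosen for illustration. The sketch is in Python, with the rand() result and RAND_MAX passed in explicitly so both libc cases can be compared on one machine. Integer division with `//` matches C here because all values are non-negative.

```python
import random

SHRT_MAX = 32767        # RAND_MAX on older C libraries
INT_MAX = 2147483647    # RAND_MAX on "modern" libc


def buggy_delay(r, rand_max, max_delay):
    """Hypothetical formula that silently assumes a 15-bit rand() result.

    This is NOT heartbeat's code, only an illustration of the failure mode.
    """
    # The numerator is at most 32767 * max_delay. Dividing by a 15-bit
    # rand_max gives 0..max_delay, but dividing by a 31-bit rand_max
    # truncates to 0 for any max_delay below roughly 65538.
    return (r & 0x7FFF) * max_delay // rand_max


def portable_delay(r, rand_max, max_delay):
    """Scale r from [0, rand_max] to [0, max_delay] for any rand_max."""
    # Python integers do not overflow; a C version of this line would
    # need a 64-bit intermediate for the multiplication.
    return r * (max_delay + 1) // (rand_max + 1)


if __name__ == "__main__":
    rng = random.Random(0)  # stand-in for C's rand(), seeded for repeatability
    for name, rand_max in (("SHRT_MAX", SHRT_MAX), ("INT_MAX", INT_MAX)):
        samples = [rng.randint(0, rand_max) for _ in range(5)]
        print(name,
              "buggy:", [buggy_delay(r, rand_max, 250) for r in samples],
              "portable:", [portable_delay(r, rand_max, 250) for r in samples])
```

With every delay collapsing to 0, all retransmit requests go out immediately instead of being spread over the delay window, which is the condition the reply identifies as the cause of the rexmit storm.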