
Mailing List Archive: Linux-HA: Users

heartbeat takes all cpu

 

 



katkovsurvey at yahoo

Nov 8, 2010, 1:38 PM

Post #1 of 5
heartbeat takes all cpu

Reporting an issue. Thank you very much for any feedback.

Today the heartbeat process took all the CPU (99%-100%) and the load went up.
It logged this message repeatedly, 14 times a second:
heartbeat: [2553]: WARN: Gmain_timeout_dispatch: Dispatch function for
retransmit request took too long to execute: 20 ms (> 10 ms) (GSource:
0x8ade118)

Then it logged this message repeatedly:
heartbeat: [2553]: WARN: Gmain_timeout_dispatch: Dispatch function for
retransmit request was delayed 540 ms (> 500 ms) before being called (GSource:
0x9152988)
heartbeat: [2553]: info: Gmain_timeout_dispatch: started at 788890528 should
have started at 788890474
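
One possible reading of the two numbers in the last log line, offered as an
assumption rather than something the log states: if they are clock ticks, the
difference between them should match the reported 540 ms delay. A quick check
of that arithmetic:

```python
# Values copied from the log lines above.
started = 788890528          # "started at"
expected = 788890474         # "should have started at"
reported_delay_ms = 540      # "was delayed 540 ms"

# Assumption: both timestamps are in clock ticks of equal length.
ticks_late = started - expected
ms_per_tick = reported_delay_ms / ticks_late

print(ticks_late, ms_per_tick)
```

Under that assumption the dispatch ran 54 ticks late, which works out to 10 ms
per tick, so the two log lines are consistent with each other.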


I have a very simple setup of two nodes with only an IP address as the
heartbeat resource.
Heartbeat communicates over the local network (not over a crossover network
cable or a serial cable), so my first thought was that the problem was related
to network noise.
However, the nodes could reach each other on the local network fine, and no
problems have been reported with the local network.
The secondary node was fine, without any errors, and showed both nodes online
with the primary node holding the resource IP.

To fix it I had to reboot the primary node (the one with the problems).
The second node took control of the IP when I rebooted and released it back
when the main node came back online. So it's all working fine now.

Please share any ideas about why this happened, whether it could be a bug, and
whether I need to upgrade to the latest version. The current version is
heartbeat-2 2.1.3-2.

I also have heartbeat 2.1.3-2.
The OS is Ubuntu 8.04.4.

Thank you very much!
-Sam



_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


alanr at unix

Nov 8, 2010, 4:51 PM

Post #2 of 5
Re: heartbeat takes all cpu [In reply to]

If heartbeat had gone into an infinite loop, you would not be able to
do anything at all. These messages indicate that something (most
likely something else) was consuming a lot of CPU.

The heartbeat system has a number of processes. Can you say which one
you believe was consuming a lot of CPU?
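
One way to answer that question the next time it happens, sketched in Python
under the assumption that a procps-style `ps` is available on the node. The
function names are hypothetical, and the sketch assumes the heartbeat processes
can be told apart by the text in their `args` column:

```python
import subprocess


def heartbeat_rows(ps_output):
    """Parse `ps -eo pid,pcpu,args` output.

    Returns (pid, cpu_percent, args) tuples for every process whose
    command line mentions "heartbeat", busiest first.
    """
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        parts = line.split(None, 2)          # pid, %cpu, rest of the line
        if len(parts) < 3:
            continue
        pid, cpu, args = parts
        if "heartbeat" in args:
            rows.append((int(pid), float(cpu), args.strip()))
    return sorted(rows, key=lambda row: row[1], reverse=True)


def list_heartbeat_processes():
    """Run ps once and print the heartbeat processes with PID and CPU share."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,args"],
        capture_output=True, text=True, check=True,
    )
    for pid, cpu, args in heartbeat_rows(result.stdout):
        print(f"{pid:>7} {cpu:5.1f}% {args}")
```

Calling `list_heartbeat_processes()` while the load is high prints one line per
heartbeat process, so the PID and the role of the busy one can be quoted in a
report.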


katkovsurvey at yahoo

Nov 8, 2010, 8:09 PM

Post #3 of 5
Re: heartbeat takes all cpu [In reply to]

Thank you for replying, Alan.
When I ran the "top" command, the heartbeat process was shown in the COMMAND
column, with CPU usage going from 99% to 100%.
Unfortunately, I didn't check the PID.





lars.ellenberg at linbit

Nov 9, 2010, 5:49 AM

Post #4 of 5
Re: heartbeat takes all cpu [In reply to]

On Mon, Nov 08, 2010 at 05:51:08PM -0700, Alan Robertson wrote:
> If heartbeat had gone into an infinite loop, you would not be able to
> do anything at all. These messages indicate that something (most
> likely something else) was consuming a lot of CPU.

More likely it is because once you start to have packet loss,
rexmit packets are generated, causing more packets,
more packet loss, more rexmit requests...
a rexmit storm results in packet storm and high CPU consumption of
heartbeat processes.

This used to be mitigated by randomized delays of rexmit requests
and a max_rexmit_delay parameter.

Unfortunately the formula used to calculate those randomized delays
had been correct for RAND_MAX == SHRT_MAX, but would always return 0
for RAND_MAX == INT_MAX. The latter is the case for "modern" libc rand()
(and has been for several years already), which is why this "rexmit
causes packet storm" problem reoccurred.
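
The formula itself is not shown in this thread, so the following is not
heartbeat's code. It is a minimal Python sketch of one way to scale a raw
rand()-style value into a bounded delay that does not depend on the size of
RAND_MAX. The function names and the 250 ms bound are illustrative
assumptions, and the two constants are the typical values of SHRT_MAX and
INT_MAX on common platforms:

```python
import random

SHRT_MAX = 32767        # typical value of C's SHRT_MAX
INT_MAX = 2147483647    # typical value of C's INT_MAX (32-bit int)


def scaled_delay_ms(raw, rand_max, max_delay_ms):
    """Map raw in [0, rand_max] to an integer delay in [0, max_delay_ms).

    Multiplying before dividing keeps the precision; Python integers do
    not overflow, whereas C code would need a wider type for the product.
    """
    if not 0 <= raw <= rand_max:
        raise ValueError("raw must lie in [0, rand_max]")
    return raw * max_delay_ms // (rand_max + 1)


def sample_delays(rand_max, max_delay_ms, count, seed=0):
    """Draw `count` delays, simulating a rand() with the given RAND_MAX."""
    rng = random.Random(seed)
    return [
        scaled_delay_ms(rng.randint(0, rand_max), rand_max, max_delay_ms)
        for _ in range(count)
    ]


if __name__ == "__main__":
    for rand_max in (SHRT_MAX, INT_MAX):
        delays = sample_delays(rand_max, 250, 1000)
        print(rand_max, min(delays), max(delays))
```

With either RAND_MAX the delays spread over the whole interval instead of
collapsing to 0, which is the property the randomization needs in order to
keep retransmit requests from arriving all at once.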

The fix is in mercurial, but we have not gotten around to actually
releasing it yet.

There have been other possible causes for heartbeat using up too much
CPU in its communication layers, some related to too big payload
packets, but I don't remember the details right now.

I'd recommend using a more recent heartbeat + pacemaker anyways.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


katkovsurvey at yahoo

Nov 9, 2010, 7:08 PM

Post #5 of 5
Re: heartbeat takes all cpu [In reply to]

Thank you, Lars, for the feedback.



