Mailing List Archive: Linux-HA: Users

Problem with heartBeat

 

 



hugo at bitmailer

Mar 24, 2000, 2:02 AM

Post #1 of 9
Problem with heartBeat

Hi, I've just installed the HeartBeat-0.4.6d version on my machines
(the one without the 'ipconfig' bug), but something is not working.

I have two machines, each with one Ethernet interface connected to a
hub, and I've also connected them serial port to serial port with a
null modem cable.

When I run the heartbeat program on the two machines it works perfectly:
the master takes the 'virtual IP', and if I stop heartbeat on the master,
the secondary takes the 'virtual IP'. If I disconnect the serial cable
it keeps working; they use eth0 to communicate.

But if I disconnect the master's crossover cable from eth0, the
secondary machine doesn't take the IP. If I ping the 'virtual IP', the
master can't respond (because it is not connected), and the secondary
doesn't know that the master is disconnected, so failover doesn't
happen.

Is this normal? Does Fake have to test whether the IP responds? What
can I do to fix it?

Thanks.
Hugo.


alanr at suse

Mar 24, 2000, 6:35 AM

Post #2 of 9
Problem with heartBeat [In reply to]

This is a well-known behavior of heartbeat at this point. Heartbeat
will take over ONLY on a dead machine, not on failure of ethernet, etc.
What has been suggested is that you use the "Mon" package to shut down
heartbeat when the ethernet dies. Mon has a ping-test module for
testing whether a particular IP address is reachable.
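The workaround described above (probe an address, and stop heartbeat locally once the Ethernet path is gone so that the peer takes over) could be sketched roughly as follows. This is a minimal illustration, not Mon and not part of heartbeat: the function names, the threshold of three failed probes and the polling interval are assumptions, and the ping flags are those of Linux iputils.

```python
import subprocess
import time


def is_reachable(host, timeout_s=2):
    """Send one ICMP echo with the system ping (Linux iputils flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def should_stop_heartbeat(history, threshold=3):
    """True once the most recent `threshold` probes have all failed."""
    return len(history) >= threshold and not any(history[-threshold:])


def watch(probe, stop, threshold=3, interval_s=5.0, sleep=time.sleep):
    """Poll `probe` until it has failed `threshold` times in a row, then call `stop`.

    `probe` returns True/False; `stop` is whatever shuts heartbeat down locally.
    Returns the number of probes that were made.
    """
    history = []
    while True:
        history.append(probe())
        if should_stop_heartbeat(history, threshold):
            stop()
            return len(history)
        sleep(interval_s)
```

In practice `probe` would be something like `lambda: is_reachable("192.168.1.1")`, pointing at a router on the public network, and `stop` would run the local heartbeat init script, for example `lambda: subprocess.run(["/etc/init.d/heartbeat", "stop"])`. Both the address and the script path are placeholders to adapt to the actual installation.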

I'm not sure who has implemented this at this point. Geoff Nordli asked
about it, but I don't know if he has some Mon config files to show you,
or not.

I suppose that what we should do is get a sample Mon config file that
works, and document it in Rudy's docs.

Geoff? Do you have this working yet :-)

Rudy: Could you get this documented?

Rudy and I have been very distracted for the last month or two for
different reasons. Rudy's wife just had their first baby (congrats to
Rudy!), and I just started working for SuSE, and started helping get
SGI's FailSafe program available to the open source community.

-- Alan Robertson
alanr [at] suse


geoff at gnaa

Mar 24, 2000, 10:27 AM

Post #3 of 9
Problem with heartBeat [In reply to]

Hi Alan.

I probably won't get to install this until the middle
of April. I am just in the process of gathering info
to make sure that it is feasible.

geoff



----------------------------------------------------------------------------
Linux HA Web Site:
http://linux-ha.org/
Linux HA HOWTO:
http://metalab.unc.edu/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html
----------------------------------------------------------------------------


Ade.Rixon at rsi

Mar 30, 2000, 6:46 AM

Post #4 of 9
Problem with heartBeat [In reply to]

24 Mar 06:35:57 AM: Meanwhile in the Sheraton, Alan Robertson wrote:
> What has been suggested is that you use the "Mon" package to shut down
> heartbeat when the ethernet dies. Mon has a ping-test module for
> testing whether a particular IP address is reachable.
> I suppose that what we should do is get a sample Mon config file that
> works, and document it in Rudy's docs.
>-- End of excerpt from Alan Robertson

I have done some exploratory work on using Mon with our own HA product,
RSF-1, including an example config that performs network monitoring. It's
a bit rough 'n' ready (proof of concept only, it only logs messages rather
than causing failovers) but in case it's useful, I attach the configs I
used below together with an extract from my report that explains the idea.
It's fairly easy to do, once you get your head around Mon's config syntax.

Note that this is not anything remotely akin to an "official" release or
indication of plans to do anything further with Mon. If I had more
personal time, I'd be delighted to assist further with Linux-HA.
Unfortunately, this is not currently the case.

Cheers,
Ade_
/
--
| Ade Rixon, Consultant, RSi Solutions Ltd |
"Such a muddy line between
The things you want
And the things you have to do" - "Leaving Las Vegas", Sheryl Crow

[Attachment: mon-report.txt]

Example usage

Assuming we have two RSF-1 servers with two network links and a number of
additional, independent network nodes by which we can reliably verify
connectivity, the strategy using Mon is:

* Each server pings all the independent nodes to check basic
connectivity.
* Server A pings server B's public, static address. This monitor only
executes if the basic connectivity monitor above is successful (i.e.
there is a dependency on it).
* When Server A fails to reach server B, it sends a remote alert (trap)
to B's private address.
* If B's basic connectivity check fails (dependency) and it receives a
trap from A, it assumes its public interface has failed and triggers an
alert that shuts down all running RSF-1 services.
* Optionally, A can also trigger an alert to put RSF-1 services into
automatic, ensuring that it will take them over.

Note that if A's interface fails or the entire network fails, service
shutdown will not occur because of the dependency on basic connectivity.
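The decision rules in the strategy above can be written out as a small truth-table check. This is only a sketch of the logic as the report describes it, not of Mon's actual dependency handling; the function and parameter names are invented for illustration.

```python
def a_sends_trap(a_connectivity_ok, a_reaches_b):
    """Server A alerts B only if A's own basic connectivity is fine
    (the dependency) and B's public address does not answer."""
    return a_connectivity_ok and not a_reaches_b


def b_shuts_down_services(b_connectivity_ok, trap_from_a):
    """Server B gives up its services only if its own basic connectivity
    check fails and A has sent a trap over the private link."""
    return (not b_connectivity_ok) and trap_from_a


# (description, A connectivity ok, A reaches B, B connectivity ok)
SCENARIOS = [
    ("B's public interface fails", True, False, False),
    ("A's interface fails", False, False, True),
    ("entire network fails", False, False, False),
    ("everything healthy", True, True, True),
]

for name, a_ok, a_sees_b, b_ok in SCENARIOS:
    trap = a_sends_trap(a_ok, a_sees_b)
    print(f"{name}: trap={trap}, shutdown={b_shuts_down_services(b_ok, trap)}")
```

Only the first scenario leads to a shutdown, which matches the note that a failure of A's interface or of the whole network must not cause B to stop its services.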

Somewhat confusingly, on pulling a network cable to a server, Mon logs
failures for both connectivity and the specific check for the other server,
even though only the latter generates an alert. The dep_behaviour config
keyword may alter this.

There is one issue with the example setup: fping returns a failure if any
of the remote nodes it checks is unreachable, which is not an effective
test of basic connectivity. We want the ping monitor to signal a failure
only if all the remote nodes are unreachable. The monitor script could
probably be hacked to implement this feature.
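One way to get the "fail only if all nodes are unreachable" behaviour is to post-process fping's per-host output instead of relying on its exit status. The sketch below is an assumption about how such a wrapper might look, not the fping.monitor shipped with Mon; it relies on fping's default "HOST is alive" / "HOST is unreachable" lines, and the function names are hypothetical.

```python
def parse_fping(output):
    """Map each host to True (alive) or False (unreachable) from fping's
    default per-host output lines; any other line is ignored."""
    status = {}
    for line in output.splitlines():
        line = line.strip()
        if line.endswith(" is alive"):
            status[line[: -len(" is alive")]] = True
        elif line.endswith(" is unreachable"):
            status[line[: -len(" is unreachable")]] = False
    return status


def connectivity_failed(status):
    """Report failure only when at least one host was checked and none answered."""
    return bool(status) and not any(status.values())


# Hosts taken from the "network" hostgroup in the example configs.
sample = "tom is alive\njerry is unreachable\nrouter-fe0 is unreachable\n"
print(connectivity_failed(parse_fping(sample)))
```

A wrapper monitor would then exit non-zero only when `connectivity_failed` returns True. An empty result (no recognisable lines) is treated as "not failed" here, which is a design choice that may need revisiting.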

Ade Rixon,
RSi,
2000-03-03

[Attachment: mon.cf.ip100]

#
# Example "mon.cf" configuration for "mon".
#
# $Id: example.cf,v 1.16 1999/08/16 00:25:06 trockij Exp $
#

#
# This works with 0.38pre8
#

#
# global options
#
cfbasedir = /var/tmp/mon/etc
basedir = /var/tmp/mon
alertdir = /var/tmp/mon/alert.d
mondir = /var/tmp/mon/mon.d
statedir = /var/tmp/mon/state.d
logdir = /var/tmp/mon/log.d
dtlogging = yes
dtlogfile = /var/tmp/mon/log.d/dtlog
maxprocs = 20
histlength = 100
randstart = 60s

#
# authentication types:
#   getpwnam      standard Unix passwd, NOT for shadow passwords
#   shadow        Unix shadow passwords (not implemented)
#   userfile      "mon" user file
#
authtype = userfile
userfile = /var/tmp/mon/etc/user.cf

#
# NB: hostgroup and watch entries are terminated with a blank line (or
# end of file). Don't forget the blank lines between them or you lose.
#

#
# group definitions (hostnames or IP addresses)
#
hostgroup cluster ip100 ip101 ip110

hostgroup node1 ip100

hostgroup node2 ip101

hostgroup node3 ip110

hostgroup wwwservers www

hostgroup network tom jerry router-fe0

#
# For the servers in building 1, monitor ping and HTTP
# BOFH is on weekend call :)
#
watch wwwservers
    service ping
        description basic connectivity
        interval 10s
        monitor fping.monitor
        period wd {Sun-Sat}
        #alertevery 24h
        numalerts 1
        alert mail.alert root
        upalert mail.alert -S "web server host online again" root
        alertafter 3
    service http
        description HTTP response
        interval 15s
        monitor http.monitor
        period wd {Sun-Sat}
        #alertevery 24h
        numalerts 1
        alert mail.alert root
        upalert mail.alert -S "HTTP service is back" root

watch network
    service net
        description "network connectivity"
        interval 6s
        monitor fping.monitor
        period wd {Sun-Sat}
        numalerts 1
        alert file.alert /var/tmp/mon/log.d/alert.log -S "Lost network connectivity"
        alertafter 2

watch node2
    service net
        description "node2 connectivity"
        interval 12s
        dep_behaviour m
        depend network:net
        monitor fping.monitor
        period wd {Sun-Sat}
        numalerts 1
        alert remote.alert -H ip101-priv
        alert file.alert /var/tmp/mon/log.d/alert.log -S "Lost connection to ip101"
        alertafter 3


[Attachment: mon.cf.ip101]

#
# Example "mon.cf" configuration for "mon".
#
# $Id: example.cf,v 1.16 1999/08/16 00:25:06 trockij Exp $
#

#
# This works with 0.38pre8
#

#
# global options
#
cfbasedir = /var/tmp/mon/etc
basedir = /var/tmp/mon
alertdir = /var/tmp/mon/alert.d
mondir = /var/tmp/mon/mon.d
statedir = /var/tmp/mon/state.d
logdir = /var/tmp/mon/log.d
dtlogging = yes
dtlogfile = /var/tmp/mon/log.d/dtlog
maxprocs = 20
histlength = 100
randstart = 60s

#
# authentication types:
#   getpwnam      standard Unix passwd, NOT for shadow passwords
#   shadow        Unix shadow passwords (not implemented)
#   userfile      "mon" user file
#
authtype = userfile
userfile = /var/tmp/mon/etc/user.cf

#
# NB: hostgroup and watch entries are terminated with a blank line (or
# end of file). Don't forget the blank lines between them or you lose.
#

#
# group definitions (hostnames or IP addresses)
#
hostgroup cluster ip100 ip101 ip110

hostgroup node1 ip100

hostgroup node2 ip101

hostgroup node3 ip110

hostgroup wwwservers www

hostgroup network tom jerry router-fe0

#
# For the servers in building 1, monitor ping and HTTP
# BOFH is on weekend call :)
#
watch wwwservers
    service ping
        description basic connectivity
        interval 10s
        monitor fping.monitor
        period wd {Sun-Sat}
        #alertevery 24h
        numalerts 1
        alert mail.alert root
        upalert mail.alert -S "web server host online again" root
        alertafter 3
    service http
        description HTTP response
        interval 15s
        monitor http.monitor
        period wd {Sun-Sat}
        #alertevery 24h
        numalerts 1
        depend wwwservers:ping
        alert mail.alert root
        upalert mail.alert -S "HTTP service is back" root

watch network
    service net
        description "network connectivity"
        interval 6s
        monitor fping.monitor
        period wd {Sun-Sat}
        numalerts 1
        alert mail.alert root
        alert file.alert /var/tmp/mon/log.d/alert.log "Lost network connectivity"
        alertafter 3

#
# catch traps from remote servers
#
watch node2
    service net
        description "TRAP: for ip101 connectivity"
        period wd {Sun-Sat}
        depend network:net
        numalerts 1
        alert mail.alert root
        alert file.alert /var/tmp/mon/log.d/alert.log "Public network interface has failed"




dejanmm at fastmail

Oct 29, 2007, 5:53 AM

Post #5 of 9
Re: Problem with Heartbeat [In reply to]

Hi,

On Sun, Oct 28, 2007 at 11:19:32PM -0300, welisson [at] conectcor wrote:
> Hi all.
>
>
> I have 2 servers set up as firewalls, with the following
> configuration.
>
> Server Master
> P4 3.0HT
> 2GB Ram
> 4 HD (2 used system and 2 to cache squid, firewall, Shaper and BGP-4)
> Motherboard Intel
>
>
> Server Slave
> P4 2.0
> 1GB Ram
> 2 HD
> Motherboard Intel without squid but used to firewall, shaper and BGP-4
>
> What happens is the following: I have heartbeat installed on both
> servers, and for some days now heartbeat has been going down and
> coming back up, as shown in the log below, recorded on the main
> server:
>
>
> Oct 22 21:10:53 gateway heartbeat[19084]: WARN: Late heartbeat: Node
> gateway2.domain.com.br: interval 12530 ms
> Oct 22 22:20:37 gateway heartbeat[19084]: WARN: node
> gateway2.domain.com.br: is dead
> Oct 22 22:20:37 gateway heartbeat[19084]: WARN: No STONITH device
> configured.
> Oct 22 22:20:37 gateway heartbeat[19084]: WARN: Shared disks are not
> protected.
> Oct 22 22:20:37 gateway heartbeat[19084]: info: Resources being
> acquired from gateway2.domain.com.br.
> Oct 22 22:20:37 gateway heartbeat[19084]: info: Link
> gateway2.domain.com.br:/dev/ttyS0 dead.
> Oct 22 22:20:38 gateway heartbeat: info: Running /etc/ha.d/rc.d/status
> status
> Oct 22 22:20:38 gateway heartbeat: info: /usr/lib/heartbeat/mach_down:
> nice_failback: foreign resources acquired
> Oct 22 22:20:42 gateway heartbeat[19084]: WARN: Cluster node
> gateway2.domain.com.br returning after partition.
> Oct 22 22:20:42 gateway heartbeat[19084]: WARN: Deadtime value may be
> too small.
> Oct 22 22:20:42 gateway heartbeat[19084]: info: See documentation for
> information on tuning deadtime.
> Oct 22 22:20:42 gateway heartbeat[19084]: info: Link
> gateway2.domain.com.br:/dev/ttyS0 up.
> Oct 22 22:20:42 gateway heartbeat[19084]: WARN: Late heartbeat: Node
> gateway2.domain.com.br: interval 35790 ms

This indicates one of three possible problems: flaky
communications, high load, or a kernel scheduler problem.
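To see how often and how badly heartbeats arrive late, the "Late heartbeat" warnings can be pulled out of the log. The following is a rough sketch assuming the message format shown in the log above; the function name is made up, and the quote-prefix stripping is only there so that it also works on text pasted from an e-mail.

```python
import re

# Leading "> " markers from quoted e-mail text, possibly nested.
QUOTE_PREFIX = re.compile(r"^(?:>[ \t]?)+", re.MULTILINE)
# \s+ also matches the line break that mail wrapping puts after "Node".
LATE = re.compile(r"Late heartbeat: Node\s+(\S+):\s+interval\s+(\d+)\s+ms")


def late_heartbeats(log_text):
    """Return (node, interval_ms) for every 'Late heartbeat' warning found."""
    unquoted = QUOTE_PREFIX.sub("", log_text)
    return [(node, int(ms)) for node, ms in LATE.findall(unquoted)]


sample = """\
> Oct 22 21:10:53 gateway heartbeat[19084]: WARN: Late heartbeat: Node
> gateway2.domain.com.br: interval 12530 ms
> Oct 22 22:20:42 gateway heartbeat[19084]: WARN: Late heartbeat: Node
> gateway2.domain.com.br: interval 35790 ms
"""

events = late_heartbeats(sample)
worst_ms = max(ms for _, ms in events)
print(events, worst_ms)
```

The largest interval found can then be compared with the deadtime configured in ha.cf, which is what the "Deadtime value may be too small" warning in the log is pointing at.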

Thanks,

Dejan

> Oct 22 22:20:42 gateway heartbeat[19084]: info: Status update for node
> gateway2.domain.com.br: status active
> Oct 22 22:20:42 gateway heartbeat[19084]: info: mach_down takeover complete.
> Oct 22 22:20:42 gateway heartbeat: info: mach_down takeover complete
> for node gateway2.domain.com.br.
> Oct 22 22:20:42 gateway heartbeat[14883]: info: Local Resource
> acquisition completed.
> Oct 22 22:20:42 gateway heartbeat: info: Running /etc/ha.d/rc.d/status
> status
> Oct 22 22:20:44 gateway heartbeat[19084]: info: Heartbeat shutdown in
> progress. (19084)
> Oct 22 22:20:44 gateway heartbeat[16667]: info: Giving up all HA resources.
> Oct 22 22:20:44 gateway heartbeat: info: Releasing resource group:
> gateway.domain.com.br 200.xxx.xxx.xxx/30/eth0 200.xxx.xxx.x6/30/eth1
> 200.xxx.xxx.x7/29/eth2 firewall shaper
> Oct 22 22:20:44 gateway heartbeat: info: Running /etc/init.d/shaper stop
> Oct 22 22:20:46 gateway heartbeat: info: Running /etc/init.d/firewall stop
> Oct 22 22:20:46 gateway heartbeat: info: Running
> /etc/ha.d/resource.d/IPaddr 200.xxx.xxx.x7/29/eth2 stop
> Oct 22 22:20:47 gateway heartbeat: info: Running
> /etc/ha.d/resource.d/IPaddr 200.xxx.xxx.x6/30/eth1 stop
> Oct 22 22:20:47 gateway heartbeat: info: /sbin/route -n del -host
> 200.xxx.xxx.x6
> Oct 22 22:20:47 gateway heartbeat: info: /sbin/ifconfig eth1:0 down
> Oct 22 22:20:47 gateway heartbeat: info: IP Address 200.xxx.xxx.x6 released
> Oct 22 22:20:47 gateway heartbeat: info: Running
> /etc/ha.d/resource.d/IPaddr 200.xxx.xxx.xxx/30/eth0 stop
> Oct 22 22:20:47 gateway heartbeat[16667]: info: All HA resources
> relinquished.
> Oct 22 22:20:47 gateway heartbeat[19084]: WARN: 1 lost packet(s) for
> [gateway2.domain.com.br] [239455:239457]
> Oct 22 22:20:47 gateway heartbeat[19084]: info: No pkts missing from
> gateway2.domain.com.br!
> Oct 22 22:20:48 gateway heartbeat[19084]: info: killing HBFIFO process
> 19086 with signal 15
> Oct 22 22:20:48 gateway heartbeat[19084]: info: killing HBWRITE process
> 19087 with signal 15
> Oct 22 22:20:48 gateway heartbeat[19084]: info: killing HBREAD process
> 19088 with signal 15
> Oct 22 22:20:48 gateway heartbeat[19084]: info: Core process 19088
> exited. 3 remaining
> Oct 22 22:20:48 gateway heartbeat[19084]: info: Core process 19086
> exited. 2 remaining
> Oct 22 22:20:48 gateway heartbeat[19084]: info: Core process 19087
> exited. 1 remaining
> Oct 22 22:20:48 gateway heartbeat[19084]: info: Heartbeat shutdown complete.
> Oct 22 22:20:48 gateway heartbeat[19084]: info: Heartbeat restart triggered.
> Oct 22 22:20:48 gateway heartbeat[19084]: info: Restarting heartbeat.
> Oct 22 22:20:48 gateway heartbeat[19084]: info: Performing heartbeat
> restart exec.
> Oct 22 22:21:19 gateway heartbeat[19084]: info: **************************
> Oct 22 22:21:19 gateway heartbeat[19084]: info: Configuration
> validated. Starting heartbeat 1.2.5
> Oct 22 22:21:19 gateway heartbeat[19947]: info: heartbeat: version 1.2.5
> Oct 22 22:21:19 gateway heartbeat[19947]: info: Heartbeat generation: 23
> Oct 22 22:21:20 gateway heartbeat[19947]: info: Starting serial
> heartbeat on tty /dev/ttyS0 (19200 baud)
> Oct 22 22:21:20 gateway heartbeat[19947]: info: pid 19947 locked in memory.
> Oct 22 22:21:20 gateway heartbeat[19947]: info: Local status now set to:
> 'up'
> Oct 22 22:21:21 gateway heartbeat[19949]: info: pid 19949 locked in memory.
> Oct 22 22:21:21 gateway heartbeat[19950]: info: pid 19950 locked in memory.
> Oct 22 22:21:21 gateway heartbeat[19951]: info: pid 19951 locked in memory.
> Oct 22 22:21:21 gateway heartbeat[19947]: WARN: string2msg_ll: node
> [gateway2.domain.com.br] failed authentication
> Oct 22 22:21:22 gateway heartbeat[19947]: info: Link
> gateway2.domain.com.br:/dev/ttyS0 up.
> Oct 22 22:21:22 gateway heartbeat[19947]: info: Status update for node
> gateway2.domain.com.br: status active
> Oct 22 22:21:22 gateway heartbeat[19947]: info: Local status now set
> to: 'active'
> Oct 22 22:21:22 gateway heartbeat: info: Running /etc/ha.d/rc.d/status
> status
> Oct 22 22:21:22 gateway heartbeat[19947]: info: remote resource
> transition completed.
> Oct 22 22:21:22 gateway heartbeat[19947]: info: remote resource
> transition completed.
> Oct 22 22:21:22 gateway heartbeat[19947]: info: Local Resource
> acquisition completed. (none)
> Oct 22 22:21:23 gateway heartbeat[19947]: info: gateway2.domain.com.br
> wants to go standby [foreign]
> Oct 22 22:21:35 gateway heartbeat[19947]: info: standby: acquire
> [foreign] resources from gateway2.domain.com.br
> Oct 22 22:21:35 gateway heartbeat[19956]: info: acquire local HA
> resources (standby).
> Oct 22 22:21:35 gateway heartbeat: info: Acquiring resource group:
> gateway.domain.com.br 200.xxx.xxx.xxx/30/eth0 200.xxx.xxx.x6/30/eth1
> 200.xxx.xxx.x7/29/eth2 firewall shaper
> Oct 22 22:21:35 gateway heartbeat: info: Running
> /etc/ha.d/resource.d/IPaddr 200.xxx.xxx.xxx/30/eth0 start
> Oct 22 22:21:35 gateway heartbeat: info: /sbin/ifconfig eth0:0
> 200.xxx.xxx.xxx netmask 255.255.255.252 broadcast 200.208.220.131
> Oct 22 22:21:35 gateway heartbeat: info: Sending Gratuitous Arp for
> 200.xxx.xxx.xxx on eth0:0 [eth0]
> Oct 22 22:21:35 gateway heartbeat: /usr/lib/heartbeat/send_arp -i 1010
> -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-200.xxx.xxx.xxx
> eth0 200.xxx.xxx.xxx auto 200.xxx.xxx.xxx ffffffffffff
> Oct 22 22:21:35 gateway heartbeat: info: Running
> /etc/ha.d/resource.d/IPaddr 200.xxx.xxx.x6/30/eth1 start
> Oct 22 22:21:35 gateway heartbeat: info: /sbin/ifconfig eth1:0
> 200.xxx.xxx.x6 netmask 255.255.255.252 broadcast 200.208.223.67
> Oct 22 22:21:35 gateway heartbeat: info: Sending Gratuitous Arp for
> 200.xxx.xxx.x6 on eth1:0 [eth1]
> Oct 22 22:21:35 gateway heartbeat: /usr/lib/heartbeat/send_arp -i 1010
> -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-200.xxx.xxx.x6 eth1
> 200.xxx.xxx.x6 auto 200.xxx.xxx.x6 ffffffffffff
> Oct 22 22:21:36 gateway heartbeat: info: Running
> /etc/ha.d/resource.d/IPaddr 200.xxx.xxx.x7/29/eth2 start
> Oct 22 22:21:36 gateway heartbeat: info: /sbin/ifconfig eth2:0
> 200.xxx.xxx.x7 netmask 255.255.255.248 broadcast 200.208.220.151
> Oct 22 22:21:36 gateway heartbeat: info: Sending Gratuitous Arp for
> 200.xxx.xxx.x7 on eth2:0 [eth2]
> Oct 22 22:21:36 gateway heartbeat: /usr/lib/heartbeat/send_arp -i 1010
> -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-200.xxx.xxx.x7 eth2
> 200.xxx.xxx.x7 auto 200.xxx.xxx.x7 ffffffffffff
> Oct 22 22:21:36 gateway heartbeat: info: Running /etc/init.d/firewall start
> Oct 22 22:21:36 gateway heartbeat: info: Running /etc/init.d/shaper start
> Oct 22 22:21:41 gateway heartbeat[19956]: info: local HA resource
> acquisition completed (standby).
> Oct 22 22:21:41 gateway heartbeat[19947]: info: Standby resource
> acquisition done [foreign].
> Oct 22 22:21:41 gateway heartbeat[19947]: info: Initial resource
> acquisition complete (auto_failback)
> Oct 22 22:21:41 gateway heartbeat[19947]: info: remote resource
> transition completed.
>
> ----------------------------------------------------------------
> Conectcor - velocidade com qualidade
> www.conectcor.com.br
>
>
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA [at] lists
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems


welisson at conectcor

Oct 29, 2007, 7:52 AM

Post #6 of 9
Re: Problem with Heartbeat [In reply to]

But before this I used Conectiva 10 without problems. Later I formatted
the machine and installed Debian 4 (etch), also without problems, and it
ran normally under the same load that the Conectiva 10 system had. This
has only been happening for the last 2 weeks.
So it could be load, but I doubt that is the cause, because it also
happens during quiet periods.

Regards

Welisson


welisson at conectcor

Nov 5, 2007, 7:31 AM

Post #7 of 9
Re: Problem with Heartbeat [In reply to]

Hi all,

I still have the same problem with heartbeat, as described in my earlier
e-mail in this thread.
I tested the cable and increased the value of deadtime, but neither
solved it.
I would like to know whether this could be a kernel-related problem,
because on the main server I am using kernel 2.6.18, the Debian etch
default, and on the Conectiva 10 machine (secondary) a self-compiled
2.6.12.2.

What could the problem be in relation to the kernel?


Regards

Welisson


> > acquisition complete (auto_failback)
> > Oct 22 22:21:41 gateway heartbeat[19947]: info: remote resource
> > transition completed.
> >
> > ----------------------------------------------------------------
> > Conectcor - velocidade com qualidade
> > www.conectcor.com.br
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


yan at fitterer

Nov 5, 2007, 7:36 AM

Post #8 of 9
Re: Problem with Heartbeat

Welisson wrote:
> Hi all,
>
> I have the same problem with heartbeat, as described in the e-mail below.
> I tested the cable and increased the value of deadtime, but neither
> resolved it. Could this be a kernel-related problem? The main server runs
> kernel 2.6.18 (the Debian etch default), and the secondary runs
> Conectiva 10 with a self-compiled 2.6.12.2.
>
> What could the kernel-related cause be?

As Dejan already said:

> This indicates one of three possible problems: flaky
> communications, high load, or a kernel scheduler problem.

So yes, it _could_ be a kernel issue. Have you ruled out the other two
possible causes? If not, you should probably start there: they are
typically easier to identify and fix, and, if relevant, they MUST be fixed
if you want a stable cluster anyway. If comms are clean and load is not
the problem, then revisit the kernel issue.
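
One way to start ruling out flaky comms or high load is to pull every
"Late heartbeat" warning out of the syslog and look at how large the gaps
get and when they happen. The Python sketch below is only an illustration,
not part of heartbeat itself. It assumes each syslog entry sits on a single
unwrapped line in the same format as the log lines posted in this thread;
the names `late_heartbeats`, `summarize` and `SAMPLE` are made up for the
example.

```python
import re

# Matches unwrapped syslog lines such as:
#   Oct 22 21:10:53 gateway heartbeat[19084]: WARN: Late heartbeat: Node gateway2.domain.com.br: interval 12530 ms
# (format taken from the log in this thread; other heartbeat versions may differ)
LATE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+heartbeat\[\d+\]:\s+"
    r"WARN: Late heartbeat: Node (?P<node>\S+): interval (?P<ms>\d+) ms"
)


def late_heartbeats(lines):
    """Yield (timestamp_text, node, interval_ms) for each late-heartbeat warning."""
    for line in lines:
        m = LATE_RE.match(line)
        if m:
            yield m.group("stamp"), m.group("node"), int(m.group("ms"))


def summarize(lines):
    """Return how many late heartbeats were logged and the longest interval seen."""
    events = list(late_heartbeats(lines))
    return {
        "count": len(events),
        "worst_ms": max((ms for _, _, ms in events), default=0),
    }


# Sample input: three entries copied from the log above, joined onto one line each.
SAMPLE = [
    "Oct 22 21:10:53 gateway heartbeat[19084]: WARN: Late heartbeat: Node gateway2.domain.com.br: interval 12530 ms",
    "Oct 22 22:20:37 gateway heartbeat[19084]: WARN: node gateway2.domain.com.br: is dead",
    "Oct 22 22:20:42 gateway heartbeat[19084]: WARN: Late heartbeat: Node gateway2.domain.com.br: interval 35790 ms",
]

if __name__ == "__main__":
    for stamp, node, ms in late_heartbeats(SAMPLE):
        print(f"{stamp}  {node}  {ms / 1000:.1f} s between heartbeats")
    print(summarize(SAMPLE))
```

If the longest interval is near or above the configured deadtime, the
"node is dead" / "returning after partition" sequence in the log is what you
would expect to see. Comparing the timestamps against periods of heavy load
on either box may then help tell a load problem from a link problem.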




welisson at conectcor

Nov 5, 2007, 9:27 AM

Post #9 of 9
Re: Problem with Heartbeat

Thanks.
I will check my kernel version and compile a different
kernel version.

Regards

Welisson

