
dejanmm at fastmail
Jun 22, 2009, 3:31 AM
Post #3 of 4
Hi,

On Mon, Jun 22, 2009 at 08:52:49AM +0200, Sebastian Vieira wrote:
> Hi,
>
> We have a pair of heartbeat nodes, Balancer1 and 2, that have their NICs
> set up as follows:
>
> eth0 and eth2 are bonded as bond0 for the user LAN
> eth1 and eth3 are bonded as bond1 for the heartbeats
>
> Between eth1 on ha01 and ha02 there's a cross cable. The link between the
> two eth3's runs over a switch.
>
> Ipfail is configured to ping the default gw which is reachable over bond0.
>
> In a test we unplugged the cable from eth1 on Balancer2, waited about 10s
> and reinstalled it. Then we did the same with eth3. After that, ipfail
> started to display some weird messages. We did the same with eth0 and eth2
> but that went fine:
>
> Jun 18 13:54:25 Balancer2 ipfail: [10570]: info: Ping node count is balanced.
> Jun 18 13:55:36 Balancer2 kernel: bnx2: eth1 NIC Copper Link is Down
> Jun 18 13:55:36 Balancer2 kernel: bonding: bond1: link status definitely down for interface eth1, disabling it
> Jun 18 13:55:54 Balancer2 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
> Jun 18 13:55:54 Balancer2 kernel: bonding: bond1: link status definitely up for interface eth1.
> Jun 18 13:56:06 Balancer2 kernel: bnx2: eth3 NIC Copper Link is Down
> Jun 18 13:56:06 Balancer2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
> Jun 18 13:56:06 Balancer2 kernel: bonding: bond1: making interface eth1 the new active one.
> Jun 18 13:56:16 Balancer2 ipfail: [10570]: info: Status update: Node Balancer1.amg.local now has status dead
> Jun 18 13:56:17 Balancer2 ipfail: [10570]: info: NS: We are still alive!
> Jun 18 13:56:17 Balancer2 ipfail: [10570]: info: Link Status update: Link Balancer1.amg.local/bond1 now has status dead
> Jun 18 13:56:19 Balancer2 ipfail: [10570]: info: Asking other side for ping node count.
> Jun 18 13:56:19 Balancer2 ipfail: [10570]: info: Checking remote count of ping nodes.
> Jun 18 13:56:20 Balancer2 kernel: bnx2: eth3 NIC Copper Link is Up, 100 Mbps full duplex
> Jun 18 13:56:20 Balancer2 kernel: bonding: bond1: link status definitely up for interface eth3.
> Jun 18 13:56:26 Balancer2 kernel: bnx2: eth0 NIC Copper Link is Down
> Jun 18 13:56:26 Balancer2 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
> Jun 18 13:56:26 Balancer2 kernel: bonding: bond0: making interface eth2 the new active one.
> Jun 18 13:56:39 Balancer2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
> Jun 18 13:56:39 Balancer2 kernel: bonding: bond0: link status definitely up for interface eth0.
> Jun 18 13:57:03 Balancer2 kernel: bnx2: eth2 NIC Copper Link is Down
> Jun 18 13:57:03 Balancer2 kernel: bonding: bond0: link status definitely down for interface eth2, disabling it
> Jun 18 13:57:03 Balancer2 kernel: bonding: bond0: making interface eth0 the new active one.
> Jun 18 13:57:21 Balancer2 kernel: bnx2: eth2 NIC Copper Link is Up, 1000 Mbps full duplex
> Jun 18 13:57:21 Balancer2 kernel: bonding: bond0: link status definitely up for interface eth2.
>
> This is from Balancer1:
>
> Jun 18 13:54:13 Balancer1 logd: [7710]: info: logd started with default configuration.
> Jun 18 13:54:13 Balancer1 logd: [7711]: info: G_main_add_SignalHandler: Added signal handler for signal 15
> Jun 18 13:54:13 Balancer1 logd: [7710]: info: G_main_add_SignalHandler: Added signal handler for signal 15
> Jun 18 13:54:21 Balancer1 ipfail: [7816]: info: Status update: Node Balancer2.amg.local now has status active
> Jun 18 13:54:22 Balancer1 ipfail: [7816]: info: Asking other side for ping node count.
> Jun 18 13:54:25 Balancer1 ipfail: [7816]: info: No giveup timer to abort.
> Jun 18 13:55:36 Balancer1 kernel: bnx2: eth1 NIC Copper Link is Down
> Jun 18 13:55:36 Balancer1 kernel: bonding: bond1: link status definitely down for interface eth1, disabling it
> Jun 18 13:55:54 Balancer1 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
> Jun 18 13:55:54 Balancer1 kernel: bonding: bond1: link status definitely up for interface eth1.
> Jun 18 13:56:16 Balancer1 ipfail: [7816]: info: Status update: Node Balancer2.amg.local now has status dead
> Jun 18 13:56:16 Balancer1 ipfail: [7816]: info: NS: We are still alive!
> Jun 18 13:56:17 Balancer1 ipfail: [7816]: info: Link Status update: Link Balancer2.amg.local/bond1 now has status dead
> Jun 18 13:56:18 Balancer1 ipfail: [7816]: info: Asking other side for ping node count.
> Jun 18 13:56:18 Balancer1 ipfail: [7816]: info: Checking remote count of ping nodes.
>
> Now, everything went okay and a failover didn't occur but the ipfail
> messages are strange; why suddenly declare the node as dead after more than
> 20s after both slaves of bond1 were up again? It doesn't make sense (to me).
> And why did ipfail immediately say 'NS: We are still alive!'.
>
> Anyone?

Strange. For one thing, I would expect the bonding driver to mask
these events. ipfail receives events through a callback mechanism
from heartbeat, but we don't see heartbeat complaining either.
NS stands for node status. Did you try turning debug on? That
might give more information.

Thanks,

Dejan

> Regards,
>
> Sebastian
> _______________________________________________
> Linux-HA mailing list
> Linux-HA[at]lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
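To put numbers on the timing question, here is a minimal Python sketch that measures the gaps between a few of the Balancer2 log lines quoted in this post. The helper names (parse, seconds_between) are made up for illustration, the wrapped log lines have been joined by hand, and the year 2009 is assumed from the post date because syslog timestamps carry none.

```python
from datetime import datetime

# Four lines copied from the Balancer2 log quoted in this post
# (re-joined where the mail client had wrapped them).
LOG = """\
Jun 18 13:55:54 Balancer2 kernel: bonding: bond1: link status definitely up for interface eth1.
Jun 18 13:56:06 Balancer2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
Jun 18 13:56:16 Balancer2 ipfail: [10570]: info: Status update: Node Balancer1.amg.local now has status dead
Jun 18 13:56:20 Balancer2 kernel: bonding: bond1: link status definitely up for interface eth3.
"""


def parse(line, year=2009):
    """Split a syslog line into (timestamp, rest). The year is an assumption."""
    stamp, rest = line[:15], line[16:]
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, rest


def seconds_between(log, first_marker, second_marker):
    """Seconds from the first line containing first_marker to the
    first later line containing second_marker, or None if not found."""
    start = None
    for line in log.splitlines():
        when, rest = parse(line)
        if start is None:
            if first_marker in rest:
                start = when
        elif second_marker in rest:
            return (when - start).total_seconds()
    return None


if __name__ == "__main__":
    dead = "now has status dead"
    print(seconds_between(LOG, "definitely up for interface eth1", dead))
    print(seconds_between(LOG, "definitely down for interface eth3", dead))
    print(seconds_between(LOG, dead, "definitely up for interface eth3"))
```

On these lines, the "dead" status on Balancer2 arrives 22 s after eth1 came back, 10 s after eth3 went down, and 4 s before eth3 came back. It may be worth comparing the 10 s gap with the deadtime setting in ha.cf, which is not shown in this thread.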