
Mailing List Archive: Linux-HA: Users

strange ipfail message

 

 



sebvieira at gmail

Jun 21, 2009, 11:52 PM

Post #1 of 4
strange ipfail message

Hi,

We have a pair of heartbeat nodes, Balancer1 and Balancer2, that have their
NICs set up as follows:

eth0 and eth2 are bonded as bond0 for the user LAN
eth1 and eth3 are bonded as bond1 for the heartbeats

Between eth1 on ha01 and ha02 there's a cross cable. The link between the
two eth3's runs over a switch.

Ipfail is configured to ping the default gw which is reachable over bond0.

In a test we unplugged the cable from eth1 on Balancer2, waited about 10s
and plugged it back in. Then we did the same with eth3. After that, ipfail
started to display some weird messages. We did the same with eth0 and eth2,
but that went fine:

Jun 18 13:54:25 Balancer2 ipfail: [10570]: info: Ping node count is
balanced.
Jun 18 13:55:36 Balancer2 kernel: bnx2: eth1 NIC Copper Link is Down
Jun 18 13:55:36 Balancer2 kernel: bonding: bond1: link status definitely
down for interface eth1, disabling it
Jun 18 13:55:54 Balancer2 kernel: bnx2: eth1 NIC Copper Link is Up, 1000
Mbps full duplex, receive & transmit flow control ON
Jun 18 13:55:54 Balancer2 kernel: bonding: bond1: link status definitely up
for interface eth1.
Jun 18 13:56:06 Balancer2 kernel: bnx2: eth3 NIC Copper Link is Down
Jun 18 13:56:06 Balancer2 kernel: bonding: bond1: link status definitely
down for interface eth3, disabling it
Jun 18 13:56:06 Balancer2 kernel: bonding: bond1: making interface eth1 the
new active one.
Jun 18 13:56:16 Balancer2 ipfail: [10570]: info: Status update: Node
Balancer1.amg.local now has status dead
Jun 18 13:56:17 Balancer2 ipfail: [10570]: info: NS: We are still alive!
Jun 18 13:56:17 Balancer2 ipfail: [10570]: info: Link Status update: Link
Balancer1.amg.local/bond1 now has status dead
Jun 18 13:56:19 Balancer2 ipfail: [10570]: info: Asking other side for ping
node count.
Jun 18 13:56:19 Balancer2 ipfail: [10570]: info: Checking remote count of
ping nodes.
Jun 18 13:56:20 Balancer2 kernel: bnx2: eth3 NIC Copper Link is Up, 100 Mbps
full duplex
Jun 18 13:56:20 Balancer2 kernel: bonding: bond1: link status definitely up
for interface eth3.
Jun 18 13:56:26 Balancer2 kernel: bnx2: eth0 NIC Copper Link is Down
Jun 18 13:56:26 Balancer2 kernel: bonding: bond0: link status definitely
down for interface eth0, disabling it
Jun 18 13:56:26 Balancer2 kernel: bonding: bond0: making interface eth2 the
new active one.
Jun 18 13:56:39 Balancer2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000
Mbps full duplex
Jun 18 13:56:39 Balancer2 kernel: bonding: bond0: link status definitely up
for interface eth0.
Jun 18 13:57:03 Balancer2 kernel: bnx2: eth2 NIC Copper Link is Down
Jun 18 13:57:03 Balancer2 kernel: bonding: bond0: link status definitely
down for interface eth2, disabling it
Jun 18 13:57:03 Balancer2 kernel: bonding: bond0: making interface eth0 the
new active one.
Jun 18 13:57:21 Balancer2 kernel: bnx2: eth2 NIC Copper Link is Up, 1000
Mbps full duplex
Jun 18 13:57:21 Balancer2 kernel: bonding: bond0: link status definitely up
for interface eth2.
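
Editorial aside, not part of the original post: the log above was wrapped by the mail client, which makes it awkward to grep. A minimal Python sketch for rejoining the wrapped lines and pulling out the NIC link changes might look like the following. The function names and the two regular expressions are assumptions written for this excerpt only; they are not part of heartbeat or of any log tool.

```python
import re

# A new syslog record starts with "Mon DD HH:MM:SS host "; any other line is
# treated as a continuation that the mail client wrapped.
RECORD_START = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \S+ ")

# Matches the bnx2 link messages as they appear in this thread (assumed format).
LINK_EVENT = re.compile(
    r"^\w{3} +\d+ (?P<time>\d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"bnx2: (?P<nic>eth\d+) NIC Copper Link is (?P<state>Up|Down)"
)


def unwrap(lines):
    """Join mail-wrapped continuation lines back onto their syslog record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def link_events(lines):
    """Return (time, host, nic, state) tuples for every NIC link change."""
    events = []
    for record in unwrap(lines):
        m = LINK_EVENT.match(record)
        if m:
            events.append((m["time"], m["host"], m["nic"], m["state"]))
    return events


# First few lines of the Balancer2 log, copied from the post.
SAMPLE = """\
Jun 18 13:55:36 Balancer2 kernel: bnx2: eth1 NIC Copper Link is Down
Jun 18 13:55:36 Balancer2 kernel: bonding: bond1: link status definitely
down for interface eth1, disabling it
Jun 18 13:55:54 Balancer2 kernel: bnx2: eth1 NIC Copper Link is Up, 1000
Mbps full duplex, receive & transmit flow control ON
""".splitlines()

if __name__ == "__main__":
    for event in link_events(SAMPLE):
        print(event)
```

Feeding the full log through `link_events` gives one tuple per cable pull or reconnect, which is easier to line up against the ipfail messages than the wrapped text.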

This is from Balancer1:

Jun 18 13:54:13 Balancer1 logd: [7710]: info: logd started with default
configuration.
Jun 18 13:54:13 Balancer1 logd: [7711]: info: G_main_add_SignalHandler:
Added signal handler for signal 15
Jun 18 13:54:13 Balancer1 logd: [7710]: info: G_main_add_SignalHandler:
Added signal handler for signal 15
Jun 18 13:54:21 Balancer1 ipfail: [7816]: info: Status update: Node
Balancer2.amg.local now has status active
Jun 18 13:54:22 Balancer1 ipfail: [7816]: info: Asking other side for ping
node count.
Jun 18 13:54:25 Balancer1 ipfail: [7816]: info: No giveup timer to abort.
Jun 18 13:55:36 Balancer1 kernel: bnx2: eth1 NIC Copper Link is Down
Jun 18 13:55:36 Balancer1 kernel: bonding: bond1: link status definitely
down for interface eth1, disabling it
Jun 18 13:55:54 Balancer1 kernel: bnx2: eth1 NIC Copper Link is Up, 1000
Mbps full duplex, receive & transmit flow control ON
Jun 18 13:55:54 Balancer1 kernel: bonding: bond1: link status definitely up
for interface eth1.
Jun 18 13:56:16 Balancer1 ipfail: [7816]: info: Status update: Node
Balancer2.amg.local now has status dead
Jun 18 13:56:16 Balancer1 ipfail: [7816]: info: NS: We are still alive!
Jun 18 13:56:17 Balancer1 ipfail: [7816]: info: Link Status update: Link
Balancer2.amg.local/bond1 now has status dead
Jun 18 13:56:18 Balancer1 ipfail: [7816]: info: Asking other side for ping
node count.
Jun 18 13:56:18 Balancer1 ipfail: [7816]: info: Checking remote count of
ping nodes.

Now, everything went okay and a failover didn't occur, but the ipfail
messages are strange: why suddenly declare the node dead more than 20s
after both slaves of bond1 were up again? It doesn't make sense (to me).
And why did ipfail immediately say 'NS: We are still alive!'?
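
Editorial aside, not part of the original post: the timing can be checked directly against the Balancer2 log quoted above. The sketch below is plain arithmetic on timestamps copied from that log; the event labels and helper names are made up for illustration.

```python
from datetime import datetime

# Selected bond1 events copied from the Balancer2 log (time, label).
EVENTS = [
    ("13:55:36", "eth1 down"),
    ("13:55:54", "eth1 up"),
    ("13:56:06", "eth3 down"),
    ("13:56:16", "ipfail: node Balancer1 dead"),
    ("13:56:20", "eth3 up"),
]


def seconds_between(start, end):
    """Signed number of seconds from start to end (same day, HH:MM:SS)."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


def offsets_from(reference_label, events=EVENTS):
    """Map each event label to its offset in seconds from the reference event."""
    ref_time = next(t for t, label in events if label == reference_label)
    return {label: seconds_between(ref_time, t) for t, label in events}


if __name__ == "__main__":
    for label, delta in offsets_from("ipfail: node Balancer1 dead").items():
        print(f"{delta:+4d}s  {label}")
```

On these timestamps, Balancer2 logs the dead status 22s after eth1 came back up and 10s after eth3 went down, 4s before eth3 is logged as up again. Whether that interval corresponds to a configured heartbeat timeout cannot be told from the thread, since the ha.cf is not shown.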

Anyone?

Regards,

Sebastian
_______________________________________________
Linux-HA mailing list
Linux-HA[at]lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


sebvieira at gmail

Jun 22, 2009, 1:43 AM

Post #2 of 4
Re: strange ipfail message

Just noticed this mail got sent out again. The first one bounced (loop
forever) but apparently still made it through. Apologies for the extra
mails; it was not meant to be a 'bump' of any sort.

S.



dejanmm at fastmail

Jun 22, 2009, 3:31 AM

Post #3 of 4
Re: strange ipfail message

Hi,

On Mon, Jun 22, 2009 at 08:52:49AM +0200, Sebastian Vieira wrote:
> Now, everything went okay and a failover didn't occur, but the ipfail
> messages are strange: why suddenly declare the node dead more than 20s
> after both slaves of bond1 were up again? It doesn't make sense (to me).
> And why did ipfail immediately say 'NS: We are still alive!'?
>
> Anyone?

Strange. For one thing, I would expect the bonding driver to mask
these events. ipfail receives events through a callback mechanism
from heartbeat, but we don't see heartbeat complaining either. NS
stands for node status. Did you try turning debug on? Perhaps
there would be more information.

Thanks,

Dejan



sebvieira at gmail

Jun 30, 2009, 12:00 AM

Post #4 of 4
Re: strange ipfail message

On Mon, Jun 22, 2009 at 12:31 PM, Dejan Muhamedagic <dejanmm[at]fastmail.fm> wrote:

> Strange. For one thing, I would expect the bonding driver to mask
> these events. ipfail receives events through a callback mechanism
> from heartbeat, but we don't see heartbeat complaining either. NS
> stands for node status. Did you try turning debug on? Perhaps
> there would be more information.
>

Hi,

Thanks for your reply, and sorry for my late follow-up; I've been a little
busy with work, kid and other stuff. Anyway, no debug log output appears
when this 'error' occurs.

regards,

Sebastian