
Mailing List Archive: Linux-HA: Users

Unexpected nice_failback behaviour



keith at midnighthax

Nov 10, 2004, 4:19 AM

Post #1 of 4 (748 views)
Unexpected nice_failback behaviour

I have a two node cluster with nodes alpha and bravo. Alpha is configured
as the preferred node in haresources. nice_failback is on.

If I stop heartbeat on whichever is the master, the cluster fails over. If
I restart heartbeat the cluster does not fail back, and that is as
expected.

If I have bravo as the master, and I pull the network cables out of bravo,
the cluster fails over to alpha. If I replace the network cables, the
cluster does not fail back, and that is as expected.

If I have alpha as the master and remove the network cables from alpha, the
cluster fails over to bravo. However, if I then replace the network cables,
the cluster fails back to alpha. This is not what I expected.

Am I expecting the wrong thing? Or have I configured something incorrectly
somewhere?

Thanks,
Keith
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha


alanr at unix

Nov 10, 2004, 7:49 AM

Post #2 of 4 (716 views)
Re: Unexpected nice_failback behaviour [In reply to]

Keith Edmunds wrote:
> I have a two node cluster with nodes alpha and bravo. Alpha is configured
> as the preferred node in haresources. nice_failback is on.
>
> If I stop heartbeat on whichever is the master, the cluster fails over. If
> I restart heartbeat the cluster does not fail back, and that is as
> expected.
>
> If I have bravo as the master, and I pull the network cables out of bravo,
> the cluster fails over to alpha. If I replace the network cables, the
> cluster does not fail back, and that is as expected.

If you pull all communications links out of bravo, then you've created a
split-brain. Putting them back should cause both systems to restart
heartbeat and reacquire resources.

> If I have alpha as the master and remove the network cables from alpha, the
> cluster fails over to bravo. However, if I then replace the network cables,
> the cluster fails back to alpha. This is not what I expected.

See comment above.

> Am I expecting the wrong thing? Or have I configured something incorrectly
> somewhere?

First, having your ha.cf configuration, the version of heartbeat (probably
an older one) you're using and being more precise about what cables you
pulled and which ones you left in place (if any) would be good.

Upgrading to 1.2.x would also be good. 1.2.3 is the current version. We
like it.

--
Alan Robertson <alanr [at] unix>

"Openness is the foundation and preservative of friendship... Let me claim
from you at all times your undisguised opinions." - William Wilberforce


keith at midnighthax

Nov 10, 2004, 9:18 AM

Post #3 of 4 (710 views)
Re: Unexpected nice_failback behaviour [In reply to]

On Wed, 10 Nov 2004 08:49:46 -0700
Alan Robertson <alanr [at] unix> wrote:

> If you pull all communications links out of bravo, then you've created a
> split-brain. Putting them back should cause both systems to restart
> heartbeat and reacquire resources.

That makes sense, thanks.

> First, having your ha.cf configuration, the version of heartbeat
> (probably an older one) you're using and being more precise about what
> cables you pulled and which ones you left in place (if any) would be
> good.

Setup: alpha and bravo are directors, dual NICs, "public" subnet
10.0.0.0/24. Real servers are charlie and delta, subnet 192.168.6.0/24.

ha.cf, decommented:

===========================
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 30
warntime 10
initdead 120
nice_failback on
udpport 694
bcast eth0 eth1
mcast eth0 225.0.0.1 694 1 0
ucast eth0 192.168.6.2
node alpha
node bravo
respawn hacluster /usr/lib/heartbeat/ipfail
ping 192.168.6.4 192.168.6.5
===========================

Heartbeat: 1.0.4-1.woody.

Cables pulled: both NICs on alpha while alpha is master. Cluster fails
over, successfully, but fails back when cables replaced.

> Upgrading to 1.2.x would also be good. 1.2.3 is the current version. We
> like it.

I was using the Debian package, but I can compile a later heartbeat by
hand if that will help.

Thanks for your help, it's appreciated.

Keith


alanr at unix

Nov 10, 2004, 10:33 AM

Post #4 of 4 (714 views)
Re: Unexpected nice_failback behaviour [In reply to]

Keith Edmunds wrote:
> On Wed, 10 Nov 2004 08:49:46 -0700
> Alan Robertson <alanr [at] unix> wrote:
>
> > If you pull all communications links out of bravo, then you've created a
> > split-brain. Putting them back should cause both systems to restart
> > heartbeat and reacquire resources.
>
> That makes sense, thanks.


Since you pulled all your cables, the restart will cause all resources to
be reacquired.

Which node ends up running which resources is unpredictable, because of
timing issues.

Split-brain is a bad thing. Best avoided when you can - because it can
damage or lose information - particularly when you're using shared storage.

Multiple redundant heartbeat connections, using ipfail (with ping nodes),
and configuring STONITH make for a good configuration.

Upgrading to 1.2.x won't make things a lot better - but there are a number
of mainly-subtle bug fixes, and some nice features that come with 1.2.3.

I think horms has 1.2.3 debian packages around. See the download directory
for the location.
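
To make the advice above concrete, here is a rough ha.cf sketch of that kind
of setup. It reuses the node names, timings, interfaces and ping addresses
from the configuration posted earlier in the thread. The serial link
(/dev/ttyS0 over a null-modem cable) is an assumption added here as a
heartbeat path that survives pulling the network cables, and the STONITH line
is left as a commented placeholder because the device type and its parameters
depend entirely on the hardware. The auto_failback line assumes a 1.2.x
heartbeat, where as far as I know "auto_failback off" takes the place of
"nice_failback on"; check the documentation for your version before relying
on it.

```
# Sketch only, not a tested configuration.
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 30
warntime 10
initdead 120
udpport 694

# Redundant heartbeat paths: the network interfaces from the original
# config, plus a serial link (assumed hardware) that stays up when the
# network cables are removed.
bcast eth0 eth1
serial /dev/ttyS0
baud 19200

# Assumes heartbeat 1.2.x; on 1.0.x keep "nice_failback on" instead.
auto_failback off

node alpha
node bravo

# ipfail with ping nodes, as in the original config.
ping 192.168.6.4 192.168.6.5
respawn hacluster /usr/lib/heartbeat/ipfail

# STONITH placeholder: fill in the type and parameters for your device.
# stonith_host * <stonith-type> <parameters...>
```

With a path like the serial link in place, pulling both network cables from
one node should no longer leave the two nodes unable to see each other, which
is the condition that produced the split-brain and the unpredictable
reacquisition described above.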




--
Alan Robertson <alanr [at] unix>

