
lars.ellenberg at linbit
Oct 2, 2009, 2:11 AM
Post #5 of 5
On Fri, Oct 02, 2009 at 10:21:53AM +0200, James Brackinshaw wrote:
> On Fri, Oct 2, 2009 at 10:17 AM, Lars Ellenberg
> <lars.ellenberg [at] linbit> wrote:
> > On Wed, Sep 30, 2009 at 02:22:32PM +0200, James Brackinshaw wrote:
> >> Hello,
> >>
> >> I have a two node heartbeat setup on Centos 5.3.
> >>
> >> The two nodes are in separate locations and connected only via
> >> ethernet. Because of this we require that a human guarantee that a
> >> node is dead before a switchover occurs. We use meatclient for this.
> >>
> >> Automatic failback is turned off. We would like the primary node to do
> >> all of the work unless we manually switch roles or the primary node
> >> dies.
> >>
> >> We recently had a network outage. We expected that the primary node
> >> would stay active and providing services. Instead, the two nodes
> >> switched roles while the network was being repaired.
> >>
> >> I cannot understand how the role switching happened since we ran no
> >> scripts manually (at least not at the start), and did not run
> >> meatclient.
> >>
> >> Can anyone help me understand why this happened?
> >
> > Connectivity came back.
> >
> >> I attach my log files.
> >
> > I did not have a look.
> >
> > But you are likely to find
> > WARN node whatever-xy: is dead
> > ...
> > CRIT: Cluster node whatever-xy returning after partition.
> > WARN: Deadtime value may be too small
> > ...
> >
> > this is handled by the cluster software by stopping all resources,
> > then starting on the "preferred" node.
> >
>
> Thanks Lars. The first node is preferred. Services are on the first
> node to start with, after the split it migrated the services to the
> second node (which should not happen without a meatclient
> confirmation) and then back again. We used meatclient to avoid the
> situation, so what did we do wrong?

Then my explanation was not quite applicable in this particular situation,
and your old heartbeat stuff handles it all a bit differently.

Probably both nodes scheduled a shutdown and restart once they recognized
the rejoin after the split. But since one was already in the process of
taking over resources, and was just waiting for confirmation that the other
node was dead, it "deferred" that shutdown until
"current resource activity finished".

If you read your logs slowly, comparing time stamps,
they will explain what exactly happened.

I don't think you did something wrong.
It's just that heartbeat does not have many options to clean up the mess.

"meatware" is NOT there to prevent failovers, but to confirm reset operations.
After the rejoin, there was nothing to confirm anymore,
as both nodes were able to talk to each other again.

Heartbeat (haresources) is not very flexible when handling rejoins
after partitions. Pacemaker may or may not be able to handle such a
situation more to your liking, if configured appropriately.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
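
For anyone who wants to do the suggested log comparison with a script rather than by eye, here is a rough Python sketch. It is not part of heartbeat. It only pulls out lines containing the three message fragments quoted above and merges them from both nodes into one timeline. The timestamp pattern (YYYY/MM/DD_HH:MM:SS somewhere in the line) and the function names are assumptions made for illustration; adjust them to whatever your logs actually look like.

```python
import re

# Message fragments taken from the thread; matched case-insensitively.
EVENT_RE = re.compile(
    r"is dead|returning after partition|deadtime value may be too small",
    re.IGNORECASE,
)

# ASSUMPTION: each log line carries a timestamp like 2009/09/30_14:22:32.
# This format sorts correctly as plain text.
STAMP_RE = re.compile(r"\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}")


def partition_events(node, lines):
    """Return (timestamp, node, line) for lines that mention a dead or rejoining node."""
    events = []
    for line in lines:
        stamp = STAMP_RE.search(line)
        if stamp and EVENT_RE.search(line):
            events.append((stamp.group(0), node, line.strip()))
    return events


def merged_timeline(logs):
    """Merge the events of several nodes into one list ordered by timestamp.

    `logs` maps a node name to an iterable of that node's log lines.
    """
    timeline = []
    for node, lines in logs.items():
        timeline.extend(partition_events(node, lines))
    return sorted(timeline)
```

To use it, pass something like `{"node1": open("node1.log"), "node2": open("node2.log")}` (hypothetical file names) to `merged_timeline` and print the result line by line. Seeing both nodes' "is dead" and "returning after partition" messages side by side should make the order of events easier to follow.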