Mailing List Archive: Linux-HA: Users

Two node cluster switchback

 

 



jbrackinshaw at gmail

Sep 30, 2009, 5:22 AM

Post #1 of 5 (957 views)
Two node cluster switchback

Hello,

I have a two node heartbeat setup on Centos 5.3.

The two nodes are in separate locations and connected only via
ethernet. Because of this we require that a human guarantee that a
node is dead before a switchover occurs. We use meatclient for this.

Automatic failback is turned off. We would like the primary node to do
all of the work unless we manually switch roles or the primary node
dies.

We recently had a network outage. We expected that the primary node
would stay active and keep providing services. Instead, the two nodes
switched roles while the network was being repaired.

I cannot understand how the role switching happened since we ran no
scripts manually (at least not at the start), and did not run
meatclient.

Can anyone help me understand why this happened?

I attach my log files.

Thanks,

JB
Attachments: box01.ha.cf (0.32 KB)
  box02.ha.cf (0.32 KB)
  box01.haresources (96 B)
  box02.haresources (96 B)
  box01.ha.log (21.2 KB)
  box02.ha.log (19.5 KB)


jbrackinshaw at gmail

Oct 2, 2009, 1:04 AM

Post #2 of 5 (903 views)
Re: Two node cluster switchback

Well this seems to normally work: HELP ME! :=)
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


lars.ellenberg at linbit

Oct 2, 2009, 1:17 AM

Post #3 of 5 (899 views)
Re: Two node cluster switchback

On Wed, Sep 30, 2009 at 02:22:32PM +0200, James Brackinshaw wrote:
> Can anyone help me understand why this happened?

Connectivity came back.

> I attach my log files.

I did not have a look.

But you are likely to find
WARN node whatever-xy: is dead
...
CRIT: Cluster node whatever-xy returning after partition.
WARN: Deadtime value may be too small
...

this is handled by the cluster software by stopping all resources,
then starting on the "preferred" node.
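
As a rough sketch of how one might look for these messages, the Python snippet below scans log lines for the three messages quoted above. The function name and the patterns are illustrative assumptions; the exact wording in a real ha.log may differ between heartbeat versions.

```python
import re

# Patterns modelled on the messages quoted above. The exact wording in a
# real ha.log is an assumption and may vary between heartbeat versions.
PARTITION_PATTERNS = {
    "dead": re.compile(r"WARN.*node\s+\S+?:\s+is dead"),
    "rejoin": re.compile(r"CRIT.*Cluster node\s+\S+\s+returning after partition"),
    "deadtime": re.compile(r"WARN.*Deadtime value may be too small"),
}


def find_partition_events(lines):
    """Return (line_number, event_kind, text) for each matching log line."""
    events = []
    for number, line in enumerate(lines, start=1):
        for kind, pattern in PARTITION_PATTERNS.items():
            if pattern.search(line):
                events.append((number, kind, line.rstrip("\n")))
                break  # one event kind per line is enough
    return events
```

It could be run against each node's log in turn, for example `find_partition_events(open("box01.ha.log"))`, to see whether a "dead" entry is followed by a "rejoin" entry.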

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


jbrackinshaw at gmail

Oct 2, 2009, 1:21 AM

Post #4 of 5 (905 views)
Re: Two node cluster switchback

On Fri, Oct 2, 2009 at 10:17 AM, Lars Ellenberg
<lars.ellenberg [at] linbit> wrote:
> this is handled by the cluster software by stopping all resources,
> then starting on the "preferred" node.
>

Thanks Lars. The first node is preferred. Services were on the first
node to start with; after the split, heartbeat migrated the services to
the second node (which should not happen without a meatclient
confirmation) and then back again. We used meatclient to avoid this
situation, so what did we do wrong?


lars.ellenberg at linbit

Oct 2, 2009, 2:11 AM

Post #5 of 5 (893 views)
Re: Two node cluster switchback

On Fri, Oct 02, 2009 at 10:21:53AM +0200, James Brackinshaw wrote:
> Thanks Lars. The first node is preferred. Services were on the first
> node to start with; after the split, heartbeat migrated the services to
> the second node (which should not happen without a meatclient
> confirmation) and then back again. We used meatclient to avoid this
> situation, so what did we do wrong?

Then my explanation was not quite applicable to this particular
situation; your old heartbeat setup handles it all a bit differently.
Probably both nodes scheduled a shutdown and restart
once they recognized the rejoin after the split,
but since one was already in the process of taking over resources,
just waiting for confirmation that the other node was dead,
it "deferred" that shutdown until "current resource activity finished".

If you read your logs slowly, comparing time stamps,
it will explain what exactly happened.
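
One way to do that comparison, as a minimal sketch: the Python snippet below merges the lines of both logs into a single timeline. It assumes each relevant line carries a timestamp of the form YYYY/MM/DD_HH:MM:SS and that both nodes' clocks were reasonably in sync; the function names are hypothetical, and lines without a timestamp are skipped.

```python
import re
from datetime import datetime

# Assumed timestamp layout: YYYY/MM/DD_HH:MM:SS somewhere in each line.
TIMESTAMP = re.compile(r"\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}")


def parse_log(name, lines):
    """Yield (timestamp, node_name, text) for every line with a timestamp."""
    for line in lines:
        match = TIMESTAMP.search(line)
        if match:
            stamp = datetime.strptime(match.group(0), "%Y/%m/%d_%H:%M:%S")
            yield stamp, name, line.rstrip("\n")


def merge_logs(logs):
    """Merge {node_name: lines} into one list ordered by timestamp."""
    entries = []
    for name, lines in logs.items():
        entries.extend(parse_log(name, lines))
    # sorted() is stable, so lines with equal timestamps keep their
    # original relative order.
    return sorted(entries, key=lambda entry: entry[0])
```

For example, `merge_logs({"box01": open("box01.ha.log"), "box02": open("box02.ha.log")})` would give one interleaved list that can be printed and read top to bottom.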

I don't think you did anything wrong. It's just that
heartbeat does not have many options to clean up the mess.

"meatware" is NOT there to prevent failovers,
but to confirm reset operations.
After the rejoin, there was nothing to confirm anymore,
as both nodes were able to talk to each other again.

Heartbeat (haresources) is not very flexible when handling
rejoins after partitions.

Pacemaker may or may not be able to handle such a situation
more to your liking, if configured appropriately.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.