
Mailing List Archive: Linux-HA: Users

2 node STONITH, beginner questions.

 

 



smith.c.tech at gmail

Nov 5, 2009, 9:29 AM

Post #1 of 5
2 node STONITH, beginner questions.

Hi All-

I have read all the documentation I can find, including Dejan's PDF on the
subject but still have some questions regarding STONITH. I have so far been
testing with the external SSH plugin just to get a feel for how it works and
will be using an APC SNMP device in production.

So far it has been much more straightforward than I had expected; however,
I have a couple of questions regarding certain scenarios that may occur and
how STONITH/stonithd reacts. If someone can weigh in and offer some insight,
it would help clear this up for me! This is all being tested on a two node
cluster with heartbeat + pacemaker.

- So far I have noticed that, when disconnecting all network devices on a
node, the STONITH survivor is the node that was DC before network
connections dropped. Is there a way to migrate the role of DC to another
node?

- When all resources are running on node1, and node2 is the DC, if I
unplug/`ifconfig down` heartbeat's interfaces on either node, node1 becomes
STONITH victim and resources are migrated to node2. After disconnection,
both nodes can reach the outside network and it would be okay for resources
to run on either. Is there a way to work with scoring/pingd/something-else
so that the node not running any resources becomes the victim to avoid
failover? Does resource scoring have influence on STONITH at all?

- With SNMP and other stonith plugins that require network connectivity, is
it to be assumed that a node that has lost network connectivity is as good as
dead, STONITH'd from the other node that is still able to reach the STONITH
device? From what I've found during my initial tests, when a node drops
from the network it attempts to STONITH the other but can't connect and
fails. Is this the way it is intended to work? Both nodes STONITH each
other and the one that succeeds wins?

Thanks in advance!
CSMITH
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


TSERONG at novell

Nov 5, 2009, 10:30 PM

Post #2 of 5
Re: 2 node STONITH, beginner questions.

>>> On 11/6/2009 at 04:29 AM, c smith <smith.c.tech [at] gmail> wrote:
> Hi All-
>
> I have read all the documentation I can find, including Dejan's PDF on the
> subject but still have some questions regarding STONITH. I have so far been
> testing with the external SSH plugin just to get a feel for how it works and
> will be using an APC SNMP device in production.
>
> So far it has been much more straight-forward than I had expected, however,
> I have a couple of questions regarding certain scenarios that may occur and
> how STONITH/stonithd reacts. If someone can weigh in and offer some insight,
> it would help clear this up for me! This is all being tested on a two node
> cluster with heartbeat + pacemaker.
>
> - So far I have noticed that, when disconnecting all network devices on a
> node, the STONITH survivor is the node that was DC before network
> connections dropped. Is there a way to migrate the role of DC to another
> node?

Well, you could stop heartbeat on the DC node before disconnecting the network, but you probably don't want to do that. For all practical purposes, it shouldn't matter where the DC is, or which node takes over resources after split-brain, provided one node is killed and/or the cluster is able to reform after STONITH.

> - When all resources are running on node1, and node2 is the DC, if I
> unplug/`ifconfig down` heartbeat's interfaces on either node, node1 becomes
> STONITH victim and resources are migrated to node2. After disconnection,
> both nodes can reach the outside network and it would be okay for resources
> to run on either. Is there a way to work with scoring/pingd/something-else
> so that the node not running any resources becomes the victim to avoid
> failover? Does resource scoring have influence on STONITH at all?

Resource scores don't influence STONITH. It's a question of "the other node looks like it's dead, I'd better make sure it's *really* dead, regardless of where the resources are".

> - With SNMP and other stonith plugins that require network connectivity, is
> it to be assumed that a node that has lost network connectivity is as good as
> dead, STONITH'd from the other node that is still able to reach the STONITH
> device? From what I've found during my initial tests, when a node drops
> from the network it attempts to STONITH the other but can't connect and
> fails. Is this the way it is intended to work? Both nodes STONITH each
> other and the one that succeeds wins?

That's what'll happen. Absent evidence of life from either node, the only safe thing to do is try to kill it. With only two nodes, you can't assume anything else, as there's no clear majority. By comparison, if there were three nodes, and one node couldn't see the others, it could safely assume that *it* was faulty.
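
As a rough illustration of the majority argument above, here is a minimal Python sketch. It is not Pacemaker's or heartbeat's actual quorum code; the function name `has_quorum` and its parameters are made up for this example. It only encodes the rule that a partition can trust itself when it sees a strict majority of the configured nodes.

```python
def has_quorum(visible_nodes: int, total_nodes: int) -> bool:
    """Return True if a partition forms a strict majority of the cluster.

    visible_nodes: nodes this partition can see, counting the local node.
    total_nodes:   nodes configured in the cluster.
    (Hypothetical helper for illustration only.)
    """
    if not 0 < visible_nodes <= total_nodes:
        raise ValueError("visible_nodes must be between 1 and total_nodes")
    # Strict majority: more than half of all configured nodes.
    return 2 * visible_nodes > total_nodes
```

With two nodes, a node that sees only itself gets `has_quorum(1, 2) == False`, and so does its peer, so neither side can conclude that it is the healthy one. With three nodes, the isolated node gets `has_quorum(1, 3) == False` while the other two get `has_quorum(2, 3) == True`, which is why the isolated node can assume the fault is its own.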

HTH,

Tim


--
Tim Serong <tserong [at] novell>
Senior Clustering Engineer, Novell Inc.





smith.c.tech at gmail

Nov 6, 2009, 11:06 AM

Post #3 of 5
Re: 2 node STONITH, beginner questions.

On Thu, Nov 5, 2009 at 10:30 PM, Tim Serong <TSERONG [at] novell> wrote:

>
> That's what'll happen. Absent evidence of life from either node, the only
> safe thing to do is try to kill it. With only two nodes, you can't assume
> anything else, as there's no clear majority. By comparison, if there were
> three nodes, and one node couldn't see the others, it could safely assume
> that *it* was faulty.
>
> HTH,
>
> Tim
>
>
Hi Tim-

Thanks much for the quick response. One more scenario/question: I've got
a 2 node cluster with 3 NICs each, 2 of which are used for direct crossover
cluster communication and the other goes to the switched network+stonith
device. If those 2 cross-connections are degraded such that cluster
communication ceases, each node will send a STONITH request to the device
for its peer, correct? In the event that both requests make it to the
STONITH device, both nodes would be shot? Is this a design flaw on my
part? Should all 3 interfaces be used for cluster communication?

Thanks again!
CSMITH


TSERONG at novell

Nov 8, 2009, 5:54 PM

Post #4 of 5
Re: 2 node STONITH, beginner questions.

On 11/7/2009 at 06:06 AM, c smith <smith.c.tech [at] gmail> wrote:
> On Thu, Nov 5, 2009 at 10:30 PM, Tim Serong <TSERONG [at] novell> wrote:
> >
> > That's what'll happen. Absent evidence of life from either node, the only
> > safe thing to do is try to kill it. With only two nodes, you can't assume
> > anything else, as there's no clear majority. By comparison, if there were
> > three nodes, and one node couldn't see the others, it could safely assume
> > that *it* was faulty.
> >
> > HTH,
> >
> > Tim
> >
> >
> Hi Tim-
>
> Thanks much for the quick response. One more scenario/question: I've got
> a 2 node cluster with 3 NICs each, 2 of which are used for direct crossover
> cluster communication and the other goes to the switched network+stonith
> device. If those 2 cross-connections are degraded such that cluster
> communication ceases, each node will send a STONITH request to the device
> for its peer, correct?

Yes.

> In the event that both requests make it to the
> STONITH device, both nodes would be shot?

Yes. Speaking of which, you might be mildly interested in reading:

http://ourobengr.com/ha
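
To make the "both nodes get shot" case concrete, here is a small Python sketch of the two-node situation after cluster communication is lost. It is a simplified model, not how stonithd is implemented: the function name `fencing_outcome` and the node names are hypothetical, and it assumes every request that reaches the STONITH device is carried out, ignoring timing (in practice the first node to be powered off may never get its own request out).

```python
from typing import Dict, Set


def fencing_outcome(reaches_device: Dict[str, bool]) -> Set[str]:
    """Return the set of nodes that get shot in a two-node split.

    reaches_device maps each node name to whether that node can still
    reach the STONITH device. Every node that can reach the device asks
    it to shoot the peer. (Hypothetical model for illustration only.)
    """
    if len(reaches_device) != 2:
        raise ValueError("this sketch only models a two-node cluster")
    node_a, node_b = reaches_device
    shot = set()
    if reaches_device[node_a]:
        shot.add(node_b)  # node_a's request for its peer gets through
    if reaches_device[node_b]:
        shot.add(node_a)  # node_b's request for its peer gets through
    return shot
```

For example, `fencing_outcome({"node1": True, "node2": True})` returns both nodes, matching the crossover-links-down scenario above, while `fencing_outcome({"node1": False, "node2": True})` returns only `node1`, the node that lost its path to the device.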

> Is this a design flaw on my part? Should all 3 interfaces be used for
> cluster communication?

Depends how paranoid you are :) On the general principle that one tries to avoid single points of failure, you've already achieved this for cluster communication by having two network links.

That being said, questions to consider include:

- If one link fails, can you get out there and fix it before the second
fails, and STONITH ensues?
- What is the chance of both network links failing simultaneously?
(possibly greater for, say, a dual port NIC vs. separate single-port
NICs... Or two NICs on the same bus, vs. different busses)
- If two links failed, what's the likelihood the third would also fail
at the same time (somewhere, there is a point of diminishing returns)?
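
A back-of-the-envelope way to think about these questions is sketched below in Python. The model and every number fed into it are assumptions for illustration, not measurements: each link is taken to fail independently with probability `p_single`, and an optional `p_common` stands in for a shared cause (for example a dual-port NIC or a shared bus) that takes all links down at once. The function name is hypothetical.

```python
def all_links_fail_probability(p_single: float, links: int,
                               p_common: float = 0.0) -> float:
    """Probability that every cluster link is down at the same time.

    p_single: assumed failure probability of one link, independent of
              the others.
    links:    number of redundant links.
    p_common: assumed probability of a common-cause event that kills
              all links together (shared NIC, shared bus, ...).
    """
    if links < 1:
        raise ValueError("need at least one link")
    # Either the common-cause event happens, or it does not and all
    # links happen to fail independently.
    independent = p_single ** links
    return p_common + (1.0 - p_common) * independent
```

Under this model, going from two to three independent links multiplies the all-links-down probability by another factor of `p_single`, but once `p_common` is larger than `p_single ** links` the shared cause dominates and extra links on the same hardware buy very little. That is one way to read the "point of diminishing returns" mentioned above.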

It's also worth thinking about the single connection to the STONITH device, which could also fail. This won't necessarily be catastrophic (one node won't take over the other's resources unless STONITH succeeds, so there shouldn't be any problem with data corruption) but it does mean that failover won't occur without manual intervention if the STONITH device is inaccessible.

Regards,

Tim


--
Tim Serong <tserong [at] novell>
Senior Clustering Engineer, Novell Inc.





smith.c.tech at gmail

Nov 9, 2009, 9:57 AM

Post #5 of 5
Re: 2 node STONITH, beginner questions.

> Yes. Speaking of which, you might be mildly interested in reading:
>
> http://ourobengr.com/ha
>
>
Already have. :) I believe that article and Dejan's PDF linked on
clusterlabs.org provide the best documentation on this subject. Good work
indeed!



> Depends how paranoid you are :) On the general principle that one tries to
> avoid single points of failure, you've already achieved this for cluster
> communication by having two network links.
>
>
This actually went into production this weekend and was implemented in
reaction to a catastrophic openais RRP failure. Overly paranoid, overly
redundant, but now we sleep well at night. Thanks for the help, Tim.

Best,
CSMITH
