
Mailing List Archive: Linux-HA: Users

disconnecting network of any node causes both nodes to be fenced

 

 



M.Sharfuddin at nds

Dec 4, 2011, 12:29 PM

Post #1 of 6
disconnecting network of any node causes both nodes to be fenced

This cluster reboots (fences) both nodes if I disconnect the network of
either node (simulating a network failure).

I want that, if a node is disconnected from the network, the resources
running on it are moved/migrated to the other node (the node still
connected to the network).

How can I prevent this cluster from rebooting (fencing) the healthy node
(i.e. the node whose network is still up/available/connected)?

I am using the following STONITH resource:
primitive sbd_stonith stonith:external/sbd \
meta target-role="Started" \
op monitor interval="3000" timeout="120" \
op start interval="0" timeout="120" \
op stop interval="0" timeout="120" \
params sbd_device="/dev/disk/by-id/scsi-360080e50002377b8000002ff4e4bc873"


--
Regards,

Muhammad Sharfuddin


andreas at hastexo

Dec 4, 2011, 2:47 PM

Post #2 of 6
Re: disconnecting network of any node causes both nodes to be fenced

Hello,

On 12/04/2011 09:29 PM, Muhammad Sharfuddin wrote:
> This cluster reboots (fences) both nodes if I disconnect the network of
> either node (simulating a network failure).

For a cluster node, complete loss of network is indistinguishable from
a dead peer.

>
> I want that, if a node is disconnected from the network, the resources
> running on it are moved/migrated to the other node (the node still
> connected to the network).

Use the ping RA for connectivity checks and location constraints to move
resources according to network connectivity (to external ping targets).
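
A minimal sketch of that idea (untested; the resource names, the group
g_myresources, and the ping target IP are placeholders you'd adapt):

primitive p_ping ocf:pacemaker:ping \
params host_list="192.168.0.1" multiplier="1000" \
op monitor interval="15s" timeout="60s"
clone cl_ping p_ping
location loc_run_on_connected g_myresources \
rule -inf: not_defined pingd or pingd lte 0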

>
> How can I prevent this cluster from rebooting (fencing) the healthy node
> (i.e. the node whose network is still up/available/connected)?

Multiple-failure scenarios are challenging, and the possible solutions
for a cluster are limited. With enough effort by an administrator, every
cluster can be "tested to death".

You can only minimize the possibility of a split-brain:

* use redundant cluster communication paths (limited to two with
corosync; see the sketch below)
* at least one communication path should be directly connected
* use a quorum node
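
Purely as an illustration (a corosync 1.x style totem section with two
rings; the networks and multicast addresses are placeholders, and
rrp_mode "passive" is just one common choice):

totem {
    version: 2
    rrp_mode: passive
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0
        mcastaddr: 239.255.1.1
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.0.0.0
        mcastaddr: 239.255.2.1
        mcastport: 5407
    }
}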

If you are using a network-connected fencing device, use that network
for cluster communication as well.

To prevent STONITH death matches, use power-off as the stonith action
and/or don't start cluster services on system startup.
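
For example (a sketch; "off" is the relevant Pacemaker property value,
and the init script name depends on your distribution):

crm configure property stonith-action=off
chkconfig openais off   # e.g. on SLES 11, keeps the cluster stack down at boot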

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now


>
> I am using the following STONITH resource:
> primitive sbd_stonith stonith:external/sbd \
> meta target-role="Started" \
> op monitor interval="3000" timeout="120" \
> op start interval="0" timeout="120" \
> op stop interval="0" timeout="120" \
> params sbd_device="/dev/disk/by-id/scsi-360080e50002377b8000002ff4e4bc873"
>
>
> --
> Regards,
>
> Muhammad Sharfuddin


M.Sharfuddin at nds

Dec 5, 2011, 3:21 AM

Post #3 of 6
Re: disconnecting network of any node causes both nodes to be fenced

On Sun, 2011-12-04 at 23:47 +0100, Andreas Kurz wrote:
> Hello,
>
> On 12/04/2011 09:29 PM, Muhammad Sharfuddin wrote:
> > This cluster reboots (fences) both nodes if I disconnect the network of
> > either node (simulating a network failure).
>
> For a cluster node, complete loss of network is indistinguishable from
> a dead peer.
>
> >
> > I want that, if a node is disconnected from the network, the resources
> > running on it are moved/migrated to the other node (the node still
> > connected to the network).
>
> Use the ping RA for connectivity checks and location constraints to move
> resources according to network connectivity (to external ping targets).
>
So does having a ping RA with an appropriate location rule at least make
sure that, if one node loses network connectivity (i.e. both nodes can't
see each other, while only one node is disconnected from the network),
the other, healthy node (the network-connected node) won't reboot?
Is that what you said?

> >
> > How can I prevent this cluster from rebooting (fencing) the healthy node
> > (i.e. the node whose network is still up/available/connected)?
>
> Multiple-failure scenarios are challenging, and the possible solutions
> for a cluster are limited. With enough effort by an administrator, every
> cluster can be "tested to death".
>
> You can only minimize the possibility of a split-brain:
>
> * use redundant cluster communication paths (limited to two with corosync)
in my test I disconnected every communication path of one node (and both
nodes rebooted)

> * at least one communication path should be directly connected
so a directly connected communication path plus a ping RA with a location
rule will prevent the reboot of the healthy (network-connected) node?

> * use a quorum node
>
i.e. I should add another node (a quorum node) to this two-node cluster.

> If you are using a network-connected fencing device, use that network
> for cluster communication as well.
>
> To prevent STONITH death matches, use power-off as the stonith action
> and/or don't start cluster services on system startup.
>
the cluster does not start at system startup

> Regards,
> Andreas
>
--
Regards,

Muhammad Sharfuddin



andreas at hastexo

Dec 5, 2011, 1:37 PM

Post #4 of 6
Re: disconnecting network of any node causes both nodes to be fenced

On 12/05/2011 12:21 PM, Muhammad Sharfuddin wrote:
>
> On Sun, 2011-12-04 at 23:47 +0100, Andreas Kurz wrote:
>> Hello,
>>
>> On 12/04/2011 09:29 PM, Muhammad Sharfuddin wrote:
>>> This cluster reboots (fences) both nodes if I disconnect the network of
>>> either node (simulating a network failure).
>>
>> For a cluster node, complete loss of network is indistinguishable from
>> a dead peer.
>>
>>>
>>> I want that, if a node is disconnected from the network, the resources
>>> running on it are moved/migrated to the other node (the node still
>>> connected to the network).
>>
>> Use the ping RA for connectivity checks and location constraints to move
>> resources according to network connectivity (to external ping targets).
>>
> So does having a ping RA with an appropriate location rule at least make
> sure that, if one node loses network connectivity (i.e. both nodes can't
> see each other, while only one node is disconnected from the network),
> the other, healthy node (the network-connected node) won't reboot?
> Is that what you said?

No ... in case of loss of the service network on one node, resources can
move to the other node if it has better connectivity. For this to work,
the nodes still need an extra communication path.

>
>>>
>>> How can I prevent this cluster from rebooting (fencing) the healthy node
>>> (i.e. the node whose network is still up/available/connected)?
>>
>> Multiple-failure scenarios are challenging, and the possible solutions
>> for a cluster are limited. With enough effort by an administrator, every
>> cluster can be "tested to death".
>>
>> You can only minimize the possibility of a split-brain:
>>
>> * use redundant cluster communication paths (limited to two with corosync)
> in my test I disconnected every communication path of one node (and both
> nodes rebooted)

Did you clone the sbd resource? If yes, don't do that. Start it as a
primitive, so that in case of a split brain at least one node needs to
start the stonith resource, which should give the other node an
advantage ... adding a start-delay should further increase that advantage.
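
An untested sketch, applied to your existing primitive (only the start
op changes; 20s is just an example value):

primitive sbd_stonith stonith:external/sbd \
meta target-role="Started" \
op monitor interval="3000" timeout="120" \
op start interval="0" timeout="120" start-delay="20s" \
op stop interval="0" timeout="120" \
params sbd_device="/dev/disk/by-id/scsi-360080e50002377b8000002ff4e4bc873"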

>
>> * at least one communication path should be directly connected
> so a directly connected communication path plus a ping RA with a location
> rule will prevent the reboot of the healthy (network-connected) node?

No, don't use the other node as a ping target ... that's ccm business ...
directly connected networks are simply less error-prone than switched
networks ... except for administrative interventions ;-)

>
>> * use a quorum node
>>
> i.e. I should add another node (a quorum node) to this two-node cluster.

Yes ... you can add a node in permanent standby mode, or starting
corosync without pacemaker should also work fine.
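
A minimal sketch of the standby variant (the node name "node3" is a
placeholder for your quorum node):

crm node standby node3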

>
>> If you are using a network-connected fencing device, use that network
>> for cluster communication as well.
>>
>> To prevent STONITH death matches, use power-off as the stonith action
>> and/or don't start cluster services on system startup.
>>
> the cluster does not start at system startup

fine

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now


>
>> Regards,
>> Andreas
>>
> --
> Regards,
>
> Muhammad Sharfuddin
>


lmb at suse

Dec 6, 2011, 8:36 AM

Post #5 of 6
Re: disconnecting network of any node causes both nodes to be fenced

On 2011-12-05T22:37:03, Andreas Kurz <andreas [at] hastexo> wrote:

> Did you clone the sbd resource? If yes, don't do that. Start it as a
> primitive, so that in case of a split brain at least one node needs to
> start the stonith resource, which should give the other node an
> advantage ... adding a start-delay should further increase that advantage.

start-delay=20s or so is a recommended setting here, yes. A patch to
the external.c plugin to actually relay the "start" to the external/*
agent would be helpful, or perhaps just adding a 20s delay there
directly ... That'd auto-fix this for all users of sbd.

> >> * use a quorum node
> > i.e. I should add another node (a quorum node) to this two-node cluster.
> Yes ... you can add a node in permanent standby mode, or starting
> corosync without pacemaker should also work fine.

The latter is probably the better choice; otherwise the node will
participate in the DC elections.

Alternatively, with the more recent sbd versions, you could also have a
redundant network quorum device via iSCSI; that would prevent the node
which loses network connectivity from fencing the other, since it would
have committed suicide.
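
A rough sketch of that variant (assuming an sbd version with
multi-device support; the iSCSI device path is a placeholder):

sbd -d /dev/disk/by-id/scsi-360080e50002377b8000002ff4e4bc873 \
-d /dev/disk/by-id/<iscsi-quorum-device> create

and then configure sbd and the stonith resource with both devices.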


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



dejanmm at fastmail

Dec 6, 2011, 9:20 AM

Post #6 of 6
Re: disconnecting network of any node causes both nodes to be fenced

Hi,

On Tue, Dec 06, 2011 at 05:36:09PM +0100, Lars Marowsky-Bree wrote:
> On 2011-12-05T22:37:03, Andreas Kurz <andreas [at] hastexo> wrote:
>
> > Did you clone the sbd resource? If yes, don't do that. Start it as a
> > primitive, so that in case of a split brain at least one node needs to
> > start the stonith resource, which should give the other node an
> > advantage ... adding a start-delay should further increase that advantage.
>
> start-delay=20s or so is a recommended setting here, yes. A patch to
> the external.c plugin to actually relay the "start" to the external/*

The agent (and external.c) never sees the start action. What would need
to be patched in the first place is stonithd. And, of course, we'd need
to introduce a new operation for that, which would then need to be
implemented in all agents. Or, alternatively, stonithd could perhaps
learn about supported methods from the agent.

Thanks,

Dejan

> agent would be helpful, or perhaps just adding a 20s delay there
> directly ... That'd auto-fix this for all users of sbd.
>
> > >> * use a quorum node
> > > i.e. I should add another node (a quorum node) to this two-node cluster.
> > Yes ... you can add a node in permanent standby mode, or starting
> > corosync without pacemaker should also work fine.
>
> The latter is probably the better choice, otherwise the node will
> participate in the DC elections.
>
> Alternatively, with the more recent sbd versions, you could also have a
> redundant network quorum device via iSCSI; that would prevent the node
> which loses network connectivity from fencing the other, since it would
> have committed suicide.
>
>
> Regards,
> Lars
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>
