
Mailing List Archive: Linux-HA: Pacemaker

Nodes not rejoining cluster

 

 



gregg at damagecontrolusa

Mar 29, 2012, 7:30 PM

Post #1 of 9
Nodes not rejoining cluster

I had a circuit breaker go out and take two of the 5 nodes in my cluster
down. Now that they're back up and running, they are not rejoining the
cluster.

Here is what I get from crm_mon -1:

Nodes 1, 2 and 3 (itchy, scratchy and walter) show the following:
============
Last updated: Thu Mar 29 19:04:05 2012
Last change: Thu Mar 29 19:04:03 2012 via cibadmin on walter
Stack: openais
Current DC: walter - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
5 Nodes configured, 5 expected votes
9 Resources configured.
============

Online: [ itchy scratchy walter butthead timmy ]


On butthead I get:

============
Last updated: Thu Mar 29 19:04:24 2012
Last change: Thu Mar 29 18:42:09 2012 via cibadmin on itchy
Stack: openais
Current DC: NONE
5 Nodes configured, 5 expected votes
9 Resources configured.
============

OFFLINE: [ itchy scratchy walter butthead timmy ]


On timmy I get:

============
Last updated: Thu Mar 29 19:04:20 2012
Last change:
Current DC: NONE
0 Nodes configured, unknown expected votes
0 Resources configured.
============
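
Editor's note: the three views above disagree on the DC and on which nodes are online. A minimal Python sketch of how such `crm_mon -1` summaries could be compared side by side follows. The function names and the trimmed sample strings are illustrative assumptions, not part of any Pacemaker tooling; only the line formats shown in this post are parsed.

```python
import re

def parse_crm_mon(text):
    """Extract the DC and the Online/OFFLINE node lists from crm_mon -1 text."""
    view = {"dc": None, "online": set(), "offline": set()}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Current DC:"):
            # "walter - partition with quorum" -> "walter"; "NONE" stays "NONE"
            view["dc"] = line.split(":", 1)[1].split(" - ")[0].strip()
        match = re.match(r"(Online|OFFLINE): \[(.*)\]", line)
        if match:
            view[match.group(1).lower()] = set(match.group(2).split())
    return view

def nodes_without_dc(views):
    """Return the nodes whose own view reports no designated controller."""
    return sorted(name for name, v in views.items() if v["dc"] in (None, "NONE"))

# Sample input: lines copied from the outputs quoted in this post (trimmed).
WALTER = """\
Current DC: walter - partition with quorum
5 Nodes configured, 5 expected votes
Online: [ itchy scratchy walter butthead timmy ]
"""
BUTTHEAD = """\
Current DC: NONE
5 Nodes configured, 5 expected votes
OFFLINE: [ itchy scratchy walter butthead timmy ]
"""
TIMMY = """\
Current DC: NONE
0 Nodes configured, unknown expected votes
"""

views = {
    "walter": parse_crm_mon(WALTER),
    "butthead": parse_crm_mon(BUTTHEAD),
    "timmy": parse_crm_mon(TIMMY),
}
print(nodes_without_dc(views))
```

With the sample input, the two nodes that lost power are the ones reporting no DC, while walter still lists all five nodes as online.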


I don't have anything important running yet, so I can do a full cleanup
of everything if needed.

I also get some weird behavior with timmy. I brought this node up with
the host name timmy.example.com and then changed the host name to timmy,
but when the cluster is offline, timmy.example.com shows up as offline.
I enter crm node delete timmy.example.com and it goes away until timmy
goes offline again.

Thanks,
Gregg Stock


andrew at beekhof

Mar 29, 2012, 10:15 PM

Post #2 of 9
Re: Nodes not rejoining cluster

Gotta have logs. From all 3 nodes mentioned.
Only then can we determine if the problem is at the corosync or
pacemaker layer - which is the prerequisite for figuring out what to
do next :)


_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


gregg at damagecontrolusa

Mar 30, 2012, 8:38 AM

Post #3 of 9
Re: Nodes not rejoining cluster

I took the last 200 lines of each log.

Attachments: butthead-corosync.txt (25.6 KB)
  butthead-var-log-messages.txt (25.9 KB)
  timmy-corosync.txt (20.3 KB)
  timmy-var-log-messages.txt (38.4 KB)
  walter-corosync.txt (28.2 KB)
  walter-var-log-messages.txt (28.3 KB)


florian at hastexo

Mar 30, 2012, 9:01 AM

Post #4 of 9
Re: Nodes not rejoining cluster

On Fri, Mar 30, 2012 at 5:38 PM, Gregg Stock <gregg [at] damagecontrolusa> wrote:
> I took the last 200 lines of each.

Can you check the health of the Corosync membership, as per this URL?

http://www.hastexo.com/resources/hints-and-kinks/checking-corosync-cluster-membership

Do _all_ nodes agree on the health of the rings, and on the cluster member list?
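
Editor's note: the agreement check asked about here can be sketched in Python as below. This is a hedged illustration only: how the per-node member lists are collected (for example with the tooling described at the URL above) is not shown, and the function name and the sample data are hypothetical. Corosync reports members by address; node names stand in for them here.

```python
def membership_disagreements(member_lists):
    """Map each node to the members it is missing, relative to the union of all views.

    member_lists: dict of node name -> iterable of members that node reports.
    An empty result means every node reports the same member list.
    """
    everyone = set().union(*member_lists.values())
    return {
        node: sorted(everyone - set(members))
        for node, members in member_lists.items()
        if set(members) != everyone
    }

ALL = ["itchy", "scratchy", "walter", "butthead", "timmy"]

# Hypothetical healthy case: all five nodes report the same five members.
healthy = {node: list(ALL) for node in ALL}
print(membership_disagreements(healthy))

# Hypothetical broken case: timmy only sees itself.
broken = dict(healthy, timmy=["timmy"])
print(membership_disagreements(broken))
```

If the result is empty on a cluster that still misbehaves, the membership layer is consistent and the problem lies higher up the stack.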

Florian

--
Need help with High Availability?
http://www.hastexo.com/now



gregg at damagecontrolusa

Mar 30, 2012, 9:09 AM

Post #5 of 9
Re: Nodes not rejoining cluster

That looks good. They were all the same and had the correct IP addresses.



florian at hastexo

Mar 30, 2012, 9:33 AM

Post #6 of 9
Re: Nodes not rejoining cluster

On Fri, Mar 30, 2012 at 6:09 PM, Gregg Stock <gregg [at] damagecontrolusa> wrote:
> That looks good. They were all the same and had the correct ip addresses.

So you've got both healthy rings, and all 5 nodes have 5 members in
the membership list?

Then this would make it a Pacemaker problem. IIUC the code causing
Pacemaker to discard the update from a node that is "not in our
membership" has actually been removed from 1.1.7[1] so an upgrade may
not be a bad idea, but you'll probably have to wait for a few more
days until packages become available.

Still, out of curiosity, and since you're saying this is a test
cluster: what happens if you shut down corosync and Pacemaker on *all*
the nodes, and bring it back up?

We've had a few people report these "not in our membership" issues on
the list before, and they seem to appear in a very sporadic and
transient fashion, so the root cause (which may well be totally
trivial) hasn't really been found out -- as far as I can tell, at
least. Hence, my question of whether the issue persists after a full
cluster shutdown.
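
Editor's note: the point of the full shutdown is that every node is stopped before any node is started again. A minimal Python sketch of that ordering follows. The `service ...` command strings are an assumption based on the el6 version string earlier in the thread, and they assume Pacemaker is managed by its own init script; if corosync launches Pacemaker itself, only the corosync steps would apply. The sketch only builds the plan and does not run anything.

```python
NODES = ["itchy", "scratchy", "walter", "butthead", "timmy"]

# Assumed init-script commands; adjust to however the stack is started locally.
STOP_COMMANDS = ("service pacemaker stop", "service corosync stop")
START_COMMANDS = ("service corosync start", "service pacemaker start")

def full_restart_plan(nodes):
    """Return (node, command) steps: stop the stack on all nodes, then start it."""
    # Stop phase: Pacemaker first, then corosync, on every node.
    stop = [(node, cmd) for node in nodes for cmd in STOP_COMMANDS]
    # Start phase begins only after the last node has been stopped.
    start = [(node, cmd) for node in nodes for cmd in START_COMMANDS]
    return stop + start

plan = full_restart_plan(NODES)
for node, cmd in plan:
    print(f"{node}: {cmd}")
```

Keeping the two phases separate is the design choice that matters: a rolling restart would leave some peers running with their old view of the cluster the whole time.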

Florian

[1] https://github.com/ClusterLabs/pacemaker/commit/03f6105592281901cc10550b8ad19af4beb5f72f
-- note Andrew will rightfully flame me to a crisp if I've
misinterpreted that commit, so caveat lector. :)



gregg at damagecontrolusa

Mar 30, 2012, 10:45 AM

Post #7 of 9
Re: Nodes not rejoining cluster

The full shutdown and restart fixed it.

Thanks for your help.



florian at hastexo

Mar 30, 2012, 10:52 AM

Post #8 of 9
Re: Nodes not rejoining cluster

On Fri, Mar 30, 2012 at 7:45 PM, Gregg Stock <gregg [at] damagecontrolusa> wrote:
> The full shutdown and restart fixed it.

Hrm. So it's transient after all. Andrew, think you nailed that one
with the commit I referred to upthread, or do you call heisenbug?

Cheers,
Florian



andrew at beekhof

Apr 15, 2012, 4:51 AM

Post #9 of 9
Re: Nodes not rejoining cluster

On Sat, Mar 31, 2012 at 3:33 AM, Florian Haas <florian [at] hastexo> wrote:
> [1] https://github.com/ClusterLabs/pacemaker/commit/03f6105592281901cc10550b8ad19af4beb5f72f
> -- note Andrew will rightfully flame me to a crisp if I've
> misinterpreted that commit, so caveat lector. :)

It's related, but as mentioned off-list, I've seen the same behaviour
even with that patch.

Somehow the process list never makes it to one of the peers (the
others get it fine), which causes much confusion.
The above patch merely ignores the process list in the cib; the crmd
will still be affected.
