
Mailing List Archive: Linux-HA: Pacemaker

getting started - crm hangs when adding resources, even "crm ra classes" hangs

 

 



phil at macprofessionals

Mar 13, 2012, 10:21 AM

Post #1 of 12 (2128 views)
getting started - crm hangs when adding resources, even "crm ra classes" hangs

I'm trying to set up pacemaker for the first time, following the instructions in Clusters from Scratch, on Debian squeeze, using pacemaker and corosync from squeeze-backports. I seem to have gotten as far as getting two nodes in the cluster:

# crm status
============
Last updated: Tue Mar 13 13:02:37 2012
Last change: Tue Mar 13 12:50:25 2012 via cibadmin on xenhost02
Stack: openais
Current DC: xenhost02 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ xenhost02 xen01 ]

However, that's as far as I can get. The next step in Clusters from Scratch is configuring an IP address resource. Running this command seems to never terminate, with no output:

crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 params ip=192.168.122.101 cidr_netmask=32 op monitor interval=30s

More interestingly, even "crm ra classes" never terminates, again with no output, and nothing appended to syslog.

I've also noticed that if I attempt to stop pacemaker (/etc/init.d/pacemaker stop), it doesn't stop. I get this in syslog:

Mar 13 12:35:52 xen01 crmd: [9937]: info: crm_shutdown: Requesting shutdown
Mar 13 12:35:52 xen01 crmd: [9937]: notice: crm_shutdown: Forcing shutdown in: 1200000ms
Mar 13 12:35:52 xen01 crmd: [9937]: info: do_shutdown_req: Sending shutdown request to DC: xenhost02
Mar 13 12:35:52 xen01 corosync[9897]: [TOTEM ] Retransmit List: 65
Mar 13 12:35:52 xen01 corosync[9897]: [TOTEM ] Retransmit List: 65
Mar 13 12:35:52 xen01 corosync[9897]: [TOTEM ] Retransmit List: 65
[repeating, several times per second]

I can only guess that some lower-level communication between the nodes is not working. The issue is I have no idea what the lower levels are, or how to troubleshoot them. I'm not even really sure what information I should supply to help with troubleshooting. Any guidance would be much appreciated.

_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


jsmith at argotec

Mar 13, 2012, 11:21 AM

Post #2 of 12 (2056 views)
Re: getting started - crm hangs when adding resources, even "crm ra classes" hangs

----- Original Message -----
> From: "Phillip Frost" <phil [at] macprofessionals>
> To: pacemaker [at] oss
> Sent: Tuesday, March 13, 2012 1:21:00 PM
> Subject: [Pacemaker] getting started - crm hangs when adding resources, even "crm ra classes" hangs
>
> I'm trying to set up pacemaker for the first time, following the
> instructions in clusters from scratch, on Debian squeeze, using
> pacemaker and corosync from squeeze-backports. I seem to have gotten
> as far as getting two nodes in the cluster:
>
> # crm status
> ============
> Last updated: Tue Mar 13 13:02:37 2012
> Last change: Tue Mar 13 12:50:25 2012 via cibadmin on xenhost02
> Stack: openais
> Current DC: xenhost02 - partition with quorum
> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
> 2 Nodes configured, 2 expected votes
> 0 Resources configured.
> ============
>
> Online: [ xenhost02 xen01 ]
>
> However, that's as far as I can get. The next step in clusters from
> scratch is configuring an IP address resource. Running this command
> seems to never terminate, with no output:
>
> crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 params
> ip=192.168.122.101 cidr_netmask=32 op monitor interval=30s
>
> more interestingly, even "crm ra classes" never terminates, again
> with no output, and nothing appended to syslog.
>
> I've also noticed that if I attempt to stop pacemaker
> (/etc/init.d/pacemaker stop), it doesn't stop. I get this in syslog:
>
> Mar 13 12:35:52 xen01 crmd: [9937]: info: crm_shutdown: Requesting
> shutdown
> Mar 13 12:35:52 xen01 crmd: [9937]: notice: crm_shutdown: Forcing
> shutdown in: 1200000ms
> Mar 13 12:35:52 xen01 crmd: [9937]: info: do_shutdown_req: Sending
> shutdown request to DC: xenhost02
> Mar 13 12:35:52 xen01 corosync[9897]: [TOTEM ] Retransmit List: 65
> Mar 13 12:35:52 xen01 corosync[9897]: [TOTEM ] Retransmit List: 65
> Mar 13 12:35:52 xen01 corosync[9897]: [TOTEM ] Retransmit List: 65
> [repeating, several times per second]
>

You don't have anything in the log from lrmd do you?

In Ubuntu 10.04 there is a bug in glib causing hanging on shutdown as well as hanging on some crm commands - there are patches out to fix it for Ubuntu specifically (https://bugs.launchpad.net/ubuntu/oneiric/+source/cluster-glue/+bug/821732). Not sure if they affect Debian too.

Here is an excerpt from the above bug to test for the problem. I think the steps would work for Debian too:
Open a few client->server connections:
lrmadmin -C ; lrmadmin -C ; lrmadmin -C ; lrmadmin -C
Check number of open sockets:
lsof -f | grep lrm_callback_sock | wc -l
Correct value is 2, but it will be 6 or 8. There's a socket leak.
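
The check quoted above can be wrapped in a small helper so the count is easy to compare before and after opening connections. This is only a sketch: the function name count_lrm_sockets is made up for illustration, and it assumes the lsof output contains the string lrm_callback_sock exactly as in the bug report.

```shell
# Hypothetical helper (not part of cluster-glue): count the lines that
# mention the lrmd callback socket in lsof output read from standard input.
count_lrm_sockets() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status zero when the count is 0.
    grep -c 'lrm_callback_sock' || true
}
```

Run "lsof -f | count_lrm_sockets" once before and once after the four "lrmadmin -C" calls. Going by the bug report, a count that grows beyond 2 would point to the socket leak.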

Here is the patch to glib2.0 that was needed to fix:
https://mail.gnome.org/archives/commits-list/2010-November/msg01816.html


HTH

Jake



phil at macprofessionals

Mar 13, 2012, 2:59 PM

Post #3 of 12 (2040 views)
Re: getting started - crm hangs when adding resources, even "crm ra classes" hangs

On Mar 13, 2012, at 2:21 PM, Jake Smith wrote:

>> From: "Phillip Frost" <phil [at] macprofessionals>
>> Subject: [Pacemaker] getting started - crm hangs when adding resources, even "crm ra classes" hangs
>>
>> more interestingly, even "crm ra classes" never terminates, again
>> with no output, and nothing appended to syslog.
>
> In Ubuntu 10.04 there is a bug in glib causing hanging on shutdown as well as hanging on some crm commands - there are patches out to fix it for Ubuntu specifically (https://bugs.launchpad.net/ubuntu/oneiric/+source/cluster-glue/+bug/821732). Not sure if they affect Debian too.

Seems to be the same issue, somewhat. I noticed sometimes I'd get lrmadmin -C to work once, but the 2nd time it would deadlock. That behavior was described in the launchpad link you gave.

It seems what's happened is the glib bug has been patched in debian unstable, and this raexecupstart patch is disabled in the cluster-glue package as described in launchpad. squeeze-backports took the package from unstable, but glib is not patched in squeeze, so raexecupstart.patch is still needed. Not re-enabled in squeeze-backports, however.

So, I built cluster-glue from the debian source package after manually applying that patch, and now I can run lrmadmin -C all day. Now it's also leaking sockets, but I guess I can live with that.

I was playing with building some other versions to see if I could do better, but it seems some time in the past 5 minutes, squeeze-backports got a new version. Either that, or I'm losing my mind. Either way, time to call it a day.



dejanmm at fastmail

Mar 14, 2012, 6:16 AM

Post #4 of 12 (2041 views)
Re: getting started - crm hangs when adding resources, even "crm ra classes" hangs

Hi,

On Tue, Mar 13, 2012 at 05:59:35PM -0400, Phillip Frost wrote:
> On Mar 13, 2012, at 2:21 PM, Jake Smith wrote:
>
> >> From: "Phillip Frost" <phil [at] macprofessionals>
> >> Subject: [Pacemaker] getting started - crm hangs when adding resources, even "crm ra classes" hangs
> >>
> >> more interestingly, even "crm ra classes" never terminates, again
> >> with no output, and nothing appended to syslog.
> >
> > In Ubuntu 10.04 there is a bug in glib causing hanging on shutdown as well as hanging on some crm commands - there are patches out to fix it for Ubuntu specifically (https://bugs.launchpad.net/ubuntu/oneiric/+source/cluster-glue/+bug/821732). Not sure if they affect Debian too.
>
> Seems to be the same issue, somewhat. I noticed sometimes I'd get lrmadmin -C to work once, but the 2nd time it would deadlock. That behavior was described in the launchpad link you gave.
>
> It seems what's happened is the glib bug has been patched in debian unstable, and this raexecupstart patch is disabled in the cluster-glue package as described in launchpad. squeeze-backports took the package from unstable, but glib is not patched in squeeze, so raexecupstart.patch is still needed. Not re-enabled in squeeze-backports, however.
>
> So, I built cluster-glue from the debian source package after manually applying that patch, and now I can run lrmadmin -C all day. Now it's also leaking sockets, but I guess I can live with that.

Do you have upstart at all? If not, the Debian package
shouldn't have upstart support enabled when building cluster-glue.

Cheers,

Dejan



florian at hastexo

Mar 14, 2012, 6:25 AM

Post #5 of 12 (2036 views)
Re: getting started - crm hangs when adding resources, even "crm ra classes" hangs

On Wed, Mar 14, 2012 at 2:16 PM, Dejan Muhamedagic <dejanmm [at] fastmail> wrote:
> Hi,
>
> On Tue, Mar 13, 2012 at 05:59:35PM -0400, Phillip Frost wrote:
>> On Mar 13, 2012, at 2:21 PM, Jake Smith wrote:
>>
>> >> From: "Phillip Frost" <phil [at] macprofessionals>
>> >> Subject: [Pacemaker] getting started - crm hangs when adding resources,    even "crm ra classes" hangs
>> >>
>> >> more interestingly, even "crm ra classes" never terminates, again
>> >> with no output, and nothing appended to syslog.
>> >
>> > In Ubuntu 10.04 there is a bug in glib causing hanging on shutdown as well as hanging on some crm commands - there are patches out to fix it for Ubuntu specifically (https://bugs.launchpad.net/ubuntu/oneiric/+source/cluster-glue/+bug/821732).  Not sure if they affect Debian too.
>>
>> Seems to be the same issue, somewhat. I noticed sometimes I'd get lrmadmin -C to work once, but the 2nd time it would deadlock. That behavior was described in the launchpad link you gave.
>>
>> It seems what's happened is the glib bug has been patched in debian unstable, and this raexecupstart patch is disabled in the cluster-glue package as described in launchpad. squeeze-backports took the package from unstable, but glib is not patched in squeeze, so raexecupstart.patch is still needed. Not re-enabled in squeeze-backports, however.
>>
>> So, I built cluster-glue from the debian source package after manually applying that patch, and now I can run lrmadmin -C all day. Now it's also leaking sockets, but I guess I can live with that.
>
> Do you have upstart at all? In that case, the debian package
> shouldn't have the upstart enabled when building cluster-glue.

The current cluster-glue package in squeeze-backports,
cluster-glue_1.0.9+hg2665-1~bpo60+2, has upstart disabled.
Double-check that you're running that version. If you do, and the
issue persists, please let us know.

Cheers,
Florian

--
Need help with High Availability?
http://www.hastexo.com/now



phil at macprofessionals

Mar 14, 2012, 6:37 AM

Post #6 of 12 (2031 views)
Re: getting started - crm hangs when adding resources, even "crm ra classes" hangs

On Mar 14, 2012, at 9:25 AM, Florian Haas wrote:

>> Do you have upstart at all? In that case, the debian package
>> shouldn't have the upstart enabled when building cluster-glue.
>
> The current cluster-glue package in squeeze-backports,
> cluster-glue_1.0.9+hg2665-1~bpo60+2, has upstart disabled.
> Double-check that you're running that version. If you do, and the
> issue persists, please let us know.

Indeed, that's the version that hit the repo last night when I decided to quit. This morning, I tried that version and concluded I was experiencing the same issue. Concurrently, I was building glib from the squeeze source packages but with the glib patch from https://bugs.launchpad.net/ubuntu/oneiric/+source/cluster-glue/+bug/821732 applied. I thought that fixed my problem. Now I'm not so sure I wasn't confused on which glib and cluster-glue I was using at the time. My debian-package-fu is not good enough to know how to test if my custom build of glib, or the glib from the squeeze repository is installed, and that's making it difficult to draw clear conclusions.

Whatever state I'm in, things seem better. I'm not experiencing deadlocks, nodes are properly joining the cluster, I was able to add a resource, and I'm not leaking sockets. I did have one node refuse to stop pacemakerd, and begin spewing "[TOTEM ] Retransmit List: 8c" to syslog. However, after SIGKILLing lrmd, pacemakerd and the init.d script terminated and I've since been unable to reproduce.



florian at hastexo

Mar 14, 2012, 6:45 AM

Post #7 of 12 (2033 views)
Re: getting started - crm hangs when adding resources, even "crm ra classes" hangs

On Wed, Mar 14, 2012 at 2:37 PM, Phillip Frost
<phil [at] macprofessionals> wrote:
> On Mar 14, 2012, at 9:25 AM, Florian Haas wrote:
>
>>> Do you have upstart at all? In that case, the debian package
>>> shouldn't have the upstart enabled when building cluster-glue.
>>
>> The current cluster-glue package in squeeze-backports,
>> cluster-glue_1.0.9+hg2665-1~bpo60+2, has upstart disabled.
>> Double-check that you're running that version. If you do, and the
>> issue persists, please let us know.
>
> Indeed, that's the version that hit the repo last night when I decided to quit. This morning, I tried that version and concluded I was experiencing the same issue.

Are you absolutely certain?

Can you confirm that you're running the ~bpo60+2 (note trailing "2")
build, that you're actually running an lrmd binary from that version
(meaning: that you properly killed your lrmd prior to installing that
package), _and_ that "lrmadmin -C" does *not* list "upstart"?

Florian



phil at macprofessionals

Mar 14, 2012, 8:58 AM

Post #8 of 12 (2031 views)
Re: getting started - crm hangs when adding resources, even "crm ra classes" hangs

On Mar 14, 2012, at 9:45 AM, Florian Haas wrote:
>>> The current cluster-glue package in squeeze-backports,
>>> cluster-glue_1.0.9+hg2665-1~bpo60+2, has upstart disabled.
>>> Double-check that you're running that version. If you do, and the
>>> issue persists, please let us know.
>>
>> Indeed, that's the version that hit the repo last night when I decided to quit. This morning, I tried that version and concluded I was experiencing the same issue.
>
> Are you absolutely certain?
>
> Can you confirm that you're running the ~bpo60+2 (note trailing "2")
> build, that you're actually running an lrmd binary from that version
> (meaning: that you properly killed your lrmd prior to installing that
> package), _and_ that "lrmadmin -C" does *not* list "upstart"?

Let's discard all of my previous conclusions. Apparently I was confused.

Now, I'm sure I'm running +2 on all three nodes. And, I restarted pacemaker and corosync on all the nodes. I'm basing my knowledge of what versions I'm running on apt-cache policy, output copied below. From that, I'm also reasonably sure that whatever patched versions of cluster-glue and glib I built are not installed now.

I can confirm that lrmadmin -C does not list upstart (also below). Nor does it leak sockets, as reported by "lsof -f | grep lrm_callback_sock". However, sometimes pacemakerd will not stop cleanly. I thought it might happen when stopping pacemaker on the current DC, but after successfully reproducing this failure twice, I couldn't do it again. Pacemakerd seems to exit, but fails to notify the other nodes of its shutdown. Syslog is flooded with "Retransmit List" messages (log attached). These persist until I stop corosync. When run immediately after stopping pacemaker and corosync on one node, "crm status" on the other nodes still reports that node as online. After a while, the stopped node switches to offline; I assume some timeout expires and the other nodes conclude that it crashed.

# lrmadmin -C
There are 4 RA classes supported:
lsb
ocf
heartbeat
stonith

# apt-cache policy pacemaker corosync cluster-glue libglib2.0-0
libglib2.0-0:
  Installed: 2.24.2-1
  Candidate: 2.24.2-1
  Version table:
 *** 2.24.2-1 0
        500 http://ftp.egr.msu.edu/debian/ squeeze/main amd64 Packages
        100 /var/lib/dpkg/status
cluster-glue:
  Installed: 1.0.9+hg2665-1~bpo60+2
  Candidate: 1.0.9+hg2665-1~bpo60+2
  Package pin: 1.0.9+hg2665-1~bpo60+2
  Version table:
 *** 1.0.9+hg2665-1~bpo60+2 1000
        100 http://backports.debian.org/debian-backports/ squeeze-backports/main amd64 Packages
        100 /var/lib/dpkg/status
     1.0.6-1 1000
        500 http://ftp.egr.msu.edu/debian/ squeeze/main amd64 Packages
corosync:
  Installed: 1.4.2-1~bpo60+1
  Candidate: 1.4.2-1~bpo60+1
  Package pin: 1.4.2-1~bpo60+1
  Version table:
 *** 1.4.2-1~bpo60+1 1000
        100 http://backports.debian.org/debian-backports/ squeeze-backports/main amd64 Packages
        100 /var/lib/dpkg/status
     1.2.1-4 1000
        500 http://ftp.egr.msu.edu/debian/ squeeze/main amd64 Packages
pacemaker:
  Installed: 1.1.6-2~bpo60+1
  Candidate: 1.1.6-2~bpo60+1
  Package pin: 1.1.6-2~bpo60+1
  Version table:
 *** 1.1.6-2~bpo60+1 1000
        100 http://backports.debian.org/debian-backports/ squeeze-backports/main amd64 Packages
        100 /var/lib/dpkg/status
     1.0.9.1+hg15626-1 1000
        500 http://ftp.egr.msu.edu/debian/ squeeze/main amd64 Packages
Attachments: pacemaker_shutdown.log.gz (3.46 KB)


florian at hastexo

Mar 14, 2012, 9:33 AM

Post #9 of 12 (2036 views)
Re: getting started - crm hangs when adding resources, even "crm ra classes" hangs

On Wed, Mar 14, 2012 at 4:58 PM, Phillip Frost
<phil [at] macprofessionals> wrote:
>> Can you confirm that you're running the ~bpo60+2 (note trailing "2")
>> build, that you're actually running an lrmd binary from that version
>> (meaning: that you properly killed your lrmd prior to installing that
>> package), _and_ that "lrmadmin -C" does *not* list "upstart"?
>
> Let's discard all of my previous conclusions. Apparently I was confused.
>
> Now, I'm sure I'm running +2 on all three nodes. And, I restarted pacemaker and corosync on all the nodes. I'm basing my knowledge of what versions I'm running on apt-cache policy, output copied below.

"dpkg -l <package>" would also tell you what versions you have
installed, in a more concise fashion.
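
As a sketch of that suggestion, the "dpkg -l" listing can be reduced to bare "package version" pairs. The helper name installed_versions is hypothetical, and it assumes the usual "dpkg -l" layout in which installed packages start with the status flag "ii".

```shell
# Hypothetical helper: reduce "dpkg -l" output (read from standard input)
# to "package version" pairs for packages in the installed ("ii") state.
installed_versions() {
    awk '$1 == "ii" { print $2, $3 }'
}
```

For example: "dpkg -l cluster-glue corosync pacemaker libglib2.0-0 | installed_versions". Note that a locally rebuilt package that keeps the same version string cannot be told apart from the archive build this way.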

> I can confirm that lrmadmin -C does not list upstart (also below). Nor does it leak sockets, as reported by "lsof -f | grep lrm_callback_sock".

Yep, no surprise here.

> However, sometimes pacemakerd will not stop cleanly.

OK. Whether this is related to your original problem or not is a
completely open question, jftr.

> I thought it might happen when stopping pacemaker on the current DC, but after successfully reproducing this failure twice, I couldn't do it again. Pacemakerd seems to exit, but fail to notify the other nodes of its shutdown. Syslog is flooded with "Retransmit List" messages (log attached). These persist until I stop corosync. Asked immediately after stopping pacemaker and corosync on one node, "crm status" other nodes will report that node as still online. After a while, the stopped node switches to offline; I assume some timeout is expiring and they are assuming it crashed.

You didn't give much other information, so I'm asking this on a hunch:
does your pacemaker service configuration stanza for corosync (either
in /etc/corosync/corosync.conf or in
/etc/corosync/service.d/pacemaker) say "ver: 0" or "ver: 1"?

Cheers,
Florian



phil at macprofessionals

Mar 14, 2012, 9:55 AM

Post #10 of 12 (2141 views)
Re: getting started - crm hangs when adding resources, even "crm ra classes" hangs

On Mar 14, 2012, at 12:33 PM, Florian Haas wrote:

>> However, sometimes pacemakerd will not stop cleanly.
>
> OK. Whether this is related to your original problem or not a complete
> open question, jftr.
>
>> I thought it might happen when stopping pacemaker on the current DC, but after successfully reproducing this failure twice, I couldn't do it again. Pacemakerd seems to exit, but fail to notify the other nodes of its shutdown. Syslog is flooded with "Retransmit List" messages (log attached). These persist until I stop corosync. Asked immediately after stopping pacemaker and corosync on one node, "crm status" other nodes will report that node as still online. After a while, the stopped node switches to offline; I assume some timeout is expiring and they are assuming it crashed.
>
> You didn't give much other information, so I'm asking this on a hunch:
> does your pacemaker service configuration stanza for corosync (either
> in /etc/corosync/corosync.conf or in
> /etc/corosync/service.d/pacemaker) say "ver: 0" or "ver: 1"?

I'm not sure if this is the same problem or not. I did experience a symptom that looked to my inexperienced eyes very similar before I installed 1.0.9+hg2665-1~bpo60+2 - that is, I'd try to stop pacemaker, and it wouldn't stop, and I'd get that flood of retransmits in syslog.

To answer your question, I am using "ver: 1". It's worth mentioning that the corosync.conf that comes with the packages in squeeze-backports has a service block with ver: 0 in it, which took me some time to discover. However, I've long ago removed it. Syslog seems to verify that ver: 1 is in effect:

Mar 14 12:02:34 xenhost02 pacemakerd: [7925]: info: get_config_opt: Found 'pacemaker' for option: name
Mar 14 12:02:34 xenhost02 pacemakerd: [7925]: info: get_config_opt: Found '1' for option: ver

After playing with this system more, it seems this problem of "Retransmit List" being flooded to syslog is not only on pacemakerd shutdown. For example, I was just trying to add a DRBD resource, and crm got hung up at "cib commit":

crm(drbd)# cib commit drbd
[long pause, some minutes long]
Could not commit shadow instance 'drbd' to the CIB: Remote node did not respond
ERROR: failed to commit the drbd shadow CIB

"corosync[7915]: [TOTEM ] Retransmit List: b7 b8 b9" is being flooded to syslog.
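
To put a number on "flooded", the retransmit messages can be counted per minute. The sketch below assumes the classic syslog timestamp layout seen in the excerpts in this thread (month, day, HH:MM:SS as the first three fields); the function name retransmits_per_minute is made up for illustration.

```shell
# Hypothetical helper: count corosync "Retransmit List" messages per minute
# in syslog text read from standard input.
retransmits_per_minute() {
    # Keep month, day and HH:MM of each matching line, then count duplicates.
    awk '/Retransmit List:/ { print $1, $2, substr($3, 1, 5) }' |
        sort | uniq -c |
        awk '{ print $2, $3, $4, $1 }'   # move the count to the end
}
```

Usage would be along the lines of "retransmits_per_minute < /var/log/syslog". The output is sorted as plain text, so it is only chronological within a single month.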

Every time I try to reproduce this, I can once or twice, but then no more. I'm beginning to think that to set this up, a node has to have been running for some time. I can reproduce it a few times because I try it on each node. Then I have to restart corosync on each node to get things working again, and after that, everything is fine, until I move on, spend some time reading documentation, and try again.

I'm assuming these "Retransmit List" messages in syslog indicate that corosync attempted to send a message to other nodes, did not receive acknowledgement, and is thus attempting to resend them. I know corosync uses IP multicast to communicate with the other nodes. Is it possible that my network is doing something that breaks multicast connectivity? Multicast IP isn't something I've ever had to deal with, so I'm not really sure. It's hard to find anything that talks about configuring a network for multicast that doesn't start talking about IP routers, which isn't relevant in my setup because all the cluster nodes are on the same VLAN, on the same switch. Could this be an issue? Is there a lower-level utility (like, ping) that I can use to verify multicast IP at a lower level?



andreas at hastexo

Mar 14, 2012, 3:06 PM

Post #11 of 12 (2072 views)
Re: getting started - crm hangs when adding resources, even "crm ra classes" hangs

On 03/14/2012 05:55 PM, Phillip Frost wrote:
> On Mar 14, 2012, at 12:33 PM, Florian Haas wrote:
>
>>> However, sometimes pacemakerd will not stop cleanly.
>>
>> OK. Whether this is related to your original problem or not a complete
>> open question, jftr.
>>
>>> I thought it might happen when stopping pacemaker on the current DC, but after successfully reproducing this failure twice, I couldn't do it again. Pacemakerd seems to exit, but fail to notify the other nodes of its shutdown. Syslog is flooded with "Retransmit List" messages (log attached). These persist until I stop corosync. Asked immediately after stopping pacemaker and corosync on one node, "crm status" other nodes will report that node as still online. After a while, the stopped node switches to offline; I assume some timeout is expiring and they are assuming it crashed.
>>
>> You didn't give much other information, so I'm asking this on a hunch:
>> does your pacemaker service configuration stanza for corosync (either
>> in /etc/corosync/corosync.conf or in
>> /etc/corosync/service.d/pacemaker) say "ver: 0" or "ver: 1"?
>
> I'm not sure if this is the same problem or not. I did experience a symptom that looked to my inexperienced eyes very similar before I installed 1.0.9+hg2665-1~bpo60+2 - that is, I'd try to stop pacemaker, and it wouldn't stop, and I'd get that flood of retransmits in syslog.
>
> To answer your question, I am using "ver: 1". It's worth mentioning that the corosync.conf that comes with the packages in squeeze-backports has a service block with ver: 0 in it, which took me some time to discover. However, I've long ago removed it. Syslog seems to verify that ver: 1 is in effect:
>
> Mar 14 12:02:34 xenhost02 pacemakerd: [7925]: info: get_config_opt: Found 'pacemaker' for option: name
> Mar 14 12:02:34 xenhost02 pacemakerd: [7925]: info: get_config_opt: Found '1' for option: ver
>
> After playing with this system more, it seems this problem of "Retransmit List" being flooded to syslog is not only on pacemakerd shutdown. For example, I was just trying to add a DRBD resource, and crm got hung up at "cib commit":
>
> crm(drbd)# cib commit drbd
> [long pause, some minutes long]
> Could not commit shadow instance 'drbd' to the CIB: Remote node did not respond
> ERROR: failed to commit the drbd shadow CIB
>
> "corosync[7915]: [TOTEM ] Retransmit List: b7 b8 b9" is being flooded to syslog.
>
> Every time I try to reproduce this, I can once or twice, but then no more. I'm beginning to think that to set this up, a node has to have been running for some time. I can reproduce it a few times because I try it on each node. Then I have to restart corosync on each node to get things working again, and after that, everything is fine, until I move on, spend some time reading documentation, and try again.
>
> I'm assuming these "Retransmit List" messages in syslog indicate that corosync attempted to send a message to other nodes, did not receive acknowledgement, and is thus attempting to resend them. I know corosync uses IP multicast to communicate with the other nodes. Is it possible that my network is doing something that breaks multicast connectivity? Multicast IP isn't something I've ever had to deal with, so I'm not really sure. It's hard to find anything that talks about configuring a network for multicast that doesn't start talking about IP routers, which isn't relevant in my setup because all the cluster nodes are on the same VLAN, on the same switch. Could this be an issue? Is there a lower-level utility (like, ping) that I can use to verify multicast IP at a lower level?
>

Besides testing broadcast or unicast (udpu) for corosync ... have you
checked for MTU size problems? corosync assumes an MTU of 1500 bytes by
default, as expected from standard Ethernet.
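
One way to probe for such an MTU problem is to send pings that exactly fill a 1500-byte packet with fragmentation forbidden. The helper below only does the arithmetic; its name is hypothetical, and it assumes plain IPv4 without header options.

```shell
# Hypothetical helper: largest ICMP echo payload that fits in one IPv4
# packet of the given MTU (20-byte IPv4 header without options plus an
# 8-byte ICMP header).
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}
```

With the Linux iputils ping, a test between two nodes could then look like: ping -M do -c 3 -s "$(icmp_payload_for_mtu 1500)" xen01. Here "-M do" sets the Don't Fragment bit, so if these pings fail while smaller ones succeed, something on the path is using a smaller MTU.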

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now




florian at hastexo

Mar 15, 2012, 1:50 AM

Post #12 of 12 (2046 views)
Re: getting started - crm hangs when adding resources, even "crm ra classes" hangs

On Wed, Mar 14, 2012 at 5:55 PM, Phillip Frost
<phil [at] macprofessionals> wrote:
> On Mar 14, 2012, at 12:33 PM, Florian Haas wrote:
>
>>> However, sometimes pacemakerd will not stop cleanly.
>>
>> OK. Whether this is related to your original problem or not a complete
>> open question, jftr.
>>
>>> I thought it might happen when stopping pacemaker on the current DC, but after successfully reproducing this failure twice, I couldn't do it again. Pacemakerd seems to exit, but fail to notify the other nodes of its shutdown. Syslog is flooded with "Retransmit List" messages (log attached).

"Retransmit List" log entries have popped up a few times recently, so
I've taken the liberty to write a quick post about them:

http://www.hastexo.com/resources/hints-and-kinks/whats-totem-retransmit-list-all-about-corosync

I've CC'd sdake and fabbione; perhaps if one of you guys could have a
quick peek and give a holler if there's anything that I could improve.
Thanks!

Cheers,
Florian
