
Mailing List Archive: Linux-HA: Users

libpthread segfaults

 

 



marcus at synchromedia

Feb 29, 2012, 2:14 AM

Post #1 of 10
libpthread segfaults

I've scrapped my old heartbeat config and I'm trying to start from a clean slate with corosync/pacemaker installed on Ubuntu Lucid from the ubuntu-ha PPA (http://ppa.launchpad.net/ubuntu-ha/ppa/ubuntu). I'm running corosync 1.2.0-0ubuntu1 and pacemaker 1.0.8+hg15494-2ubuntu2.

I have one server that is happy, but the other is segfaulting in libpthread in attrd and cib. Everything else on the server appears to be working ok.

There's a chunk of the log file here: http://pastie.org/3486981

Example segfaults:

Feb 29 09:40:27 www4 kernel: attrd[16632]: segfault at 8 ip 00007f563a5970e8 sp 00007fff89a6a7b8 error 6 in libpthread-2.11.1.so[7f563a58a000+18000]
Feb 29 09:40:27 www4 kernel: cib[16630]: segfault at 8 ip 00007f6425fe60e8 sp 00007fff31f29858 error 6 in libpthread-2.11.1.so[7f6425fd9000+18000]
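
Even without a stack trace, the two kernel lines above carry some information: subtracting the library's load address (the first number in the square brackets) from the faulting instruction pointer gives the offset inside libpthread-2.11.1.so. The Python sketch below is only an illustration of that arithmetic. The regular expression and the function name `fault_offset` are assumptions made for this example; they are not part of any kernel or Pacemaker tool.

```python
import re

# Pattern for the kernel's "segfault at ... ip ... in lib[base+size]" message.
# It is an assumption derived from the two log lines quoted above.
SEGFAULT_RE = re.compile(
    r"(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (process, library, offset of the faulting instruction in the library)."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    # Offset = instruction pointer minus the address the library was mapped at.
    offset = int(m.group("ip"), 16) - int(m.group("base"), 16)
    return m.group("proc"), m.group("lib"), offset


lines = [
    "Feb 29 09:40:27 www4 kernel: attrd[16632]: segfault at 8 ip 00007f563a5970e8 "
    "sp 00007fff89a6a7b8 error 6 in libpthread-2.11.1.so[7f563a58a000+18000]",
    "Feb 29 09:40:27 www4 kernel: cib[16630]: segfault at 8 ip 00007f6425fe60e8 "
    "sp 00007fff31f29858 error 6 in libpthread-2.11.1.so[7f6425fd9000+18000]",
]

for line in lines:
    proc, lib, offset = fault_offset(line)
    print(f"{proc}: {lib}+{offset:#x}")
```

Both processes fault at the same offset, 0xd0e8, so attrd and cib are most likely crashing in the same libpthread function. Turning that offset into a function name would need the matching debug symbols for the library.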

I don't know how to get a stack trace of these as I don't know how these programs are started.

Is this a known problem?

Marcus
--
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK info [at] han CRM solutions
marcus [at] synchromedia | http://www.synchromedia.co.uk/



_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


florian at hastexo

Feb 29, 2012, 5:23 AM

Post #2 of 10
Re: libpthread segfaults

On Wed, Feb 29, 2012 at 11:14 AM, Marcus Bointon
<marcus [at] synchromedia> wrote:
> I've scrapped my old heartbeat config and I'm trying to start from a clean slate with corosync/pacemaker

That's excellent!

> installed on Ubuntu Lucid from the ubuntu-ha PPA (http://ppa.launchpad.net/ubuntu-ha/ppa/ubuntu). I'm running corosync 1.2.0-0ubuntu1 and pacemaker 1.0.8+hg15494-2ubuntu2.

Wrong PPA. :)

https://launchpad.net/~ubuntu-ha-maintainers/+archive/ppa

You should really run on Corosync 1.4.2+ and Pacemaker 1.1.5+. And
that's what that PPA has. The versions you're running are pretty
ancient. :)

Hope this helps.

Cheers,
Florian

--
Need help with High Availability?
http://www.hastexo.com/now


marcus at synchromedia

Feb 29, 2012, 5:28 AM

Post #3 of 10
Re: libpthread segfaults

On 29 Feb 2012, at 14:23, Florian Haas wrote:

> You should really run on Corosync 1.4.2+ and Pacemaker 1.1.5+. And
> that's what that PPA has. The versions you're running are pretty
> ancient. :)

Well since none of it's working, I have no problem throwing it all away and starting again!

Thanks very much,

Marcus
--
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK info [at] han CRM solutions
marcus [at] synchromedia | http://www.synchromedia.co.uk/





marcus at synchromedia

Feb 29, 2012, 7:19 AM

Post #4 of 10
Re: libpthread segfaults

On 29 Feb 2012, at 14:28, Marcus Bointon wrote:

> Well since none of it's working, I have no problem throwing it all away and starting again!

My crashes have gone away, but I have other issues with the same server. The corosync service starts, and is found by the other node:

============
Last updated: Wed Feb 29 15:07:55 2012
Last change: Wed Feb 29 15:00:10 2012 via crmd on www5
Stack: openais
Current DC: www5 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Node www4: pending
Online: [ www5 ]

Running 'crm status' on www4 just gives "Connection to cluster failed: connection failed". In the log I have these lines from cib:

Feb 29 15:00:18 www4 cib: [24712]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
Feb 29 15:00:18 www4 cib: [24712]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
Feb 29 15:00:18 www4 cib: [24712]: WARN: retrieveCib: Cluster configuration not found: /var/lib/heartbeat/crm/cib.xml
Feb 29 15:00:18 www4 cib: [24712]: WARN: readCibXmlFile: Primary configuration corrupt or unusable, trying backup...
Feb 29 15:00:18 www4 cib: [24712]: WARN: readCibXmlFile: Continuing with an empty configuration.
Feb 29 15:00:18 www4 cib: [24712]: info: validate_with_relaxng: Creating RNG parser context
Feb 29 15:00:18 www4 corosync[24705]: [pcmk ] info: spawn_child: Forked child 24712 for process cib
Feb 29 15:00:18 www4 cib: [24712]: info: startCib: CIB Initialization completed successfully
Feb 29 15:00:18 www4 cib: [24712]: info: get_cluster_type: Cluster type is: 'openais'
Feb 29 15:00:18 www4 cib: [24712]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
Feb 29 15:00:18 www4 cib: [24712]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
Feb 29 15:00:18 www4 cib: [24712]: info: init_ais_connection_classic: Connection to our AIS plugin (9) failed: Library error (2)
Feb 29 15:00:18 www4 cib: [24712]: CRIT: cib_init: Cannot sign in to the cluster... terminating
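
When the log is long, it can help to pull out only the lines that cib logged at warning level or above. The Python sketch below does that for lines shaped like the ones above. The line layout encoded in `LOG_RE` and the function name `problems` are assumptions for illustration; other Pacemaker versions or syslog setups may format lines differently.

```python
import re

# Assumed layout of the daemon lines quoted above:
# "<date> <host> <daemon>: [<pid>]: <level>: <function>: <message>"
LOG_RE = re.compile(
    r"^(?P<when>\w{3}\s+\d+ [\d:]+) (?P<host>\S+) (?P<daemon>\w+): "
    r"\[(?P<pid>\d+)\]: (?P<level>\w+): (?P<func>\w+): (?P<msg>.*)$"
)


def problems(log_text, levels=("WARN", "ERROR", "CRIT")):
    """Return (level, function, message) for every line logged at one of `levels`."""
    found = []
    for line in log_text.splitlines():
        m = LOG_RE.match(line.strip())
        if m and m.group("level") in levels:
            found.append((m.group("level"), m.group("func"), m.group("msg")))
    return found


sample = """\
Feb 29 15:00:18 www4 cib: [24712]: WARN: retrieveCib: Cluster configuration not found: /var/lib/heartbeat/crm/cib.xml
Feb 29 15:00:18 www4 cib: [24712]: info: startCib: CIB Initialization completed successfully
Feb 29 15:00:18 www4 cib: [24712]: info: init_ais_connection_classic: Connection to our AIS plugin (9) failed: Library error (2)
Feb 29 15:00:18 www4 cib: [24712]: CRIT: cib_init: Cannot sign in to the cluster... terminating
"""

for level, func, msg in problems(sample):
    print(f"{level:5} {func}: {msg}")
```

Note that the line reporting the failed connection to the AIS plugin is logged at info level, so a filter on level alone misses it; the CRIT line from cib_init is only the consequence.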

cib appears to be fine on www5. I've never touched anything in /var/lib/heartbeat/crm - this is a completely vanilla config, though it may be that there are remnants of the old heartbeat config (which was only on www4) causing this. Can I just copy the contents of that folder from the other server?

Marcus
--
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK info [at] han CRM solutions
marcus [at] synchromedia | http://www.synchromedia.co.uk/





florian at hastexo

Feb 29, 2012, 7:33 AM

Post #5 of 10
Re: libpthread segfaults

On Wed, Feb 29, 2012 at 4:19 PM, Marcus Bointon
<marcus [at] synchromedia> wrote:

> cib appears to be fine on www5. I've never touched anything in /var/lib/heartbeat/crm - this is a completely vanilla config, though it may be that there are remnants of the old heartbeat config (which was only on www4) causing this. Can I just copy the contents of that folder from the other server?

No, there's an easier way to fix that problem. :)

You said this was a vanilla config that needn't be preserved, right?
Shut down Corosync on both nodes. Kill the contents of
/var/lib/heartbeat/crm/. Then bring everything back up.

That would also be a good opportunity to change your pacemaker service
configuration from "ver: 0" to "ver: 1" and start pacemakerd ("service
corosync start; service pacemaker start").

Cheers,
Florian

--
Need help with High Availability?
http://www.hastexo.com/now


marcus at synchromedia

Feb 29, 2012, 8:14 AM

Post #6 of 10
Re: libpthread segfaults

On 29 Feb 2012, at 16:33, Florian Haas wrote:

> No, there's an easier way to fix that problem. :)
>
> You said this was a vanilla config that needn't be preserved, right?
> Shut down Corosync on both nodes. Kill the contents of
> /var/lib/heartbeat/crm/. Then bring everything back up.

That worked fine on www5 but not www4, which didn't recreate the cib files. This time though it did not log any errors, all looks reasonable, but crm status is still failing to connect, there's still no cib process, and now www5 can't seem to see it either. I tried copying over the cib files from www5 (which seemed to be an empty xml config) but it didn't help: cib still isn't running.

Also now www5 no longer finds itself - crm status reports 0 nodes.

> That would also be a good opportunity to change your pacemaker service
> configuration from "ver: 0" to "ver: 1" and start pacemakerd ("service
> corosync start; service pacemaker start").

OK, I've done that, but I don't think that's my problem right now! pacemakerd is running on both nodes.

Marcus
--
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK info [at] han CRM solutions
marcus [at] synchromedia | http://www.synchromedia.co.uk/





florian at hastexo

Feb 29, 2012, 8:43 AM

Post #7 of 10
Re: libpthread segfaults

On Wed, Feb 29, 2012 at 5:14 PM, Marcus Bointon
<marcus [at] synchromedia> wrote:
> On 29 Feb 2012, at 16:33, Florian Haas wrote:
>
>> No, there's an easier way to fix that problem. :)
>>
>> You said this was a vanilla config that needn't be preserved, right?
>> Shut down Corosync on both nodes. Kill the contents of
>> /var/lib/heartbeat/crm/. Then bring everything back up.
>
> That worked fine on www5 but not www4, which didn't recreate the cib files. This time though it did not log any errors, all looks reasonable, but crm status is still failing to connect, there's still no cib process, and now www5 can't seem to see it either. I tried copying over the cib files from www5 (which seemed to be an empty xml config) but it didn't help: cib still isn't running.
>
> Also now www5 no longer finds itself - crm status reports 0 nodes.

My hunch is that you never properly shut down corosync on that one.
Did you check your ps output to see if it was really down? Corosync
1.2.x had some nasty shutdown issues when running with Pacemaker.

Let us know if that helped. Thanks!
Cheers,

Florian

--
Need help with High Availability?
http://www.hastexo.com/now


marcus at synchromedia

Feb 29, 2012, 9:46 AM

Post #8 of 10
Re: libpthread segfaults

On 29 Feb 2012, at 17:43, Florian Haas wrote:

> My hunch is that you never properly shut down corosync on that one.
> Did you check your ps output to see if it was really down? Corosync
> 1.2.x had some nasty shutdown issues when running with Pacemaker.

I shut down or killed anything vaguely related to corosync/crm/heartbeat/crm/cib and restarted corosync and pacemaker.

Now on www4 I can see a pacemaker process with crmd, pengine, lrmd and stonithd child processes, and on www5 I see those plus attrd and cib (which curiously are the same processes that were reporting segfaults when I was running the old version). www4 is correspondingly still failing to connect to cib.

Starting corosync by itself appears to work correctly on both - the logs show they see each other, no errors.

If on www4 I start attrd and cib manually (as root), they do run, and crm then manages to connect but reports no nodes. crm on www5 sees www4, but it's marked as 'pending'. pcmk on www4 logs that it can see www5.

Marcus
--
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK info [at] han CRM solutions
marcus [at] synchromedia | http://www.synchromedia.co.uk/





florian at hastexo

Feb 29, 2012, 12:03 PM

Post #9 of 10
Re: libpthread segfaults

On 02/29/12 18:46, Marcus Bointon wrote:
> On 29 Feb 2012, at 17:43, Florian Haas wrote:
>
>> My hunch is that you never properly shut down corosync on that one.
>> Did you check your ps output to see if it was really down? Corosync
>> 1.2.x had some nasty shutdown issues when running with Pacemaker.
>
> I shut down or killed anything vaguely related to corosync/crm/heartbeat/crm/cib and restarted corosync and pacemaker.
>
> Now on www4 I can see a pacemaker process with crmd, pengine, lrmd and stonithd child processes, and on www5 I see those plus attrd and cib (which curiously are the same processes that were reporting segfaults when I was running the old version). www4 is correspondingly still failing to connect to cib.
>
> Starting corosync by itself appears to work correctly on both - the logs show they see each other, no errors.
>
> If on www4 I start attrd and cib manually (as root), they do run, and crm then manages to connect but reports no nodes. crm on www5 sees www4, but it's marked as 'pending'. pcmk on www4 logs that it can see www5.
>
> Marcus

And you're sure you've got a healthy Corosync membership?
"corosync-cfgtool -s" shows all rings healthy? "corosync-objctl | grep
member" shows 2 members?

Cheers,
Florian

--
Need help with High Availability?
http://www.hastexo.com/now


marcus at synchromedia

Feb 29, 2012, 12:16 PM

Post #10 of 10
Re: libpthread segfaults

On 29 Feb 2012, at 21:03, Florian Haas wrote:

> And you're sure you've got a healthy Corosync membership?
> "corosync-cfgtool -s" shows all rings healthy? "corosync-objctl | grep
> member" shows 2 members?

I'm not sure what the output is supposed to look like, but it certainly gives the impression of being healthy:

On www4:
Printing ring status.
Local node ID 192961885
RING ID 0
id = 192.168.0.11
status = ring 0 active with no faults

On www5:
Printing ring status.
Local node ID 343956829
RING ID 0
id = 192.168.0.148
status = ring 0 active with no faults

Are they each meant to show both nodes here?

On both nodes:
runtime.totem.pg.mrp.srp.members.343956829.ip=r(0) ip(192.168.0.148)
runtime.totem.pg.mrp.srp.members.343956829.join_count=1
runtime.totem.pg.mrp.srp.members.343956829.status=joined
runtime.totem.pg.mrp.srp.members.192961885.ip=r(0) ip(192.168.0.11)
runtime.totem.pg.mrp.srp.members.192961885.join_count=1
runtime.totem.pg.mrp.srp.members.192961885.status=joined
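
The "2 members" check can also be done mechanically by grouping those keys per node ID and counting the nodes whose status is joined. The Python sketch below works on the text quoted above. The key layout in `MEMBER_RE` is taken from that output, and the function names `members` and `joined` are made up for this example; treat it as an assumption that other Corosync versions print the same keys.

```python
import re

# Assumed key layout, taken from the corosync-objctl output quoted above:
# runtime.totem.pg.mrp.srp.members.<nodeid>.<field>=<value>
MEMBER_RE = re.compile(
    r"^runtime\.totem\.pg\.mrp\.srp\.members\."
    r"(?P<nodeid>\d+)\.(?P<field>\w+)=(?P<value>.*)$"
)


def members(objctl_output):
    """Group member fields by node ID: {nodeid: {field: value}}."""
    nodes = {}
    for line in objctl_output.splitlines():
        m = MEMBER_RE.match(line.strip())
        if m:
            nodes.setdefault(m.group("nodeid"), {})[m.group("field")] = m.group("value")
    return nodes


def joined(objctl_output):
    """Return the sorted node IDs whose status is 'joined'."""
    return sorted(
        nodeid
        for nodeid, fields in members(objctl_output).items()
        if fields.get("status") == "joined"
    )


sample = """\
runtime.totem.pg.mrp.srp.members.343956829.ip=r(0) ip(192.168.0.148)
runtime.totem.pg.mrp.srp.members.343956829.join_count=1
runtime.totem.pg.mrp.srp.members.343956829.status=joined
runtime.totem.pg.mrp.srp.members.192961885.ip=r(0) ip(192.168.0.11)
runtime.totem.pg.mrp.srp.members.192961885.join_count=1
runtime.totem.pg.mrp.srp.members.192961885.status=joined
"""

print(len(joined(sample)), "joined members:", joined(sample))
```

Two joined members, with node IDs matching the "Local node ID" values from the ring status on each host, is what a healthy two-node Corosync membership would be expected to show.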

But crm status gives this on www4 (this is still running my manually launched cib/attrd):

============
Last updated: Wed Feb 29 20:14:08 2012
Last change: Wed Feb 29 17:34:55 2012
Current DC: NONE
0 Nodes configured, unknown expected votes
0 Resources configured.
============

and this on www5

============
Last updated: Wed Feb 29 20:14:01 2012
Last change: Wed Feb 29 17:29:20 2012 via crmd on www5
Stack: openais
Current DC: www5 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Node www4: pending
Online: [ www5 ]

Any the wiser?

Marcus
--
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK info [at] han CRM solutions
marcus [at] synchromedia | http://www.synchromedia.co.uk/



