Mailing List Archive: Linux-HA: Pacemaker

Multiple thread after rebooting server: the node doesn't go online

 

 



gdimilia at cfa

Nov 12, 2009, 3:21 PM

Post #1 of 10 (653 views)
Multiple thread after rebooting server: the node doesn't go online

I set up a cluster of two CentOS 5.4 x86_64 servers with pacemaker
1.0.6 and corosync 1.1.2.

I installed only the x86_64 packages (yum install pacemaker also tries
to install the 32-bit ones).

I configured a shared cluster IP (a public IP) and a cluster
website.

Everything works fine if I stop corosync on one of the two
servers (the services move from one machine to the other without
problems), but if I reboot one server, it cannot go online in the
cluster when it comes back up.
I also noticed that there are several corosync threads, and if I kill
all of them and then start corosync again, everything works fine again.

I don't know what is happening, and I cannot reproduce the
situation on virtual servers!

Thanks,
Giovanni



The corosync configuration is the following:

##############################################
# Please read the corosync.conf.5 manual page
compatibility: whitetank

aisexec {
        # Run as root - this is necessary to be able to manage resources with Pacemaker
        user: root
        group: root
}

service {
        # Load the Pacemaker Cluster Resource Manager
        ver: 0
        name: pacemaker
        use_mgmtd: yes
        use_logd: yes
}

totem {
        version: 2

        # How long before declaring a token lost (ms)
        token: 5000

        # How many token retransmits before forming a new configuration
        token_retransmits_before_loss_const: 10

        # How long to wait for join messages in the membership protocol (ms)
        join: 1000

        # How long to wait for consensus to be achieved before starting a new
        # round of membership configuration (ms)
        consensus: 2500

        # Turn off the virtual synchrony filter
        vsftype: none

        # Number of messages that may be sent by one processor on receipt of
        # the token
        max_messages: 20

        # Stagger sending the node join messages by 1..send_join ms
        send_join: 45

        # Limit generated nodeids to 31-bits (positive signed integers)
        clear_node_high_bit: yes

        # Disable encryption
        secauth: off

        # How many threads to use for encryption/decryption
        threads: 0

        # Optionally assign a fixed node id (integer)
        # nodeid: 1234

        interface {
                ringnumber: 0

                # The following values need to be set based on your environment
                # (here I put the right IP for my configuration)
                bindnetaddr: XXX.XXX.XXX.0
                mcastaddr: 226.94.1.1
                mcastport: 4000
        }
}

logging {
        fileline: off
        to_stderr: yes
        to_logfile: yes
        to_syslog: yes
        logfile: /tmp/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}

amf {
        mode: disabled
}

##################################################



_______________________________________________
Pacemaker mailing list
Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


kerdosa at gmail

Nov 13, 2009, 10:36 AM

Post #2 of 10 (621 views)
Re: Multiple thread after rebooting server: the node doesn't go online [In reply to]

Hi,

I have the same problem on CentOS 5.3 with pacemaker-1.0.5 and
openais-0.80.5. This is an openais bug! There are two problems:
1. Starting the openais service sometimes segfaults. This is more likely to
happen if the openais service is started before syslog.
2. The openais segfault handler calls syslog(). syslog() is one of the
unsafe functions that must not be called from a signal handler, because it
is a non-reentrant function.

To fix this issue: get the openais source, find the sigsegv_handler function
in exec/main.c and comment out log_flush(), as shown below. Then recompile
and install it (make and make install). log_flush() should be removed from
all signal handlers in the openais code base. I am still not sure where the
segfault occurs, but commenting out log_flush() prevents it.


-------------------------------------------------------------------------
static void sigsegv_handler (int num)
{
        signal (SIGSEGV, SIG_DFL);
        // log_flush ();
        raise (SIGSEGV);
}
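
For readers who still want a crash message, the point about unsafe functions can be sketched as follows. This is not the openais code: crash_msg and install_crash_handler are hypothetical names, and the sketch only assumes that write(), signal() and raise() are on the POSIX async-signal-safe list while syslog() is not. The handler restores the default action, reports through write() alone, and re-raises the signal so the process still terminates in the usual way.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Fixed message, so the handler needs no formatting or allocation. */
static const char crash_msg[] = "fatal: SIGSEGV received\n";

static void sigsegv_handler(int num)
{
    ssize_t written;

    (void)num;
    /* Restore the default action so the re-raised signal kills the process. */
    signal(SIGSEGV, SIG_DFL);
    /* write(2) is async-signal-safe; syslog() and stdio are not. */
    written = write(STDERR_FILENO, crash_msg, sizeof crash_msg - 1);
    (void)written;
    raise(SIGSEGV);
}

/* Hypothetical helper: install the handler; returns 0 on success, -1 on error. */
int install_crash_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sigsegv_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

A daemon would call install_crash_handler() once, early in main(), before any code that might fault.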

Thanks
hj




--
Dream with longterm vision!
kerdosa


gdimilia at cfa

Nov 13, 2009, 12:08 PM

Post #3 of 10 (618 views)
Re: Multiple thread after rebooting server: the node doesn't go online [In reply to]

Thank you very much for your response.

The only thing I really don't understand is why this problem doesn't
appear in any of my simulations.
I configured at least 7 pairs of virtual servers with VMware 2 and
CentOS 5.3 and 5.4 (32- and 64-bit) and I never had this kind of
problem!

The only difference in the configuration is that I used private IPs
for the simulations and public IPs for the real servers, but I don't
think that matters.

Thanks for your patience,
Giovanni





kerdosa at gmail

Nov 16, 2009, 1:51 PM

Post #4 of 10 (600 views)
Re: Multiple thread after rebooting server: the node doesn't go online [In reply to]

Hi,

Please disable syslog in openais.conf and try again. It seems this issue
is related to the fork() call and syslog().
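
Applied to the corosync.conf posted at the start of this thread, this would presumably mean switching off to_syslog in the logging section. The fragment below is a sketch based on that posted file, not a verified fix for every setup; only the to_syslog line differs, and the other keys are shown as they were posted.

```
logging {
        to_stderr: yes
        to_logfile: yes
        # Assumed change: stop sending messages to syslog
        to_syslog: no
        logfile: /tmp/corosync.log
}
```

With this change, messages still reach the log file, so nothing is lost for debugging.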

hj




gdimilia at cfa

Nov 17, 2009, 9:07 AM

Post #5 of 10 (589 views)
Re: Multiple thread after rebooting server: the node doesn't go online [In reply to]

With syslog disabled, the problem disappears.

Thank you very much,
Giovanni





gdimilia at cfa

Nov 17, 2009, 1:31 PM

Post #6 of 10 (590 views)
Re: Multiple thread after rebooting server: the node doesn't go online [In reply to]

Another problem has appeared:
after rebooting one server I often get a cluster partition, and
both servers elect themselves DC.

Even if the partition doesn't appear right after the reboot of one
server (e.g. serverA), it appears when I restart corosync on the other
server (e.g. serverB).
Then, if I also restart corosync on the first server (serverA),
everything works fine again.
But if I restart corosync on the second server (serverB), nothing
changes and the partition appears again.

It seems to me that there is still something wrong with the first
run of corosync right after the server reboot.

I didn't configure any fencing method, because I think that my
configuration is really simple and I don't need it.

Thanks again for your patience,
Giovanni




andrew at beekhof

Nov 19, 2009, 12:03 PM

Post #7 of 10 (588 views)
Re: Multiple thread after rebooting server: the node doesn't go online [In reply to]

On Tue, Nov 17, 2009 at 10:31 PM, Giovanni Di Milia
<gdimilia [at] cfa> wrote:
> Another problem has appeared:
> after the reboot of one server I often have a cluster partition and both
> servers elect themselves DC.
> Even if the partition doesn't appear just after the reboot of one server
> (i.e. serverA), if I try to restart corosync on the other server (i.e.
> serverB), the partition appear.
> Then if I also restart corosync on the first server (serverA) everything
> work fine again.
> But if I restart corosync on the second server (serverB) nothing change and
> the partition appears again.
> It's seems to me that there is still something wrong with the first run of
> corosync just after the server reboot.

I've found that it starts a bit too early by default.
Various systems seem to like messing with the network stack (Xen is
one, but there are others), which confuses corosync.

You're not getting addresses from a DHCP server, are you?
That's another common cause, since there can be a significant delay in
obtaining the address, which again messes with corosync.
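
One way to guard against the early start is to wait until the cluster interface actually has an address before launching corosync. The sketch below is only an illustration of that idea, not something corosync or its init script provides: iface_has_ipv4 and wait_for_ipv4 are hypothetical helpers, and the interface name is an assumption that depends on the host. It uses getifaddrs() to check whether the named interface currently carries an IPv4 address, polling once per second.

```c
#define _DEFAULT_SOURCE
#include <ifaddrs.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Return 1 if ifname currently has an IPv4 address, 0 if not, -1 on error. */
int iface_has_ipv4(const char *ifname)
{
    struct ifaddrs *list, *ifa;
    int found = 0;

    if (getifaddrs(&list) != 0)
        return -1;
    for (ifa = list; ifa != NULL; ifa = ifa->ifa_next) {
        if (ifa->ifa_addr == NULL)      /* entries without an address exist */
            continue;
        if (ifa->ifa_addr->sa_family == AF_INET &&
            strcmp(ifa->ifa_name, ifname) == 0) {
            found = 1;
            break;
        }
    }
    freeifaddrs(list);
    return found;
}

/* Check up to max_tries times, one second apart.
 * Returns 1 as soon as the address is present, otherwise 0. */
int wait_for_ipv4(const char *ifname, int max_tries)
{
    int i;

    for (i = 0; i < max_tries; i++) {
        if (iface_has_ipv4(ifname) == 1)
            return 1;
        if (i + 1 < max_tries)
            sleep(1);
    }
    return 0;
}
```

A small wrapper could call, say, wait_for_ipv4("eth0", 30) and start corosync only when it returns 1; both the interface name and the retry count are placeholders.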

> I didn't configure any fencing method, because I think that my configuration
> is really simple and I don't need it.

Do you need your data though?





gdimilia at cfa

Nov 19, 2009, 12:40 PM

Post #9 of 10 (582 views)
Permalink
Re: Multiple thread after rebooting server: the node doesn't go online [In reply to]

On Nov 19, 2009, at 3:03 PM, Andrew Beekhof wrote:
>
>> Another problem has appeared:
>> after the reboot of one server I often have a cluster partition and both
>> servers elect themselves DC.
>> Even if the partition doesn't appear just after the reboot of one server
>> (i.e. serverA), if I try to restart corosync on the other server (i.e.
>> serverB), the partition appears.
>> Then if I also restart corosync on the first server (serverA) everything
>> works fine again.
>> But if I restart corosync on the second server (serverB) nothing changes and
>> the partition appears again.
>> It seems to me that there is still something wrong with the first run of
>> corosync just after the server reboot.
>
> I've found that it starts a bit too early by default.
> Various systems seem to like messing with the network stack (xen is
> one but there are others) which confuses corosync.

I wrote a shell script that "manually starts" corosync 5 minutes after
the server starts and in this case the problem appears every time!
It's driving me crazy, because I can see that my script starts a while
after the server is up and I'm pretty sure everything is running!
On the other hand, if I start corosync manually just after the server
is up, everything works fine!


> You're not getting addresses from a DHCP server, are you?
> That's another common cause, since there can be a significant delay in
> obtaining the address - which again messes with corosync.

Absolutely not!
I have two servers with static public IPs.
I also added both servers to the /etc/hosts file: in general I
followed all the guidelines I found in the documentation.


>> I didn't configure any fencing method, because I think that my configuration
>> is really simple and I don't need it.
>
> Do you need your data though?


Do you mean it's better to configure a fencing method anyway?

Thank you very much for your help!
Giovanni


andrew at beekhof

Nov 19, 2009, 11:20 PM

Post #10 of 10 (578 views)
Permalink
Re: Multiple thread after rebooting server: the node doesn't go online [In reply to]

On Thu, Nov 19, 2009 at 9:40 PM, Giovanni Di Milia
<gdimilia [at] cfa> wrote:
>
> On Nov 19, 2009, at 3:03 PM, Andrew Beekhof wrote:
>>
>>> Another problem has appeared:
>>> after the reboot of one server I often have a cluster partition and both
>>> servers elect themselves DC.
>>> Even if the partition doesn't appear just after the reboot of one server
>>> (i.e. serverA), if I try to restart corosync on the other server (i.e.
>>> serverB), the partition appears.
>>> Then if I also restart corosync on the first server (serverA) everything
>>> works fine again.
>>> But if I restart corosync on the second server (serverB) nothing changes and
>>> the partition appears again.
>>> It seems to me that there is still something wrong with the first run of
>>> corosync just after the server reboot.
>>
>> I've found that it starts a bit too early by default.
>> Various systems seem to like messing with the network stack (xen is
>> one but there are others) which confuses corosync.
>
> I wrote a shell script that "manually starts" corosync 5 minutes after the
> server starts and in this case the problem appears every time!
> It's driving me crazy, because I can see that my script starts a while after
> the server is up and I'm pretty sure everything is running!
> On the other hand, if I start corosync manually just after the server is up,
> everything works fine!

I wonder if there is something in the environment.
Perhaps have your script dump the output of
env | sort
to a file and compare it to the logged-in case.
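
A sketch of that comparison, assuming the boot-time script is allowed to
write under /tmp. The helper names and file names are made up for
illustration:

```shell
#!/bin/sh
# dump_env FILE: write the current environment, sorted, to FILE.
dump_env() {
    env | sort > "$1"
}

# diff_env A B: show the variables that differ between two dumps.
# Lines starting with '<' appear only in A, lines with '>' only in B.
diff_env() {
    diff "$1" "$2"
}

# In the boot-time script, just before it starts corosync:
#   dump_env /tmp/env.boot
# From a logged-in root shell, just before starting corosync by hand:
#   dump_env /tmp/env.login
# Then compare the two dumps:
#   diff_env /tmp/env.boot /tmp/env.login
```

Variables such as PATH, HOME or the locale settings are the usual suspects
when a daemon behaves differently at boot than from an interactive shell.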

>> You're not getting addresses from a DHCP server, are you?
>> That's another common cause, since there can be a significant delay in
>> obtaining the address - which again messes with corosync.
>
> Absolutely not!
> I have two servers with static public IPs.
> I also added both servers to the /etc/hosts file: in general I followed
> all the guidelines I found in the documentation.
>
>>> I didn't configure any fencing method, because I think that my configuration
>>> is really simple and I don't need it.
>>
>> Do you need your data though?
>
> Do you mean it's better to configure a fencing method anyway?

yes
