Mailing List Archive: Linux-HA: Pacemaker

Pengine behavior

 

 



vitaliy.davudov at vts24

Jul 11, 2012, 5:34 AM

Post #1 of 7 (407 views)
Pengine behavior

Hi, list!

I have configured a cluster for a VoIP application.
Here is my configuration:

# crm configure show
node $id="552f91eb-e70a-40a5-ac43-cb16e063fdba" freeswitch1 \
        attributes standby="off"
node $id="c86ab64d-26c4-4595-aa32-bf9d18f714e7" freeswitch2 \
        attributes standby="off"
primitive FailoverIP1 ocf:heartbeat:IPaddr2 \
        params iflabel="FoIP1" ip="91.211.219.142" cidr_netmask="30" nic="eth1.50" \
        op monitor interval="1s"
primitive FailoverIP2 ocf:heartbeat:IPaddr2 \
        params iflabel="FoIP2" ip="172.30.0.1" cidr_netmask="16" nic="eth1.554" \
        op monitor interval="1s"
primitive FailoverIP3 ocf:heartbeat:IPaddr2 \
        params iflabel="FoIP3" ip="10.18.1.1" cidr_netmask="24" nic="eth1.552" \
        op monitor interval="1s"
primitive fs lsb:FSSofia \
        op monitor interval="1s" enabled="false" timeout="2s" on-fail="standby" \
        meta target-role="Started"
group HAServices FailoverIP1 FailoverIP2 FailoverIP3 \
        meta target-role="Started"
order FS-after-IP inf: HAServices fs
property $id="cib-bootstrap-options" \
        dc-version="1.0.12-unknown" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="false" \
        expected-quorum-votes="1" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1299964019"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"

When the first node crashed, the second node became active. During this
process I found the following lines in the ha-debug file:

...
Jul 06 17:16:42 freeswitch1 crmd: [3385]: info: start_subsystem: Starting sub-system "pengine"
Jul 06 17:16:42 freeswitch1 pengine: [3675]: info: Invoked: /usr/lib64/heartbeat/pengine
Jul 06 17:16:42 freeswitch1 pengine: [3675]: info: main: Starting pengine
Jul 06 17:16:46 freeswitch1 crmd: [3385]: info: do_dc_takeover: Taking over DC status for this partition
Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_readwrite: We are now in R/W mode
Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/11, version=0.391.20): ok (rc=0)
Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/12, version=0.391.20): ok (rc=0)
Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/14, version=0.391.20): ok (rc=0)
...

After "Starting pengine", the next action occurred only 4 seconds later.
What happens during this time? Is it possible to reduce it?
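
A pause like this can also be located mechanically in a longer ha-debug file. The following is only a minimal Python sketch: it assumes syslog-style timestamps as in the excerpt, takes the year 2012 from the posting date because the log lines carry none, and relies on an English locale for the month abbreviation. The name `largest_gap` is illustrative and not part of any Pacemaker tool.

```python
import re
from datetime import datetime

# Syslog-style prefix used in the excerpt, e.g. "Jul 06 17:16:42".
STAMP = re.compile(r"^([A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2}) ")


def largest_gap(lines, year=2012):
    """Return (seconds, earlier_line, later_line) for the longest pause."""
    stamped = []
    for line in lines:
        m = STAMP.match(line)
        if m:  # wrapped continuation lines carry no timestamp and are skipped
            when = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            stamped.append((when, line))
    best = (0.0, None, None)
    for (t0, a), (t1, b) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if gap > best[0]:
            best = (gap, a, b)
    return best


# Four lines taken from the ha-debug excerpt in this post.
LOG = """\
Jul 06 17:16:42 freeswitch1 crmd: [3385]: info: start_subsystem: Starting sub-system "pengine"
Jul 06 17:16:42 freeswitch1 pengine: [3675]: info: main: Starting pengine
Jul 06 17:16:46 freeswitch1 crmd: [3385]: info: do_dc_takeover: Taking over DC status for this partition
Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_readwrite: We are now in R/W mode
""".splitlines()

seconds, before, after = largest_gap(LOG)
print(f"{seconds:.0f} s between:\n  {before}\n  {after}")
```

On these four lines it reports the 4-second pause between "Starting pengine" and do_dc_takeover.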

Thanks in advance.

--
Best regards,
Vitaly


dvossel at redhat

Jul 11, 2012, 11:40 AM

Post #2 of 7 (389 views)
Re: Pengine behavior

----- Original Message -----
> From: "Виталий Давудов" <vitaliy.davudov [at] vts24>
> To: pacemaker [at] oss
> Sent: Wednesday, July 11, 2012 7:34:08 AM
> Subject: [Pacemaker] Pengine behavior
> # crm configure show
> node $id="552f91eb-e70a-40a5-ac43-cb16e063fdba" freeswitch1 \
> attributes standby="off"

Ah... right here is your problem. You are using freeswitch instead of Asterisk :P

> After "Starting pengine", the next action occurred only 4 seconds later.
> What happens during this time? Is it possible to reduce it?

I seem to remember seeing something related to this in the code at one point. I believe it is limited to the use of heartbeat as the messaging layer. After starting the pengine, the crmd sleeps, waiting for the pengine to start before contacting it. The sleep is just a guess at how long it will take before the pengine is up and ready to accept a connection, though. That's why it is so long: so the gap will hopefully be large enough that no one ever runs into any problems with it. (I am not a big fan of this type of logic at all.) I'd recommend moving to corosync and seeing if this delay goes away.
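
To illustrate why a fixed sleep is wasteful, here is a minimal Python sketch of the usual alternative: poll for readiness with a short interval and an upper bound. It is not Pacemaker code (the crmd is written in C), and `wait_until_ready`, the 0.2 s startup time and the polling interval are all made up for the example. The 4 s limit simply mirrors the gap seen in the log.

```python
import threading
import time


def wait_until_ready(is_ready, timeout_s=4.0, poll_s=0.05):
    """Poll is_ready() until it returns True or timeout_s elapses.

    Returns the seconds actually waited; raises TimeoutError otherwise.
    """
    start = time.monotonic()
    while True:
        if is_ready():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            raise TimeoutError("child did not become ready in time")
        time.sleep(poll_s)


# Hypothetical stand-in for the child process: it signals readiness after 0.2 s.
ready = threading.Event()
threading.Timer(0.2, ready.set).start()

waited = wait_until_ready(ready.is_set)
print(f"ready after {waited:.2f} s instead of a fixed 4 s sleep")
```

The caller proceeds as soon as the child is up, and the timeout still bounds the worst case.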

-- Vossel

_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


vitaliy.davudov at vts24

Jul 11, 2012, 11:47 PM

Post #3 of 7 (390 views)
Re: Pengine behavior

David, thanks for your answer!

I'll try to migrate to corosync.


--
Best regards,
Vitaliy Fedorovich Davudov
VIP-TELECOM-SERVICE LLC
(ETERIA Group of Companies)
http://www.vts24.ru
Tel: (495) 989-47-00






vitaliy.davudov at vts24

Jul 19, 2012, 6:08 AM

Post #4 of 7 (384 views)
Re: Pengine behavior

Hi!

I have moved my cluster from heartbeat to corosync.
Here is the content of corosync.conf:

compatibility: whitetank

totem {
        version: 2
        token: 500
        downcheck: 500
        secauth: off
        threads: 0
        interface {
                ringnumber: 0
                bindnetaddr: 10.10.1.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}

logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        to_syslog: yes
        logfile: /var/log/corosync.log
        debug: on
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}

amf {
        mode: disabled
}

quorum {
        provider: corosync_votequorum
        expected_votes: 1
}

The Pacemaker configuration is unchanged.

After the first node crashed, I can see in its corosync.log that monitoring
stopped at 15:53:24 (i.e. the node crashed at 15:53:24):

Jul 19 15:53:22 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP2:12: monitor
Jul 19 15:53:22 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP1:10: monitor
Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP3:14: monitor
Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:fs:16: monitor
Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: RA output: (fs:monitor:stdout) OK
Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP2:12: monitor
Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP1:10: monitor
Jul 19 15:53:24 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP3:14: monitor
Jul 19 15:55:00 corosync [MAIN ] Corosync Cluster Engine ('1.2.7'): started and ready to provide service.
Jul 19 15:55:00 corosync [MAIN ] Corosync built-in features: nss rdma
On the second node, corosync.log shows:

Jul 19 15:53:27 corosync [TOTEM ] The token was lost in the OPERATIONAL state.
Jul 19 15:53:27 corosync [TOTEM ] A processor failed, forming new configuration.
Jul 19 15:53:27 corosync [TOTEM ] Receive multicast socket recv buffer size (262142 bytes).
Jul 19 15:53:27 corosync [TOTEM ] Transmit multicast socket send buffer size (262142 bytes).
Jul 19 15:53:27 corosync [TOTEM ] entering GATHER state from 2.
Jul 19 15:53:28 corosync [TOTEM ] entering GATHER state from 0.

That is, the second node detected the crash after 3 seconds.

Is there any way to reduce this amount of time?
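
One caveat about the 3-second figure: the log timestamps have one-second resolution, so the real interval is only known to within about a second either way. The small Python sketch below makes that explicit. It assumes both nodes' clocks are synchronized (for example via NTP), that the last monitor entry on the first node marks the crash (strictly it is only the earliest possible crash time), and the year 2012 from the posting date. `latency_bounds` is an illustrative name, not an existing tool.

```python
from datetime import datetime

FMT = "%Y %b %d %H:%M:%S"


def latency_bounds(last_seen, detected, year=2012, resolution_s=1.0):
    """Return (min, nominal, max) seconds between two truncated timestamps."""
    t0 = datetime.strptime(f"{year} {last_seen}", FMT)
    t1 = datetime.strptime(f"{year} {detected}", FMT)
    nominal = (t1 - t0).total_seconds()
    # Each timestamp may be off by up to one resolution step.
    return (max(nominal - resolution_s, 0.0), nominal, nominal + resolution_s)


# Last lrmd monitor on node 1 vs. first TOTEM "token was lost" on node 2.
low, nominal, high = latency_bounds("Jul 19 15:53:24", "Jul 19 15:53:27")
print(f"detection latency: about {nominal:.0f} s (between {low:.0f} and {high:.0f} s)")
```

Either way, the interval is well above the 500 ms token timeout set in corosync.conf.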

Thanks in advance for any hints.


--
Best regards,
Vitaly


dvossel at redhat

Jul 19, 2012, 7:43 AM

Post #5 of 7 (376 views)
Re: Pengine behavior

----- Original Message -----
> From: "Виталий Давудов" <vitaliy.davudov [at] vts24>
> To: "The Pacemaker cluster resource manager" <pacemaker [at] oss>
> Sent: Thursday, July 19, 2012 8:08:12 AM
> Subject: Re: [Pacemaker] Pengine behavior
> That is, the second node detected the crash after 3 seconds.
>
> Is there any way to reduce this amount of time?
>

Are you trying to do active call failover or something? How quickly do you need this failure detected? Are you hoping the failover will just be a blip in the audio? There may be a way to monitor the node more aggressively with some sort of ping, but less than 3 seconds is very aggressive.

I haven't dealt with trying to optimize this to the point you probably need. Hopefully someone else has some ideas. I'm sure you have more potential for optimization using the corosync stack, though.

-- Vossel



vitaliy.davudov at vts24

Jul 20, 2012, 1:39 AM

Post #6 of 7 (365 views)
Re: Pengine behavior

Hi, David!

Yes, you are right, I'm trying to do active call failover. I hope to
get the silence during a call down to 3 seconds (it is 5 seconds now).
If corosync has any directive to monitor the node more aggressively
(every second), I'd be very happy.


19.07.2012 18:43, David Vossel пишет:
> ----- Original Message -----
>> From: "Виталий Давудов" <vitaliy.davudov [at] vts24>
>> To: "The Pacemaker cluster resource manager" <pacemaker [at] oss>
>> Sent: Thursday, July 19, 2012 8:08:12 AM
>> Subject: Re: [Pacemaker] Pengine behavior
>>
>>
>> Hi!
>>
>> I had moved my cluster from heartbeat to corosync.
>> Here corosync.conf content:
>>
>> compatibility: whitetank
>>
>> totem {
>> version: 2
>> token: 500
>> downcheck: 500
>> secauth: off
>> threads: 0
>> interface {
>> ringnumber: 0
>> bindnetaddr: 10.10.1.0
>> mcastaddr: 226.94.1.1
>> mcastport: 5405
>> }
>> }
>>
>> logging {
>> fileline: off
>> to_stderr: no
>> to_logfile: yes
>> to_syslog: yes
>> logfile: /var/log/corosync.log
>> debug: on
>> timestamp: on
>> logger_subsys {
>> subsys: AMF
>> debug: off
>> }
>> }
>>
>> amf {
>> mode: disabled
>> }
>>
>> quorum {
>> provider: corosync_votequorum
>> expected_votes: 1
>> }
>>
>> Pacemaker configuration is not changed.
>>
>> After first node crashed in corosync.log I can see that monitoring
>> stoped at 15:15:24 (i.e. node crashed at 15:15:24 ):
>>
>> Jul 19 15:53:22 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP2:12:
>> monitor
>> Jul 19 15:53:22 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP1:10:
>> monitor
>> Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP3:14:
>> monitor
>> Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:fs:16: monitor
>> Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: RA output:
>> (fs:monitor:stdout) OK
>> Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP2:12:
>> monitor
>> Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP1:10:
>> monitor
>> Jul 19 15:53:24 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP3:14:
>> monitor
>> Jul 19 15:55:00 corosync [MAIN ] Corosync Cluster Engine ('1.2.7'):
>> started and ready to provide service.
>> Jul 19 15:55:00 corosync [MAIN ] Corosync built-in features: nss rdma
>>
>> On second node in corosync.log:
>>
>> Jul 19 15:53:27 corosync [TOTEM ] The token was lost in the
>> OPERATIONAL state.
>> Jul 19 15:53:27 corosync [TOTEM ] A processor failed, forming new
>> configuration.
>> Jul 19 15:53:27 corosync [TOTEM ] Receive multicast socket recv
>> buffer size (262142 bytes).
>> Jul 19 15:53:27 corosync [TOTEM ] Transmit multicast socket send
>> buffer size (262142 bytes).
>> Jul 19 15:53:27 corosync [TOTEM ] entering GATHER state from 2.
>> Jul 19 15:53:28 corosync [TOTEM ] entering GATHER state from 0.
>>
>> I.e. second node detected crash after 3 secs.
>>
>> Is there any way to reduce this amount of time?
>>
> Are you trying to do active call failover or something? How quickly do you need this failure detected? Are you hoping the failover will just be a blip in the audio? There may be a way to monitor the node more aggressively with some sort of ping.. but less that 3 seconds is very aggressive.
>
> I haven't dealt with trying to optimize this to the point you are probably needing. Hopefully someone else has some ideas. I'm sure you have more potential for optimization using the corosync stack though.
>
> -- Vossel
>
>> Thanks in advance for all yours hints.

--

Best regards,
Vitaly



_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


andrew at beekhof

Jul 29, 2012, 5:59 PM

Post #7 of 7 (339 views)
Permalink
Re: Pengine behavior [In reply to]

On Fri, Jul 20, 2012 at 6:39 PM,
<vitaliy.davudov [at] vts24> wrote:
> Hi, David!
>
> Yes, you are right, I'm trying to do active call failover. I hope to get
> down to 3 seconds of silence during a call (it is 5 seconds now). If corosync
> has any directive for monitoring the node more aggressively (every 1 second),
> I'll be very happy.


man corosync.conf has a few. I'm guessing you need to further tune
one or more of these totem options:
token
token_retransmit
hold
token_retransmits_before_loss_const
join
send_join
consensus
merge
downcheck
fail_recv_const
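
As a rough illustration of where these options go, here is a sketch of a totem section. Only token: 500 and downcheck: 500 come from the configuration quoted in this thread. Every other value is an assumption picked for illustration, not a tested recommendation, and the interface block is left out, so this is a fragment rather than a complete file. Check each option against man corosync.conf for the installed corosync version before using it.

```
totem {
    version: 2

    # Taken from the poster's existing configuration (milliseconds).
    token: 500
    downcheck: 500

    # Everything below is an assumed, illustrative value.

    # Token retransmits attempted before a new configuration is formed.
    token_retransmits_before_loss_const: 4

    # How long to wait for join messages in the membership protocol (ms).
    join: 50

    # Upper bound of the random delay before sending a join message (ms).
    send_join: 0

    # How long to wait for consensus before starting a new round of
    # membership configuration (ms). The man page is understood to require
    # at least 1.2 * token, hence 600 for token: 500.
    consensus: 600

    # How long to wait before checking for a partition when no multicast
    # traffic is being sent (ms).
    merge: 200

    # token_retransmit, hold and fail_recv_const are deliberately left
    # unset here; the first two are normally derived from token.
}
```

Lower timeouts should detect a failed node sooner, but they also make a false failure detection more likely on a loaded or lossy network, so any change is worth testing under realistic call load.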



>
>
> 19.07.2012 18:43, David Vossel wrote:
>
>> ----- Original Message -----
>>>
>>> From: "Виталий Давудов" <vitaliy.davudov [at] vts24>
>>> To: "The Pacemaker cluster resource manager"
>>> <pacemaker [at] oss>
>>> Sent: Thursday, July 19, 2012 8:08:12 AM
>>> Subject: Re: [Pacemaker] Pengine behavior
>>>
>>>
>>> Hi!
>>>
>>> I have moved my cluster from heartbeat to corosync.
>>> Here is the corosync.conf content:
>>>
>>> compatibility: whitetank
>>>
>>> totem {
>>>     version: 2
>>>     token: 500
>>>     downcheck: 500
>>>     secauth: off
>>>     threads: 0
>>>     interface {
>>>         ringnumber: 0
>>>         bindnetaddr: 10.10.1.0
>>>         mcastaddr: 226.94.1.1
>>>         mcastport: 5405
>>>     }
>>> }
>>>
>>> logging {
>>>     fileline: off
>>>     to_stderr: no
>>>     to_logfile: yes
>>>     to_syslog: yes
>>>     logfile: /var/log/corosync.log
>>>     debug: on
>>>     timestamp: on
>>>     logger_subsys {
>>>         subsys: AMF
>>>         debug: off
>>>     }
>>> }
>>>
>>> amf {
>>>     mode: disabled
>>> }
>>>
>>> quorum {
>>>     provider: corosync_votequorum
>>>     expected_votes: 1
>>> }
>>>
>>> The Pacemaker configuration is unchanged.
>>>
>>> After the first node crashed, corosync.log shows that monitoring
>>> stopped at 15:53:24 (i.e. the node crashed at 15:53:24):
>>>
>>> Jul 19 15:53:22 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP2:12:
>>> monitor
>>> Jul 19 15:53:22 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP1:10:
>>> monitor
>>> Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP3:14:
>>> monitor
>>> Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:fs:16: monitor
>>> Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: RA output:
>>> (fs:monitor:stdout) OK
>>> Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP2:12:
>>> monitor
>>> Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP1:10:
>>> monitor
>>> Jul 19 15:53:24 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP3:14:
>>> monitor
>>> Jul 19 15:55:00 corosync [MAIN ] Corosync Cluster Engine ('1.2.7'):
>>> started and ready to provide service.
>>> Jul 19 15:55:00 corosync [MAIN ] Corosync built-in features: nss rdma
>>>
>>> On the second node, in corosync.log:
>>>
>>> Jul 19 15:53:27 corosync [TOTEM ] The token was lost in the
>>> OPERATIONAL state.
>>> Jul 19 15:53:27 corosync [TOTEM ] A processor failed, forming new
>>> configuration.
>>> Jul 19 15:53:27 corosync [TOTEM ] Receive multicast socket recv
>>> buffer size (262142 bytes).
>>> Jul 19 15:53:27 corosync [TOTEM ] Transmit multicast socket send
>>> buffer size (262142 bytes).
>>> Jul 19 15:53:27 corosync [TOTEM ] entering GATHER state from 2.
>>> Jul 19 15:53:28 corosync [TOTEM ] entering GATHER state from 0.
>>>
>>> I.e. the second node detected the crash after 3 seconds.
>>>
>>> Is there any way to reduce this amount of time?
>>>
>> Are you trying to do active call failover or something? How quickly do
>> you need this failure detected? Are you hoping the failover will just be a
>> blip in the audio? There may be a way to monitor the node more aggressively
>> with some sort of ping, but less than 3 seconds is very aggressive.
>>
>> I haven't dealt with trying to optimize this to the point you probably
>> need. Hopefully someone else has some ideas. I'm sure you have more
>> potential for optimization using the corosync stack though.
>>
>> -- Vossel
>>
>>> Thanks in advance for all your hints.
>
>
> --
>
> Best regards,
> Vitaly
>
>
>

