
Mailing List Archive: Linux-HA: Pacemaker

prevent the resource's start if it has "stop NG" history on the other node

 

 



tsukishima.ha at gmail

Feb 28, 2012, 11:32 PM

Post #1 of 12 (620 views)
prevent the resource's start if it has "stop NG" history on the other node

Hi,

I'm running the following simple configuration with Pacemaker 1.1.6,
and testing the case "resource stop fails (stop NG), then Pacemaker is shut down".

property \
    no-quorum-policy="ignore" \
    stonith-enabled="false" \
    crmd-transition-delay="2s"

rsc_defaults \
    resource-stickiness="INFINITY" \
    migration-threshold="1"

primitive dummy01 ocf:heartbeat:Dummy-stop-NG \
    op start   timeout="60s" interval="0s" on-fail="restart" \
    op monitor timeout="60s" interval="7s" on-fail="restart" \
    op stop    timeout="60s" interval="0s" on-fail="block"


The "Dummy-stop-NG" RA is the Dummy RA modified so that its stop action always reports a failure ("stop NG") to Pacemaker.

# diff -urNp Dummy Dummy-stop-NG
--- Dummy       2011-06-30 17:43:37.000000000 +0900
+++ Dummy-stop-NG       2012-02-28 19:11:12.850207767 +0900
@@ -108,6 +108,8 @@ dummy_start() {
 }

 dummy_stop() {
+    exit $OCF_ERR_GENERIC
+
     dummy_monitor
     if [ $? =  $OCF_SUCCESS ]; then
        rm ${OCF_RESKEY_state}
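
A minimal stand-alone sketch of what this change does to the agent, assuming nothing beyond the standard OCF return codes (0 = success, 1 = generic error, 7 = not running). The state file path is hypothetical and stands in for ${OCF_RESKEY_state}, and the stop function uses "return" instead of "exit" so that it can be called from an interactive shell. This is not the real Dummy RA.

```shell
#!/bin/sh
# Standard OCF return codes used by resource agents.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical state file, standing in for ${OCF_RESKEY_state}.
STATE_FILE="${TMPDIR:-/tmp}/dummy-stop-ng-sketch.$$"

# "Start" the resource by creating the state file.
dummy_start() {
    touch "$STATE_FILE"
    return $OCF_SUCCESS
}

# The resource counts as running while the state file exists.
dummy_monitor() {
    [ -f "$STATE_FILE" ] && return $OCF_SUCCESS
    return $OCF_NOT_RUNNING
}

# Like the patched dummy_stop: fail before doing any cleanup.
# (The patch uses "exit"; "return" is used here for interactive use.)
dummy_stop() {
    return $OCF_ERR_GENERIC
}
```

After dummy_start, a call to dummy_stop returns 1 and dummy_monitor still reports the resource as running, which is the situation that on-fail="block" reacts to in the configuration above.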



Before the test, the resource is running on "bl460g6a".

# crm_simulate -S -x pe-input-1.bz2

Current cluster status:
Online: [ bl460g6a bl460g6b ]

dummy01 (ocf::heartbeat:Dummy-stop-NG): Stopped

Transition Summary:
crm_simulate[14195]: 2012/02/29_15:46:57 notice: LogActions: Start dummy01 (bl460g6a)

Executing cluster transition:
* Executing action 6: dummy01_monitor_0 on bl460g6b
* Executing action 4: dummy01_monitor_0 on bl460g6a
* Executing action 7: dummy01_start_0 on bl460g6a
* Executing action 8: dummy01_monitor_7000 on bl460g6a

Revised cluster status:
Online: [ bl460g6a bl460g6b ]

dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a



Stop Pacemaker on "bl460g6a".
# service heartbeat stop

Pacemaker first tries to stop the resource and move it to "bl460g6b",
# crm_simulate -S -x pe-input-2.bz2

Current cluster status:
Online: [ bl460g6a bl460g6b ]

dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a

Transition Summary:
crm_simulate[12195]: 2012/02/29_15:35:02 notice: LogActions: Move dummy01 (Started bl460g6a -> bl460g6b)

Executing cluster transition:
* Executing action 6: dummy01_stop_0 on bl460g6a
* Executing action 7: dummy01_start_0 on bl460g6b
* Executing action 8: dummy01_monitor_7000 on bl460g6b

Revised cluster status:
Online: [ bl460g6a bl460g6b ]

dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6b



but this stop action fails, so the resource goes into the unmanaged state.
# crm_simulate -S -x pe-input-3.bz2

Current cluster status:
Online: [ bl460g6a bl460g6b ]

dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a (unmanaged) FAILED

Transition Summary:

Executing cluster transition:

Revised cluster status:
Online: [ bl460g6a bl460g6b ]

dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a (unmanaged) FAILED



The Pacemaker shutdown on "bl460g6a" then completes successfully;
it seems that the following patch works as intended.
https://github.com/ClusterLabs/pacemaker/commit/07976fe5eb04c432f1d1c9aebb1b1587ba7f0bcf#pengine/graph.c

At this point, the resource on "bl460g6a" (where Pacemaker has already shut down)
might still be running, because its stop failed.
In fact, the resource did not start on "bl460g6b" after the stop NG and
the shutdown of "bl460g6a", which is the expected behavior,
but I could start it on "bl460g6b" with the crm command.
This creates the potential for an unexpected active/active state.
Is it possible to prevent the start in this situation?
For example:
(1) Dummy runs on node-a
(2) Shut down Pacemaker on node-a, and Dummy's stop fails (stop NG)
(3) Dummy cannot run on other nodes
(4) * Clean up the unmanaged status of Dummy after checking it manually
on node-a
(5) * Start Dummy on other nodes
This would be the safe way.
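
Steps (4) and (5) could be wrapped in a small operator helper, sketched below under stated assumptions. The function names and the DRY_RUN variable are hypothetical, the "yes" argument stands for a manual check that Pacemaker cannot make on its own, and only the crm shell subcommands "resource cleanup" and "resource start" are used. Pacemaker itself does not enforce step (3); this only guards the manual path.

```shell
#!/bin/sh
# DRY_RUN=1 (the default in this sketch) only prints the crm commands.
DRY_RUN="${DRY_RUN:-1}"

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# recover_after_stop_failure RESOURCE CONFIRMED
# CONFIRMED must be "yes", given only after an operator has verified by
# hand that RESOURCE is really stopped on the node where the stop failed.
recover_after_stop_failure() {
    rsc="$1"
    confirmed="$2"
    if [ "$confirmed" != "yes" ]; then
        echo "refusing: $rsc is not confirmed stopped on the failed node" >&2
        return 1
    fi
    # Step (4): clear the failed/unmanaged status.
    # Step (5): allow the resource to start again.
    run crm resource cleanup "$rsc" &&
    run crm resource start "$rsc"
}
```

For example, "recover_after_stop_failure dummy01 yes" prints the two crm commands in dry-run mode, while any other second argument makes the helper refuse and return 1.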

See attached hb_report.

Thanks,
Junko IKEDA

NTT DATA INTELLILINK CORPORATION
Attachments: hb_report.tar.bz2 (57.3 KB)


tsukishima.ha at gmail

Feb 29, 2012, 1:08 AM

Post #2 of 12 (607 views)
Re: prevent the resource's start if it has "stop NG" history on the other node [In reply to]

Hi,

Some additional information:
(1) The resource is running on the DC
(2) Shut down Pacemaker on the DC, and the resource goes into stop NG (unmanaged)
(3) The other node becomes the DC
(4) The resource starts on the new DC
(even though this resource has unmanaged status on the old DC...)

See the other attached hb_report.

By the way, does this patch mean that
the "Pacemaker shutdown" operation succeeds
even if there are some unmanaged resources?

High: PE: Bug lf#1959 - Fail unmanaged resources should not prevent
other services from shutting down
https://github.com/ClusterLabs/pacemaker/commit/07976fe5eb04c432f1d1c9aebb1b1587ba7f0bcf#pengine/graph.c

I don't know the details of lf#1959, and it would be better to set up
STONITH to handle an unmanaged resource whose "stop" failed,
but I would expect a stop NG not to permit Pacemaker to shut itself down, just in case.

Thanks,
Junko

Attachments: hb_report_dc.tar.bz2 (64.5 KB)


tsukishima.ha at gmail

Feb 29, 2012, 1:30 AM

Post #3 of 12 (607 views)
Re: prevent the resource's start if it has "stop NG" history on the other node [In reply to]

Hi,

sorry again,
I checked the latest code, and it says,

    } else if (wrapper->action->rsc
               && wrapper->action->rsc != action->rsc
               && is_set(wrapper->action->rsc->flags, pe_rsc_failed)
               && is_not_set(wrapper->action->rsc->flags, pe_rsc_managed)
               && strstr(wrapper->action->uuid, "_stop_0")
               && action->rsc && action->rsc->variant >= pe_clone) {
        crm_warn("Ignoring requirement that %s comeplete before %s:"
                 " unmanaged failed resources cannot prevent clone shutdown",
                 wrapper->action->uuid, action->uuid);
        return FALSE;

It seems that lf#1959 addresses an issue with clone resources.
The behavior I posted is a different case.

In the current specification, does a "stop NG" prevent Pacemaker shutdown?

Thanks,
Junko


_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


andrew at beekhof

Feb 29, 2012, 6:33 PM

Post #4 of 12 (600 views)
Re: prevent the resource's start if it has "stop NG" history on the other node [In reply to]

On Wed, Feb 29, 2012 at 6:32 PM, Junko IKEDA <tsukishima.ha [at] gmail> wrote:
> At this time, the resource on "bl460g6a" (pacemaker already shutdowns)
> might be running because it fails to stop.

This is because we ignore the status section of any offline nodes when
stonith-enabled=false.

> In fact, the resource didn't start on "bl460g6b" after its stop NG and
> "bl460g6a"'s shutdown, and this is an expectable behavior,
> but I could start it on "bl460g6b" with crm command.
> This holds the potential for the unexpected active/active status.
> Is it possible to prevent it's start in this situation?

Only by disabling the logic in
https://github.com/ClusterLabs/pacemaker/commit/07976fe5eb04c432f1d1c9aebb1b1587ba7f0bcf#pengine/graph.c
when stonith is disabled.



tsukishima.ha at gmail

Mar 1, 2012, 10:07 PM

Post #5 of 12 (602 views)
Re: prevent the resource's start if it has "stop NG" history on the other node [In reply to]

Hi,

OK, we have to set up STONITH to handle this.
By the way, I ran the same test with a group resource.

crm configuration;

property \
    no-quorum-policy="ignore" \
    stonith-enabled="false" \
    crmd-transition-delay="2s" \
    cluster-recheck-interval="60s"

rsc_defaults \
    resource-stickiness="INFINITY" \
    migration-threshold="1"

primitive dummy01 ocf:heartbeat:Dummy \
    op start   timeout="60s" interval="0s" on-fail="restart" \
    op monitor timeout="60s" interval="7s" on-fail="restart" \
    op stop    timeout="60s" interval="0s" on-fail="block"

primitive dummy02 ocf:heartbeat:Dummy-stop-NG \
    op start   timeout="60s" interval="0s" on-fail="restart" \
    op monitor timeout="60s" interval="7s" on-fail="restart" \
    op stop    timeout="60s" interval="0s" on-fail="block"

group dummy-g dummy01 dummy02


In this case, dummy02's stop fails (stop NG).
dummy02 goes into unmanaged status,
and after that, the Pacemaker shutdown hangs;
it seems that Pacemaker is waiting for the unmanaged
resource to be cleaned up.
If dummy01's stop fails instead, the Pacemaker shutdown works well.
See the attached hb_report.

Thanks,
Junko

Attachments: crm_simulate.txt (4.84 KB)
  hb_report.tar.bz2 (68.2 KB)


andrew at beekhof

Mar 4, 2012, 5:35 PM

Post #6 of 12 (572 views)
Re: prevent the resource's start if it has "stop NG" history on the other node [In reply to]

On Fri, Mar 2, 2012 at 5:07 PM, Junko IKEDA <tsukishima.ha [at] gmail> wrote:
> in this case, dummy02 calls stop NG.
> dummy02 goes to unmanaged status,
> and after that, Pacemaker shutdown is freezing,

On the one hand the admin is saying "always stop A before B", but then
also asking for "stop B" while preventing "stop A".
So the admin is making incompatible demands. Which one do you want us to ignore?
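
The conflict can be illustrated with a toy model of group stop ordering, sketched below. It assumes only that group members are stopped in the reverse of their listed order and that a failed stop with on-fail="block" leaves every member still to be stopped waiting. The function stop_group is hypothetical and does not reproduce the policy engine's logic; member names must not contain spaces.

```shell
#!/bin/sh
# stop_group FAILING_MEMBER MEMBER...
# MEMBERs are given in group (start) order; stops run in reverse order.
# Prints one line per member describing what happens to it.
stop_group() {
    failing="$1"
    shift
    # Build the stop order by reversing the member list.
    reversed=""
    for m in "$@"; do
        reversed="$m $reversed"
    done
    blocked=""
    for m in $reversed; do
        if [ -n "$blocked" ]; then
            # An earlier stop is blocked, so this stop is never issued.
            echo "$m: waiting (stop of $blocked must complete first)"
        elif [ "$m" = "$failing" ]; then
            echo "$m: stop failed, blocked (unmanaged)"
            blocked="$m"
        else
            echo "$m: stopped"
        fi
    done
}
```

With "stop_group dummy02 dummy01 dummy02", dummy02 is blocked and dummy01 is left waiting, which matches the reported hang. With "stop_group dummy01 dummy01 dummy02", dummy02 stops cleanly and only dummy01 is blocked, so nothing else is left waiting on it.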

> it seems that Pacemaker is waiting some clear operations for unmanaged
> resources.
> if dummy01 calls stop NG, Pacemaker shutdown works well.
> see attached hb_report.
>
> Thanks,
> Junko
>
> 2012/3/1 Andrew Beekhof <andrew [at] beekhof>:
>> On Wed, Feb 29, 2012 at 6:32 PM, Junko IKEDA <tsukishima.ha [at] gmail> wrote:
>>> Hi,
>>>
>>> I'm running the following simple configuration with Pacemaker 1.1.6,
>>> and try the test case, "resource stop NG and shutdown Pacemaker".
>>>
>>> property \
>>>    no-quorum-policy="ignore" \
>>>    stonith-enabled="false" \
>>>    crmd-transition-delay="2s"
>>>
>>> rsc_defaults \
>>>    resource-stickiness="INFINITY" \
>>>    migration-threshold="1"
>>>
>>> primitive dummy01 ocf:heartbeat:Dummy-stop-NG \
>>>    op start   timeout="60s" interval="0s"  on-fail="restart" \
>>>    op monitor timeout="60s" interval="7s"  on-fail="restart" \
>>>    op stop    timeout="60s" interval="0s"  on-fail="block"
>>>
>>>
>>> "Dummy-stop-NG" RA just sends "stop NG" to Pacemaker.
>>>
>>> # diff -urNp Dummy Dummy-stop-NG
>>> --- Dummy       2011-06-30 17:43:37.000000000 +0900
>>> +++ Dummy-stop-NG       2012-02-28 19:11:12.850207767 +0900
>>> @@ -108,6 +108,8 @@ dummy_start() {
>>>  }
>>>
>>>  dummy_stop() {
>>> +    exit $OCF_ERR_GENERIC
>>> +
>>>     dummy_monitor
>>>     if [ $? =  $OCF_SUCCESS ]; then
>>>        rm ${OCF_RESKEY_state}
>>>
>>>
>>>
>>> Before the test, the resource is running on "bl460g6a".
>>>
>>> # crm_simulate -S -x pe-input-1.bz2
>>>
>>> Current cluster status:
>>> Online: [ bl460g6a bl460g6b ]
>>>
>>>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Stopped
>>>
>>> Transition Summary:
>>> crm_simulate[14195]: 2012/02/29_15:46:57 notice: LogActions: Start
>>> dummy01    (bl460g6a)
>>>
>>> Executing cluster transition:
>>>  * Executing action 6: dummy01_monitor_0 on bl460g6b
>>>  * Executing action 4: dummy01_monitor_0 on bl460g6a
>>>  * Executing action 7: dummy01_start_0 on bl460g6a
>>>  * Executing action 8: dummy01_monitor_7000 on bl460g6a
>>>
>>> Revised cluster status:
>>> Online: [ bl460g6a bl460g6b ]
>>>
>>>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>>>
>>>
>>>
>>> Stop Pacemaker on "bl460g6a".
>>> # service heartbeat stop
>>>
>>> Pacemaker tries to stop resouce and move it to "bl460g6b" at first,
>>> # crm_simulate -S -x pe-input-2.bz2
>>>
>>> Current cluster status:
>>> Online: [ bl460g6a bl460g6b ]
>>>
>>>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>>>
>>> Transition Summary:
>>> crm_simulate[12195]: 2012/02/29_15:35:02 notice: LogActions: Move
>>> dummy01    (Started bl460g6a -> bl460g6b)
>>>
>>> Executing cluster transition:
>>>  * Executing action 6: dummy01_stop_0 on bl460g6a
>>>  * Executing action 7: dummy01_start_0 on bl460g6b
>>>  * Executing action 8: dummy01_monitor_7000 on bl460g6b
>>>
>>> Revised cluster status:
>>> Online: [ bl460g6a bl460g6b ]
>>>
>>>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6b
>>>
>>>
>>>
>>> but this action will fail, it means the resource goes into unmanaged state.
>>> # crm_simulate -S -x pe-input-3.bz2
>>>
>>> Current cluster status:
>>> Online: [ bl460g6a bl460g6b ]
>>>
>>>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>>> (unmanaged) FAILED
>>>
>>> Transition Summary:
>>>
>>> Executing cluster transition:
>>>
>>> Revised cluster status:
>>> Online: [ bl460g6a bl460g6b ]
>>>
>>>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a (unmanaged) FAILED
>>>
>>>
>>>
>>> Pacemaker shutdown on "bl460g6a" succeeds,
>>> so it seems that the following patch works well.
>>> https://github.com/ClusterLabs/pacemaker/commit/07976fe5eb04c432f1d1c9aebb1b1587ba7f0bcf#pengine/graph.c
>>>
>>> At this point, the resource on "bl460g6a" (where Pacemaker has already
>>> shut down) might still be running because it failed to stop.
>>
>> This is because we ignore the status section of any offline nodes when
>> stonith-enabled=false.
>>
>>> In fact, the resource didn't start on "bl460g6b" after its stop NG and
>>> the shutdown of "bl460g6a", which is the expected behavior,
>>> but I could still start it on "bl460g6b" with the crm command.
>>> This could lead to an unexpected active/active state.
>>> Is it possible to prevent its start in this situation?
>>
>> Only by disabling the logic in
>>   https://github.com/ClusterLabs/pacemaker/commit/07976fe5eb04c432f1d1c9aebb1b1587ba7f0bcf#pengine/graph.c
>> when stonith is disabled.
>>
>>> for example,
>>> (1) Dummy runs on node-a
>>> (2) Shutdown Pacemaker on node-a, and Dummy stop NG
>>> (3) Dummy can not run on other nodes
>>> (4) * clean up the unmanaged status of Dummy after manually checking
>>> its state on node-a
>>> (5) * start Dummy on other nodes
>>> This would be the safe way.
>>>
>>> See attached hb_report.
>>>
>>> Thanks,
>>> Junko IKEDA
>>>
>>> NTT DATA INTELLILINK CORPORATION
>>>

_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


tsukishima.ha at gmail

Mar 5, 2012, 1:45 AM

Post #7 of 12 (557 views)
Re: prevent the resource's start if it has "stop NG" history on the other node [In reply to]

Hi,

> On the one hand the admin is saying "always stop A before B", but then
> also asking for "stop B" while preventing "stop A".
> So the admin is making incompatible demands, which one do you want us to ignore?

It seems that the current Pacemaker shuts down in spite of
unmanaged resources when "stonith-enabled=false",
so a single unmanaged resource in a group should also be ignored during
the shutdown process.
By the way, I tried a Master/Slave setup that contained the above group
resource without stonith ("stonith-enabled=false");
Pacemaker shutdown went well, and the Master resource's fail-over was also
successful.
So the behavior of the simple group above (it prevents shutdown) is the
odd one out.

Actually, it would be desirable to "prevent Pacemaker shutdown" if there
are unmanaged resources,
but has this behavior been changed?
# I found an old changelog entry that says "High: crmd: Bug LF1837 -
Unmanaged resources prevent crmd from shutting down"

* Wed Apr 23 2008 Andrew Beekhof <abeekhof [at] suse> - 0.6.3-1
- Update source tarball to revision: fd8904c9bc67 tip
- Statistics:
   Changesets:      117
   Diff:            354 files changed, 19094 insertions(+), 11338 deletions(-)
- Changes since Pacemaker-0.6.2
 + High: Admin: Bug LF:1848 - crm_resource - Pass set name and id to delete_resource_attr() in the correct order
 + High: Build: SNMP has been moved to the management/pygui project
 + High: crmd: Bug LF1837 - Unmanaged resources prevent crmd from shutting down


Thanks,
Junko



andrew at beekhof

Mar 5, 2012, 2:01 AM

Post #8 of 12 (563 views)
Re: prevent the resource's start if it has "stop NG" history on the other node [In reply to]

On Mon, Mar 5, 2012 at 8:45 PM, Junko IKEDA <tsukishima.ha [at] gmail> wrote:
> Hi,
>
>> On the one hand the admin is saying "always stop A before B", but then
>> also asking for "stop B" while preventing "stop A".
>> So the admin is making incompatible demands, which one do you want us to ignore?
>
> It seems that the current Pacemaker does shutdown in spite of
> unmanaged resources when "stonith-enabled=false",
> so one unmanaged resource in group should be ignored during shutdown process.
> By the way, I tried Master/Slave setup which contained the above group
> resource without stonith ("stonith-enabled=false"),
> Pacemaker shutdown went well and Master resource's fail-over was also
> successful.
> The above simple group behavior(prevents "shutdown") is peculiarity.
>
> Actually, it's desirable to "prevent Pacemaker shutdown" if there are
> unmanaged resource,

What about unmanaged /and/ failed?

> but this behavior has been changed?
> # I found an old changelog, it said "High: crmd: Bug LF1837 -
> Unmanaged resources prevent crmd from shutting down"

That's unrelated to this, actually.

>
> * Wed Apr 23 2008 Andrew Beekhof <abeekhof [at] suse> - 0.6.3-1
> - Update source tarball to revision: fd8904c9bc67 tip
> - Statistics:
>    Changesets:      117
>    Diff:            354 files changed, 19094 insertions(+), 11338 deletions(-)
> - Changes since Pacemaker-0.6.2
>  + High: Admin: Bug LF:1848 - crm_resource - Pass set name and id to
> delete_resource_attr() in the correct order
>  + High: Build: SNMP has been moved to the management/pygui project
>  + High: crmd: Bug LF1837 - Unmanaged resources prevent crmd from shutting down
>
>
> Thanks,
> Junko


tsukishima.ha at gmail

Mar 5, 2012, 2:40 AM

Post #9 of 12 (560 views)
Re: prevent the resource's start if it has "stop NG" history on the other node [In reply to]

>> Actually, it's desirable to "prevent Pacemaker shutdown" if there are
>> unmanaged resource,
>
> What about unmanaged /and/ failed?

To be more specific, if a resource has failed its stop operation and
gone into the unmanaged state because of its on-fail="block"
configuration,
it would be better to prevent Pacemaker from shutting down.
Is it difficult to distinguish "unmanaged (stop failure)" from
"unmanaged (operational)"?
Well, stonith-enabled="false" should not be configured in the first place...
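
To illustrate the distinction I am asking about, here is a rough sketch that
classifies status lines in the format crm_simulate prints earlier in this
thread. It assumes each resource is reported on a single line, and the function
name classify_unmanaged is made up for this example; it is not part of
Pacemaker. Note that FAILED marks any failure, not only a stop failure, so this
is only an approximation.

```shell
#!/bin/sh
# Hypothetical helper, not part of Pacemaker.
# Reads status lines on stdin and labels each unmanaged resource by
# whether it is also marked FAILED. Assumes one resource per line, e.g.
#   dummy01 (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a (unmanaged) FAILED
classify_unmanaged() {
    while read -r name rest; do
        case "$rest" in
            *"(unmanaged)"*FAILED*) echo "$name: unmanaged, failed" ;;
            *"(unmanaged)"*)        echo "$name: unmanaged only" ;;
        esac
    done
}
```

Lines that do not mention "(unmanaged)" (managed resources, the "Online:" line,
and so on) are skipped silently.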

Thanks,
Junko



andrew at beekhof

Mar 6, 2012, 1:59 AM

Post #10 of 12 (555 views)
Re: prevent the resource's start if it has "stop NG" history on the other node [In reply to]

On Mon, Mar 5, 2012 at 9:40 PM, Junko IKEDA <tsukishima.ha [at] gmail> wrote:
>>> Actually, it's desirable to "prevent Pacemaker shutdown" if there are
>>> unmanaged resource,
>>
>> What about unmanaged /and/ failed?
>
> To be more specific, if there remains the resource which failed to
> stop operation and went to the unmanaged status with its
> on-fail="block" configuration,
> it would be better to prevent Pacemaker from its shutdown.

I tend to agree.
What I'm working on at the moment is correctly marking dependent
actions as unrunnable and providing some reasonable feedback to users
when the situation occurs.

So from (one of) your examples, pe-input-3.bz2 now emits:

warning: stage8: Cannot shut down node 'bl460g6a' because of dummy02: unmanaged, failed
warning: stage8: Cannot shut down node 'bl460g6a' because of dummy01: blocked

Does that help?

I also need to correctly distinguish between your case and "I want
Pacemaker to exit and leave the services running".
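
If you want to pick those warnings out of a log, something like the following
might do as a starting point. It is only a sketch: it assumes each warning
arrives on one line in exactly the wording shown above, and shutdown_blockers
is a made-up name, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper, not part of Pacemaker.
# Reads log text on stdin and prints "<node> <resource> <reason>" for every
# "Cannot shut down node" warning, assuming the message wording shown above.
shutdown_blockers() {
    sed -n "s/.*Cannot shut down node '\([^']*\)' because of \([^:]*\): \(.*\)\$/\1 \2 \3/p"
}
```

Any line that does not match the warning text is dropped, so the rest of the
log can be piped through unchanged.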

> Is it difficult to discriminate between "unmanaged(stop failure)" and
> "unmanaged(operational)"?
> well, stonith-enabled="false" should not be configured...
>
> Thanks,
> Junko


tsukishima.ha at gmail

Mar 11, 2012, 10:31 PM

Post #11 of 12 (499 views)
Re: prevent the resource's start if it has "stop NG" history on the other node [In reply to]

Hi,

I tried the latest code, and the following commit worked well for our case too.
It means that an "unmanaged, stop failed" resource prevents Pacemaker's
shutdown, great :)
https://github.com/ClusterLabs/pacemaker/commit/8d2f237dc5f900381adf62a6e949ec71d1ee54e5

But I found another case, with a "Master/Slave" configuration.

1) initial status;
node-a = Master
node-b = Slave

2) node-a -> shutdown Pacemaker
3) node-a -> failed to stop Master during the shutdown process

4) final status;
node-a = unmanaged "Slave"
node-b = Master

In this case, the "demote" operation succeeds, so it makes sense to
promote node-b to Master.
I think the drbd RA can handle this situation,
but the pgsql RA (replication mode) will end up in an unexpected dual master...
This may be related to the following issue.
http://www.gossamer-threads.com/lists/linuxha/pacemaker/78644
I'll ask the pgsql people about this again for now.

Many thanks,
Junko



andrew at beekhof

Mar 12, 2012, 8:14 PM

Post #12 of 12 (492 views)
Re: prevent the resource's start if it has "stop NG" history on the other node [In reply to]

On Mon, Mar 12, 2012 at 4:31 PM, Junko IKEDA <tsukishima.ha [at] gmail> wrote:
> Hi,
>
> I tried the latest code and the following commit worked well for our case, too.
> It means "unmanaged, stop failed resource" prevents Pacemaker's
> shutdown, great :)
> https://github.com/ClusterLabs/pacemaker/commit/8d2f237dc5f900381adf62a6e949ec71d1ee54e5
>
> but, I got the other case, it's "Master/Slave" configuration.
>
> 1) initial status;
> node-a = Master
> node-b = Slave
>
> 2) node-a -> shutdown Pacemaker
> 3) node-a -> failed to stop Master while the shutdown process
>
> 4) final status;
> node-a = unmanaged "Slave"
> node-b = Master
>
> In this case, "demote" operation goes well, so it makes sense to
> promote node-b as Master.

Is it though? In the general case.
With one instance misbehaving, in an unknown state and potentially
doing anything, I would not have thought this was a good time to
promote another instance.

In any case, the 'block' applies to the whole resource, not just the
failed instance.

Even if I wanted to, I don't think we could change the behaviour in this case.
All instances that need to be demoted or stopped must be, before any
other instances can be started or promoted.
Bypassing that premise would probably cause the implementation to fall
apart :-)
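
As a toy model of that premise (not the actual policy engine code, and with a
made-up input format of "<instance> <action> <state>" where state is "done" or
"blocked"), the rule looks roughly like this:

```shell
#!/bin/sh
# Toy model only, not Pacemaker code.
# Reads planned actions on stdin, one per line: "<instance> <action> <state>".
# Prints "hold" if any stop or demote is not done, otherwise "proceed",
# meaning starts and promotes elsewhere may go ahead.
can_start_or_promote() {
    verdict=proceed
    while read -r instance action state; do
        case "$action" in
            stop|demote)
                [ "$state" = "done" ] || verdict=hold ;;
        esac
    done
    echo "$verdict"
}
```

For the scenario above, the demote on node-a is done but its stop is blocked,
so the model says "hold" and node-b is not promoted.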

> I think drbd RA can handle this situation,
> but pgsql RA (replication mode) will go into the unexpected dual master...
> this can be related to the following issue.
> http://www.gossamer-threads.com/lists/linuxha/pacemaker/78644
> I'll ask pgsql people this again for now.
>
> Many thanks,
> Junko
