Mailing List Archive: Linux-HA: Users

Pacemaker : Pb on stop on a resource while the monitoring is performed

 

 



alain.moulle at bull

Sep 1, 2011, 5:00 AM

Post #1 of 3 (254 views)
Pacemaker : Pb on stop on a resource while the monitoring is performed

Hi

My release is pacemaker-1.1.2-7 (on RHEL6), and I have checked that the patch
High: PE: Bug lf#2433 - No services should be stopped until probes finish
is indeed included in this release.

Nevertheless, I seem to hit a similar problem from time to time, for any
primitive: a primitive under Pacemaker is flagged "failed" on one node
while it is already started on the other node. A simple cleanup on the
group erases the failure and all is fine again. It happens within, let's
say, two hours when I run a loop (a robustness test) that migrates the
group (which includes the primitive) from one node to the other and back,
with a delay of 300s between each migration.

If I compare the logs (syslog) generated by the scenario when all is fine
and when I get the error, the first error I find is:
node1 daemon info lrmd [38904]: info: flush_op: process for operation
monitor[2973] on ocf:<provider>:<scriptname>::<primitive name> for client
38907 still running, flush delayed
node1 daemon debug crmd [38907]: debug: cancel_op: Op 2973 for
<primitive-name> (<primitive-name>:2973): cancelled

It seems that Pacemaker applies the stop to the primitive running on node1
at the very moment a monitor operation is checking the primitive, so the
cancellation of the monitor is delayed. The stop of the primitive completes
and the primitive starts on node2. After 20 seconds, the monitor operation
runs again on node1; it fails and is reported as an error on node1.
Therefore no further switch to node1 is possible unless a manual crm
cleanup of the primitive is executed.

Thanks for your ideas on this problem.
Alain



_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


andrew at beekhof

Sep 22, 2011, 8:19 PM

Post #2 of 3 (215 views)
Re: Pacemaker : Pb on stop on a resource while the monitoring is performed [In reply to]

On Thu, Sep 1, 2011 at 10:00 PM, <alain.moulle [at] bull> wrote:
> Hi
>
> My release is pacemaker-1.1.2-7 (on RHEL6), and I have checked that the patch
> High: PE: Bug lf#2433 - No services should be stopped until probes finish
> is indeed included in this release.
>
> Nevertheless, I seem to hit a similar problem from time to time, for any
> primitive: a primitive under Pacemaker is flagged "failed" on one node
> while it is already started on the other node. A simple cleanup on the
> group erases the failure and all is fine again. It happens within, let's
> say, two hours when I run a loop (a robustness test) that migrates the
> group (which includes the primitive) from one node to the other and back,
> with a delay of 300s between each migration.
>
> If I compare the logs (syslog) generated by the scenario when all is fine
> and when I get the error, the first error I find is:
> node1 daemon info lrmd [38904]: info: flush_op: process for operation
> monitor[2973] on ocf:<provider>:<scriptname>::<primitive name> for client
> 38907 still running, flush delayed
> node1 daemon debug crmd [38907]: debug: cancel_op: Op 2973 for
> <primitive-name> (<primitive-name>:2973): cancelled
>
> It seems that Pacemaker applies the stop to the primitive running on node1
> at the very moment a monitor operation is checking the primitive, so the
> cancellation of the monitor is delayed. The stop of the primitive completes
> and the primitive starts on node2. After 20 seconds, the monitor operation
> runs again on node1; it fails and is reported as an error on node1.
> Therefore no further switch to node1 is possible unless a manual crm
> cleanup of the primitive is executed.
>
> Thanks for your ideas on this problem.

Sounds like a bug in the lrmd to me. I'd say file a bug, but the bug
tracker is still down after the LF got hacked a few weeks back :-(




andrew at beekhof

Dec 8, 2011, 1:38 PM

Post #3 of 3 (158 views)
Re: Pacemaker : Pb on stop on a resource while the monitoring is performed [In reply to]

On Tue, Nov 29, 2011 at 12:54 AM, <alain.moulle [at] bull> wrote:
> Hi
>
> I still have this problem.
> Just a small question: when this occurs, i.e. a monitor operation is
> running just as a crm command such as a migration is requested on the
> resource, why not simply return SUCCESS, so that the next monitor of this
> resource is executed on the right node? That seems better than delaying
> the monitor by a few seconds, by which time the resource is obviously no
> longer on that node.

The monitor op shouldn't be executed at all.
We cancel them before initiating the migration:

    /* stop the monitor before stopping the resource */
    if (crm_str_eq(operation, CRMD_ACTION_STOP, TRUE)
        || crm_str_eq(operation, CRMD_ACTION_DEMOTE, TRUE)
        || crm_str_eq(operation, CRMD_ACTION_PROMOTE, TRUE)
        || crm_str_eq(operation, CRMD_ACTION_MIGRATE, TRUE)) {
        g_hash_table_foreach_remove(pending_ops,
                                    stop_recurring_action_by_rsc, rsc);
    }
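
For readers without the crmd source at hand, the ordering described above (drop a resource's recurring monitors before a stop, demote, promote or migrate) can be modelled in plain C. This is only a rough sketch under stated assumptions: `struct pending_op`, `cancels_monitors`, `cancel_recurring` and `prepare_action` are hypothetical names, the action strings are illustrative stand-ins for the `CRMD_ACTION_*` macros, and a flat array replaces the GLib hash table. It does not model the case reported in this thread, where the monitor process is already running in the lrmd and its cancellation is delayed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one pending operation; not Pacemaker's real struct. */
struct pending_op {
    int call_id;         /* operation id, e.g. 2973 in the log above */
    const char *rsc_id;  /* resource the operation belongs to */
    int interval_ms;     /* > 0 means the operation is recurring */
};

/* True for actions that should cancel recurring monitors first.
 * The strings are illustrative, not the real CRMD_ACTION_* values. */
bool cancels_monitors(const char *operation)
{
    static const char *const actions[] = { "stop", "demote", "promote", "migrate" };

    assert(operation != NULL);
    for (size_t i = 0; i < sizeof(actions) / sizeof(actions[0]); i++) {
        if (strcmp(operation, actions[i]) == 0) {
            return true;
        }
    }
    return false;
}

/* Remove every recurring op of rsc_id from ops[0..*count), compacting the
 * array in place. Returns the number of entries removed. */
size_t cancel_recurring(struct pending_op *ops, size_t *count, const char *rsc_id)
{
    size_t kept = 0;

    assert(ops != NULL && count != NULL && rsc_id != NULL);
    for (size_t i = 0; i < *count; i++) {
        bool drop = ops[i].interval_ms > 0 && strcmp(ops[i].rsc_id, rsc_id) == 0;
        if (!drop) {
            ops[kept++] = ops[i];
        }
    }

    size_t removed = *count - kept;
    *count = kept;
    return removed;
}

/* Cancel the resource's recurring ops only when the requested action calls
 * for it, mirroring the ordering in the snippet above. */
size_t prepare_action(struct pending_op *ops, size_t *count,
                      const char *rsc_id, const char *operation)
{
    if (!cancels_monitors(operation)) {
        return 0;
    }
    return cancel_recurring(ops, count, rsc_id);
}
```

In this model a "stop" on a resource removes only that resource's recurring entries; one-shot operations and other resources' monitors stay in the table.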
