Mailing List Archive: Linux-HA: Pacemaker

start/stop operations fail to happen in parallel on resources

 

 


parshvi.17 at gmail

Apr 19, 2012, 4:22 AM

Post #1 of 4
start/stop operations fail to happen in parallel on resources

Observations:
max-children=30 (current setting, raised from the default of 4)
total no. of resources=18

1) With max-children at its default value of 4, the following log messages were
observed, and monitor ops timed out for some resources (18 rscs in total):
a. “max_child_count (4) reached, postponing execution of operation monitor”
b. “WARN: perform_ra_op: the operation operation monitor[18] on
ocf::IPaddr2::ClusterIP for client 3754, stayed in operation list for
14100 ms (longer than 10000 ms)”
c. SOLUTION: max-children of lrmd was raised to 30.
d. ISSUE STILL OBSERVED: while 2-3 resources are stuck in their start operation,
if another rsc is given an explicit start command (`crm resource start rcs1`), the
start op on that rsc is delayed until one of the earlier resources exits
its start operation.
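
To illustrate why a small max-children value can keep operations in the queue past the warning threshold, here is a minimal sketch of a fixed-size pool of child slots. It is a simplified model and not lrmd's actual scheduler: the function name `queue_waits`, the assumption that all 18 operations arrive at once, and the 5000 ms duration per operation are made up for illustration only.

```python
import heapq


def queue_waits(durations_ms, max_children):
    """Return how long each operation waits before a child slot is free.

    Simplified model (an assumption, not lrmd's real logic): every
    operation is submitted at time 0, operations start in list order,
    and at most max_children run at the same time.
    """
    running = []  # min-heap of finish times of operations now running
    waits = []
    for duration in durations_ms:
        if len(running) < max_children:
            start = 0  # a slot is free immediately
        else:
            # Wait until the earliest running operation finishes.
            start = heapq.heappop(running)
        waits.append(start)
        heapq.heappush(running, start + duration)
    return waits


# Hypothetical workload: 18 operations of 5000 ms each.
ops = [5000] * 18
with_default = queue_waits(ops, max_children=4)
with_raised = queue_waits(ops, max_children=30)

# Count operations that would sit in the queue longer than 10000 ms.
late_default = sum(1 for w in with_default if w > 10000)
late_raised = sum(1 for w in with_raised if w > 10000)
print(late_default, late_raised)
```

Under these assumed numbers, some operations queue for longer than 10000 ms when only 4 slots exist, and none queue at all with 30 slots, since every operation gets a slot immediately.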



_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


df.cluster at gmail

Apr 19, 2012, 6:05 AM

Post #2 of 4
Re: start/stop operations fail to happen in parallel on resources

Hi,

On Thu, Apr 19, 2012 at 2:22 PM, Parshvi <parshvi.17 [at] gmail> wrote:
> Observations:
> max-children=30
> total no. of resources=18
>
> 1) At a default value 4 of max-children, following logs were observed
> that led to monitor op’s timeout for some resources (a total of 18 rscs):
>  a. “max_child_count (4) reached, postponing execution of operation monitor”
>  b. “WARN: perform_ra_op: the operation operation monitor[18] on
> ocf::IPaddr2::ClusterIP for client 3754, stayed in operation list for
> 14100 ms (longer than 10000 ms)”
>  c. SOLUTION: the max-children of lrmd was raised to 30.
>  d. ISSUES STILL OBSERVED: while 2-3 resources are stuck in start operation,
> if a rsc is issued an explicit start command `crm resource start rcs1`, then the
> start op on this rsc is delayed until any one of the previous resources exit
> from their start operation.

What version of Pacemaker?




--
Dan Frincu
CCNA, RHCE



dvossel at redhat

Apr 19, 2012, 7:30 AM

Post #3 of 4
Re: start/stop operations fail to happen in parallel on resources

----- Original Message -----
> From: "Parshvi" <parshvi.17 [at] gmail>
> To: pacemaker [at] clusterlabs
> Sent: Thursday, April 19, 2012 6:22:01 AM
> Subject: [Pacemaker] start/stop operations fail to happen in parallel on resources
>
> Observations:
> max-children=30
> total no. of resources=18
>
> 1) At a default value 4 of max-children, following logs were observed
> that led to monitor op’s timeout for some resources (a total of 18
> rscs):
> a. “max_child_count (4) reached, postponing execution of operation
> monitor”
> b. “WARN: perform_ra_op: the operation operation monitor[18] on
> ocf::IPaddr2::ClusterIP for client 3754, stayed in operation list for
> 14100 ms (longer than 10000 ms)”
> c. SOLUTION: the max-children of lrmd was raised to 30.
> d. ISSUES STILL OBSERVED: while 2-3 resources are stuck in start
> operation,
> if a rsc is issued an explicit start command `crm resource start
> rcs1`, then the
> start op on this rsc is delayed until any one of the previous
> resources exit
> from their start operation.
>

This is what I would expect to happen. If an operation is in flight at the same time you make a configuration change, I don't believe the change will be looked at until the operation returns or times out.

-- Vossel



andrew at beekhof

May 9, 2012, 9:15 PM

Post #4 of 4
Re: start/stop operations fail to happen in parallel on resources

On Fri, Apr 20, 2012 at 12:30 AM, David Vossel <dvossel [at] redhat> wrote:
> ----- Original Message -----
>> From: "Parshvi" <parshvi.17 [at] gmail>
>> To: pacemaker [at] clusterlabs
>> Sent: Thursday, April 19, 2012 6:22:01 AM
>> Subject: [Pacemaker] start/stop operations fail to happen in parallel on resources
>>
>> Observations:
>> max-children=30
>> total no. of resources=18
>>
>> 1) At a default value 4 of max-children, following logs were observed
>> that led to monitor op’s timeout for some resources (a total of 18
>> rscs):
>> a. “max_child_count (4) reached, postponing execution of operation
>> monitor”
>> b. “WARN: perform_ra_op: the operation operation monitor[18] on
>> ocf::IPaddr2::ClusterIP for client 3754, stayed in operation list for
>> 14100 ms (longer than 10000 ms)”
>> c. SOLUTION: the max-children of lrmd was raised to 30.
>> d. ISSUES STILL OBSERVED: while 2-3 resources are stuck in start
>> operation,
>> if a rsc is issued an explicit start command `crm resource start
>> rcs1`, then the
>> start op on this rsc is delayed until any one of the previous
>> resources exit
>> from their start operation.
>>
>
> This is what I would expect to happen. If an operation is in flight at the same time you make a configuration change, I don't believe the change will be looked at until the operation returns or times out.

Correct. We wait for any in-flight operations to complete but do not
initiate any more.
You can also set batch-limit to prevent pacemaker from sending "too
many" operations to the lrmd in the first place, but setting
max-children to 30 on a decent machine doesn't seem unreasonable.
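
As a rough illustration of the ordering described above (in-flight operations are waited for, nothing further is initiated, and only then is the change acted on), here is a toy model. It is not Pacemaker code: the function `handle_config_change`, the event labels, and the resource names rscA, rscB and rscC are hypothetical.

```python
def handle_config_change(in_flight, pending, new_request):
    """Toy model of the ordering described in this thread.

    in_flight:   operations already running when the change arrives
    pending:     operations planned but not yet initiated
    new_request: the operation implied by the configuration change
    Returns the sequence of events as (label, value) tuples.
    """
    events = []
    # No further operations are initiated once the change arrives.
    events.append(("not_initiated", list(pending)))
    # Each in-flight operation must return or time out first.
    for op in in_flight:
        events.append(("wait_for", op))
    # Only after that is the change looked at.
    events.append(("initiate", new_request))
    return events


events = handle_config_change(
    in_flight=["start rscA", "start rscB"],  # hypothetical stuck starts
    pending=["start rscC"],
    new_request="start rcs1",
)
for label, value in events:
    print(label, value)
```

In this model the explicit `start rcs1` request always comes last, which matches the delay reported in point d of the original post.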

