Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Linux-HA: Pacemaker

Critical: Monitor operation of IPaddr2 timing out, taking more than 60s. Fails to recover.

 

 

Linux-HA pacemaker RSS feed   Index | Next | Previous | View Threaded


parshvi.17 at gmail

Aug 8, 2012, 10:14 PM

Post #1 of 9 (450 views)
Permalink
Critical: Monitor operation of IPaddr2 timing out, taking more than 60s. Fails to recover.

Hi,

The monitor operation of IPaddr2 rsc agent is timing out.
Interval: 5s
Timeout: 60s
The timeout was increased from an earlier 20s to now 60s. Even then, there are
multiple logs of monitor op. timing out.

1) What can cause the monitor to take so long ?
2) Looking at the pe-input, what contributes to the operation time ? Is it just
the exec-time or exec-time + queue-time ?
3) Any solution proposed ?

I have lrm pe-input when the timeout was configured at 20s:
Here, is pe-input snapshot where monitor op. timed out (with timeout=20s)

<lrm_resource id="Group_1_ClusterIP" type="IPaddr2" class="ocf"
provider="heartbeat">
<lrm_rsc_op id="Group_1_ClusterIP_monitor_0" operation="monitor"
crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1" transition-
key="28:0:7:6b445452-980a-455f-8616-7bd12f20843e" transition-
magic="0:7;28:0:7:6b445452-980a-455f-8616-7bd12f20843e" call-id="10" rc-code="7"
op-status="0" interval="0" last-run="1343738096" last-rc-change="1343738096"
exec-time="20" queue-time="30" op-digest="f22a042c86b227078b239707d4e4d4a2"/>

<lrm_rsc_op id="Group_1_ClusterIP_start_0" operation="start" crm-
debug-origin="do_update_resource" crm_feature_set="3.0.1" transition-
key="87:27957:0:6b445452-980a-455f-8616-7bd12f20843e" transition-
magic="0:0;87:27957:0:6b445452-980a-455f-8616-7bd12f20843e" call-id="83503" rc-
code="0" op-status="0" interval="0" last-run="1343928908" last-rc-
change="1343928908" exec-time="280" queue-time="20" op-
digest="f22a042c86b227078b239707d4e4d4a2"/>

<lrm_rsc_op id="Group_1_ClusterIP_monitor_5000" operation="monitor"
crm-debug-origin="do_update_resource" crm_feature_set="3.0.1" transition-
key="12:27957:0:6b445452-980a-455f-8616-7bd12f20843e" transition-
magic="2:-2;12:27957:0:6b445452-980a-455f-8616-7bd12f20843e" call-id="83504" rc-
code="-2" op-status="2" interval="5000" last-rc-change="1343928921" exec-
time="20000" queue-time="0" op-digest="79c3bdd01c6e0fd819484536a54bf7a2"/>
(Please note exec-time=20000)

<lrm_rsc_op id="Group_1_ClusterIP_stop_0" operation="stop" crm-
debug-origin="do_update_resource" crm_feature_set="3.0.1" transition-
key="13:27957:0:6b445452-980a-455f-8616-7bd12f20843e" transition-
magic="0:0;13:27957:0:6b445452-980a-455f-8616-7bd12f20843e" call-id="83497" rc-
code="0" op-status="0" interval="0" last-run="1343928906" last-rc-
change="1343928906" exec-time="1190" queue-time="30" op-
digest="f22a042c86b227078b239707d4e4d4a2"/>
</lrm_resource>


Please tell me if any other input is required. I would appreciate any early
help/solution.

Thanks,
Parshvi



_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


parshvi.17 at gmail

Aug 8, 2012, 10:44 PM

Post #2 of 9 (423 views)
Permalink
Re: Critical: Monitor operation of IPaddr2 timing out, taking more than 60s. Fails to recover. [In reply to]

Parshvi <parshvi.17@...> writes:

>
> Hi,
>
> The monitor operation of IPaddr2 rsc agent is timing out.
> Interval: 5s
> Timeout: 60s
> The timeout was increased from an earlier 20s to now 60s. Even then, there are
> multiple logs of monitor op. timing out.
>
> 1) What can cause the monitor to take so long ?
> 2) Looking at the pe-input, what contributes to the operation time ? Is it
just
> the exec-time or exec-time + queue-time ?
> 3) Any solution proposed ?
>
> I have lrm pe-input when the timeout was configured at 20s:
> Here, is pe-input snapshot where monitor op. timed out (with timeout=20s)
>
> <lrm_resource id="Group_1_ClusterIP" type="IPaddr2" class="ocf"
> provider="heartbeat">
> <lrm_rsc_op id="Group_1_ClusterIP_monitor_0" operation="monitor"
> crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1" transition-
> key="28:0:7:6b445452-980a-455f-8616-7bd12f20843e" transition-
> magic="0:7;28:0:7:6b445452-980a-455f-8616-7bd12f20843e" call-id="10" rc-
code="7"
> op-status="0" interval="0" last-run="1343738096" last-rc-change="1343738096"
> exec-time="20" queue-time="30" op-digest="f22a042c86b227078b239707d4e4d4a2"/>
>
> <lrm_rsc_op id="Group_1_ClusterIP_start_0" operation="start" crm-
> debug-origin="do_update_resource" crm_feature_set="3.0.1" transition-
> key="87:27957:0:6b445452-980a-455f-8616-7bd12f20843e" transition-
> magic="0:0;87:27957:0:6b445452-980a-455f-8616-7bd12f20843e" call-id="83503"
rc-
> code="0" op-status="0" interval="0" last-run="1343928908" last-rc-
> change="1343928908" exec-time="280" queue-time="20" op-
> digest="f22a042c86b227078b239707d4e4d4a2"/>
>
> <lrm_rsc_op id="Group_1_ClusterIP_monitor_5000"
operation="monitor"
> crm-debug-origin="do_update_resource" crm_feature_set="3.0.1" transition-
> key="12:27957:0:6b445452-980a-455f-8616-7bd12f20843e" transition-
> magic="2:-2;12:27957:0:6b445452-980a-455f-8616-7bd12f20843e" call-id="83504"
rc-
> code="-2" op-status="2" interval="5000" last-rc-change="1343928921" exec-
> time="20000" queue-time="0" op-digest="79c3bdd01c6e0fd819484536a54bf7a2"/>
> (Please note exec-time=20000)
>

Following are the details of packages:
cluster-glue: 1.0.6 (1c87a0c58c59fc384b93ec11476cefdbb6ddc1e1)
resource-agents: # Build version: 7a11934b142d1daf42a04fbaa0391a3ac47cee4c
CRM Version: 1.0.12 (unknown)
pacemaker 1.0.12-1.el5.centos - (none) x86_64
corosync 1.2.7-1.1.el5 - (none) x86_64
resource-agents 1.0.4-1.1.el5 - (none) x86_64

There are 4 virtual IP resources configued:

Out of these, 3 recover after a monitor timeout but one Virtual IP rsc does not
recover. Following are the logs that are observed:

Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: run_graph: Transition 63579
(Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=8,
Source=/var/lib/pengine/pe-input-1660.bz2): Terminated
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: ERROR: te_graph_trigger: Transition
failed: terminated
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Graph 63579 (9
actions in 9 synapses): batch-limit=30 jobs, network-delay=60000ms
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 0 was
confirmed (priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 1 is pending
(priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 8]:
Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 103]:
Pending (id: Rsc2_stop_0, loc: CSS-FU-1, priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 2 is pending
(priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 97]:
Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 8]:
Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 3 is pending
(priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 98]:
Pending (id: Rsc1_GroupClusterIP_monitor_1000, loc: CSS-FU-2, priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 97]:
Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 4 is pending
(priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 99]:
Pending (id: Rsc3_start_0, loc: CSS-FU-2, priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 97]:
Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 5 is pending
(priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 100]:
Pending (id: Rsc3_monitor_1000, loc: CSS-FU-2, priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 99]:
Pending (id: Rsc3_start_0, loc: CSS-FU-2, priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 6 is pending
(priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 101]:
Pending (id: Rsc4_start_0, loc: CSS-FU-2, priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 97]:
Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 7 is pending
(priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 102]:
Pending (id: Rsc4_monitor_1000, loc: CSS-FU-2, priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 101]:
Pending (id: Rsc4_start_0, loc: CSS-FU-2, priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 8 is pending
(priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 36]:
Pending (id: all_stopped, type: pseduo, priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 8]:
Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: te_graph_trigger: Transition 63579
is now complete
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: notify_crmd: Transition 63579
status: done - <null>
Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]

1) Please help as to why a monitor is timing out ?
2) Why does one of the VIP's fails to recover after a timeout ?




_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


mario.penners at gmail

Aug 9, 2012, 6:52 AM

Post #3 of 9 (416 views)
Permalink
Re: Critical: Monitor operation of IPaddr2 timing out, taking more than 60s. Fails to recover. [In reply to]

Hi Parshvi,

just a quick-shot and without analyzing your mail in detail: find
attached an edited version of the IPaddr2 RA.

I was trying to use the original script a while agho, and basically
nothing worked: It did not recognize the link failures (due to the way
how the test was implemented it would only work if you have not more
than 1 IP per interface), there was no proper support for bonding, the
IP addresses would not be shifted ....


I did some (very minor) changes to ge the script working for us. Just
have a shot at it if you want, maybe replacing the RA will already solve
your problem.

Cheers,
Mario


On Thu, 2012-08-09 at 05:44 +0000, Parshvi wrote:
> Parshvi <parshvi.17@...> writes:
>
> >
> > Hi,
> >
> > The monitor operation of IPaddr2 rsc agent is timing out.
> > Interval: 5s
> > Timeout: 60s
> > The timeout was increased from an earlier 20s to now 60s. Even then, there are
> > multiple logs of monitor op. timing out.
> >
> > 1) What can cause the monitor to take so long ?
> > 2) Looking at the pe-input, what contributes to the operation time ? Is it
> just
> > the exec-time or exec-time + queue-time ?
> > 3) Any solution proposed ?
> >
> > I have lrm pe-input when the timeout was configured at 20s:
> > Here, is pe-input snapshot where monitor op. timed out (with timeout=20s)
> >
> > <lrm_resource id="Group_1_ClusterIP" type="IPaddr2" class="ocf"
> > provider="heartbeat">
> > <lrm_rsc_op id="Group_1_ClusterIP_monitor_0" operation="monitor"
> > crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1" transition-
> > key="28:0:7:6b445452-980a-455f-8616-7bd12f20843e" transition-
> > magic="0:7;28:0:7:6b445452-980a-455f-8616-7bd12f20843e" call-id="10" rc-
> code="7"
> > op-status="0" interval="0" last-run="1343738096" last-rc-change="1343738096"
> > exec-time="20" queue-time="30" op-digest="f22a042c86b227078b239707d4e4d4a2"/>
> >
> > <lrm_rsc_op id="Group_1_ClusterIP_start_0" operation="start" crm-
> > debug-origin="do_update_resource" crm_feature_set="3.0.1" transition-
> > key="87:27957:0:6b445452-980a-455f-8616-7bd12f20843e" transition-
> > magic="0:0;87:27957:0:6b445452-980a-455f-8616-7bd12f20843e" call-id="83503"
> rc-
> > code="0" op-status="0" interval="0" last-run="1343928908" last-rc-
> > change="1343928908" exec-time="280" queue-time="20" op-
> > digest="f22a042c86b227078b239707d4e4d4a2"/>
> >
> > <lrm_rsc_op id="Group_1_ClusterIP_monitor_5000"
> operation="monitor"
> > crm-debug-origin="do_update_resource" crm_feature_set="3.0.1" transition-
> > key="12:27957:0:6b445452-980a-455f-8616-7bd12f20843e" transition-
> > magic="2:-2;12:27957:0:6b445452-980a-455f-8616-7bd12f20843e" call-id="83504"
> rc-
> > code="-2" op-status="2" interval="5000" last-rc-change="1343928921" exec-
> > time="20000" queue-time="0" op-digest="79c3bdd01c6e0fd819484536a54bf7a2"/>
> > (Please note exec-time=20000)
> >
>
> Following are the details of packages:
> cluster-glue: 1.0.6 (1c87a0c58c59fc384b93ec11476cefdbb6ddc1e1)
> resource-agents: # Build version: 7a11934b142d1daf42a04fbaa0391a3ac47cee4c
> CRM Version: 1.0.12 (unknown)
> pacemaker 1.0.12-1.el5.centos - (none) x86_64
> corosync 1.2.7-1.1.el5 - (none) x86_64
> resource-agents 1.0.4-1.1.el5 - (none) x86_64
>
> There are 4 virtual IP resources configued:
>
> Out of these, 3 recover after a monitor timeout but one Virtual IP rsc does not
> recover. Following are the logs that are observed:
>
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: run_graph: Transition 63579
> (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=8,
> Source=/var/lib/pengine/pe-input-1660.bz2): Terminated
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: ERROR: te_graph_trigger: Transition
> failed: terminated
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Graph 63579 (9
> actions in 9 synapses): batch-limit=30 jobs, network-delay=60000ms
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 0 was
> confirmed (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 1 is pending
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 8]:
> Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 103]:
> Pending (id: Rsc2_stop_0, loc: CSS-FU-1, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 2 is pending
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 97]:
> Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 8]:
> Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 3 is pending
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 98]:
> Pending (id: Rsc1_GroupClusterIP_monitor_1000, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 97]:
> Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 4 is pending
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 99]:
> Pending (id: Rsc3_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 97]:
> Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 5 is pending
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 100]:
> Pending (id: Rsc3_monitor_1000, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 99]:
> Pending (id: Rsc3_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 6 is pending
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 101]:
> Pending (id: Rsc4_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 97]:
> Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 7 is pending
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 102]:
> Pending (id: Rsc4_monitor_1000, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 101]:
> Pending (id: Rsc4_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 8 is pending
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 36]:
> Pending (id: all_stopped, type: pseduo, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 8]:
> Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: te_graph_trigger: Transition 63579
> is now complete
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: notify_crmd: Transition 63579
> status: done - <null>
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
>
> 1) Please help as to why a monitor is timing out ?
> 2) Why does one of the VIP's fails to recover after a timeout ?
>
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
Attachments: IPaddr2a (25.0 KB)


parshvi.17 at gmail

Aug 9, 2012, 10:44 PM

Post #4 of 9 (417 views)
Permalink
Re: Critical: Monitor operation of IPaddr2 timing out, taking more than 60s. Fails to recover. [In reply to]

T
Mario Penners <mario.penners@...> writes:

>
> Hi Parshvi,
>
> just a quick-shot and without analyzing your mail in detail: find
> attached an edited version of the IPaddr2 RA.
>
> I was trying to use the original script a while agho, and basically
> nothing worked: It did not recognize the link failures (due to the way
> how the test was implemented it would only work if you have not more
> than 1 IP per interface), there was no proper support for bonding, the
> IP addresses would not be shifted ....
>
> I did some (very minor) changes to ge the script working for us. Just
> have a shot at it if you want, maybe replacing the RA will already solve
> your problem.
>
> Cheers,
> Mario
>
> On Thu, 2012-08-09 at 05:44 +0000, Parshvi wrote:
> > Parshvi <parshvi.17@...> writes:
> >
> > >
> > > Hi,
> > >
> > > The monitor operation of IPaddr2 rsc agent is timing out.
> > > Interval: 5s
> > > Timeout: 60s
> > > The timeout was increased from an earlier 20s to now 60s. Even then, there
are
> > > multiple logs of monitor op. timing out.
> > >
> > > 1) What can cause the monitor to take so long ?
> > > 2) Looking at the pe-input, what contributes to the operation time ? Is it
> > just
> > > the exec-time or exec-time + queue-time ?
> > > 3) Any solution proposed ?
> > >
>

Thanks Mario for your input.

The are some more findings:
1) The monitor is not timing out in all environments. I have been through some
of the forum mails, and came across people talking about "heavy load on the
system" wrt the timeout issue.
2) Could somebody explain, what exactly are we referring to when we say "heavy
load" ? Also, how does it affect an operations execution ?
3) THE OPERATION MONITOR IS TIMING OUT ON OTHER RESOURCES TOO( ALONG WITH
IPADDR2).
4) None of these operations were timing out in a local environment.

I added some logging in IPaddr2 resource agent script.
In func. ip_monitor(),I have printed the date at enter monitor and at exit
monitor func.
This is what I observed for :
interval=5s
timeout=60s

enter monitor Thu Aug 9 06:26:28 AST 2012
exit monitor Thu Aug 9 06:26:28 AST 2012

enter monitor Thu Aug 9 06:26:36 AST 2012
exit monitor Thu Aug 9 06:26:36 AST 2012

[The next monitor was invoked after 71 seconds]

enter monitor Thu Aug 9 06:27:47 AST 2012
exit monitor Thu Aug 9 06:27:47 AST 2012

enter monitor Thu Aug 9 06:27:52 AST 2012
exit monitor Thu Aug 9 06:27:52 AST 2012



_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


dejanmm at fastmail

Aug 13, 2012, 8:56 AM

Post #5 of 9 (393 views)
Permalink
Re: Critical: Monitor operation of IPaddr2 timing out, taking more than 60s. Fails to recover. [In reply to]

Hi Mario,

On Thu, Aug 09, 2012 at 10:52:39AM -0300, Mario Penners wrote:
> Hi Parshvi,
>
> just a quick-shot and without analyzing your mail in detail: find
> attached an edited version of the IPaddr2 RA.
>
> I was trying to use the original script a while agho, and basically
> nothing worked: It did not recognize the link failures (due to the way
> how the test was implemented it would only work if you have not more
> than 1 IP per interface), there was no proper support for bonding, the
> IP addresses would not be shifted ....
>
>
> I did some (very minor) changes to ge the script working for us. Just
> have a shot at it if you want, maybe replacing the RA will already solve
> your problem.

Do you think you have anything to contribute? There's currently
an effort going on to update the IPaddr2:
https://github.com/ClusterLabs/resource-agents/pull/97.

According to the comments at the top of your script, it would be
good to have your patches on board too. But your RA seems to be
based on an outdated version of IPaddr2.

Cheers,

Dejan

> Cheers,
> Mario
>
>
> On Thu, 2012-08-09 at 05:44 +0000, Parshvi wrote:
> > Parshvi <parshvi.17@...> writes:
> >
> > >
> > > Hi,
> > >
> > > The monitor operation of IPaddr2 rsc agent is timing out.
> > > Interval: 5s
> > > Timeout: 60s
> > > The timeout was increased from an earlier 20s to now 60s. Even then, there are
> > > multiple logs of monitor op. timing out.
> > >
> > > 1) What can cause the monitor to take so long ?
> > > 2) Looking at the pe-input, what contributes to the operation time ? Is it
> > just
> > > the exec-time or exec-time + queue-time ?
> > > 3) Any solution proposed ?
> > >
> > > I have lrm pe-input when the timeout was configured at 20s:
> > > Here, is pe-input snapshot where monitor op. timed out (with timeout=20s)
> > >
> > > <lrm_resource id="Group_1_ClusterIP" type="IPaddr2" class="ocf"
> > > provider="heartbeat">
> > > <lrm_rsc_op id="Group_1_ClusterIP_monitor_0" operation="monitor"
> > > crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1" transition-
> > > key="28:0:7:6b445452-980a-455f-8616-7bd12f20843e" transition-
> > > magic="0:7;28:0:7:6b445452-980a-455f-8616-7bd12f20843e" call-id="10" rc-
> > code="7"
> > > op-status="0" interval="0" last-run="1343738096" last-rc-change="1343738096"
> > > exec-time="20" queue-time="30" op-digest="f22a042c86b227078b239707d4e4d4a2"/>
> > >
> > > <lrm_rsc_op id="Group_1_ClusterIP_start_0" operation="start" crm-
> > > debug-origin="do_update_resource" crm_feature_set="3.0.1" transition-
> > > key="87:27957:0:6b445452-980a-455f-8616-7bd12f20843e" transition-
> > > magic="0:0;87:27957:0:6b445452-980a-455f-8616-7bd12f20843e" call-id="83503"
> > rc-
> > > code="0" op-status="0" interval="0" last-run="1343928908" last-rc-
> > > change="1343928908" exec-time="280" queue-time="20" op-
> > > digest="f22a042c86b227078b239707d4e4d4a2"/>
> > >
> > > <lrm_rsc_op id="Group_1_ClusterIP_monitor_5000"
> > operation="monitor"
> > > crm-debug-origin="do_update_resource" crm_feature_set="3.0.1" transition-
> > > key="12:27957:0:6b445452-980a-455f-8616-7bd12f20843e" transition-
> > > magic="2:-2;12:27957:0:6b445452-980a-455f-8616-7bd12f20843e" call-id="83504"
> > rc-
> > > code="-2" op-status="2" interval="5000" last-rc-change="1343928921" exec-
> > > time="20000" queue-time="0" op-digest="79c3bdd01c6e0fd819484536a54bf7a2"/>
> > > (Please note exec-time=20000)
> > >
> >
> > Following are the details of packages:
> > cluster-glue: 1.0.6 (1c87a0c58c59fc384b93ec11476cefdbb6ddc1e1)
> > resource-agents: # Build version: 7a11934b142d1daf42a04fbaa0391a3ac47cee4c
> > CRM Version: 1.0.12 (unknown)
> > pacemaker 1.0.12-1.el5.centos - (none) x86_64
> > corosync 1.2.7-1.1.el5 - (none) x86_64
> > resource-agents 1.0.4-1.1.el5 - (none) x86_64
> >
> > There are 4 virtual IP resources configued:
> >
> > Out of these, 3 recover after a monitor timeout but one Virtual IP rsc does not
> > recover. Following are the logs that are observed:
> >
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: run_graph: Transition 63579
> > (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=8,
> > Source=/var/lib/pengine/pe-input-1660.bz2): Terminated
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: ERROR: te_graph_trigger: Transition
> > failed: terminated
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Graph 63579 (9
> > actions in 9 synapses): batch-limit=30 jobs, network-delay=60000ms
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 0 was
> > confirmed (priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 1 is pending
> > (priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 8]:
> > Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 103]:
> > Pending (id: Rsc2_stop_0, loc: CSS-FU-1, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 2 is pending
> > (priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 97]:
> > Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 8]:
> > Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 3 is pending
> > (priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 98]:
> > Pending (id: Rsc1_GroupClusterIP_monitor_1000, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 97]:
> > Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 4 is pending
> > (priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 99]:
> > Pending (id: Rsc3_start_0, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 97]:
> > Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 5 is pending
> > (priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 100]:
> > Pending (id: Rsc3_monitor_1000, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 99]:
> > Pending (id: Rsc3_start_0, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 6 is pending
> > (priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 101]:
> > Pending (id: Rsc4_start_0, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 97]:
> > Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 7 is pending
> > (priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 102]:
> > Pending (id: Rsc4_monitor_1000, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 101]:
> > Pending (id: Rsc4_start_0, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 8 is pending
> > (priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 36]:
> > Pending (id: all_stopped, type: pseduo, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 8]:
> > Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: te_graph_trigger: Transition 63579
> > is now complete
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: notify_crmd: Transition 63579
> > status: done - <null>
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: do_state_transition: State
> > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> > cause=C_FSA_INTERNAL origin=notify_crmd ]
> >
> > 1) Please help as to why a monitor is timing out ?
> > 2) Why does one of the VIP's fails to recover after a timeout ?
> >
> >
> >
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker [at] oss
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
>


> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


dejanmm at fastmail

Aug 13, 2012, 9:03 AM

Post #6 of 9 (403 views)
Permalink
Re: Critical: Monitor operation of IPaddr2 timing out, taking more than 60s. Fails to recover. [In reply to]

Hi,

On Fri, Aug 10, 2012 at 05:44:47AM +0000, Parshvi wrote:
> T
> Mario Penners <mario.penners@...> writes:
>
> >
> > Hi Parshvi,
> >
> > just a quick-shot and without analyzing your mail in detail: find
> > attached an edited version of the IPaddr2 RA.
> >
> > I was trying to use the original script a while agho, and basically
> > nothing worked: It did not recognize the link failures (due to the way
> > how the test was implemented it would only work if you have not more
> > than 1 IP per interface), there was no proper support for bonding, the
> > IP addresses would not be shifted ....
> >
> > I did some (very minor) changes to ge the script working for us. Just
> > have a shot at it if you want, maybe replacing the RA will already solve
> > your problem.
> >
> > Cheers,
> > Mario
> >
> > On Thu, 2012-08-09 at 05:44 +0000, Parshvi wrote:
> > > Parshvi <parshvi.17@...> writes:
> > >
> > > >
> > > > Hi,
> > > >
> > > > The monitor operation of IPaddr2 rsc agent is timing out.
> > > > Interval: 5s
> > > > Timeout: 60s
> > > > The timeout was increased from an earlier 20s to now 60s. Even then, there
> are
> > > > multiple logs of monitor op. timing out.
> > > >
> > > > 1) What can cause the monitor to take so long ?
> > > > 2) Looking at the pe-input, what contributes to the operation time ? Is it
> > > just
> > > > the exec-time or exec-time + queue-time ?
> > > > 3) Any solution proposed ?
> > > >
> >
>
> Thanks Mario for your input.
>
> The are some more findings:
> 1) The monitor is not timing out in all environments. I have been through some
> of the forum mails, and came across people talking about "heavy load on the
> system" wrt the timeout issue.
> 2) Could somebody explain, what exactly are we referring to when we say "heavy
> load" ? Also, how does it affect an operations execution ?

Heavy load, as in many processes contending for system
resources such as CPU or disk.

> 3) THE OPERATION MONITOR IS TIMING OUT ON OTHER RESOURCES TOO( ALONG WITH
> IPADDR2).

That seems to indicate that indeed there's a load which your
computer cannot sustain. BTW, why uppercase?

> 4) None of these operations were timing out in a local environment.
>
> I added some logging in IPaddr2 resource agent script.
> In func. ip_monitor(),I have printed the date at enter monitor and at exit
> monitor func.
> This is what I observed for :
> interval=5s
> timeout=60s
>
> enter monitor Thu Aug 9 06:26:28 AST 2012
> exit monitor Thu Aug 9 06:26:28 AST 2012
>
> enter monitor Thu Aug 9 06:26:36 AST 2012
> exit monitor Thu Aug 9 06:26:36 AST 2012
>
> [The next monitor was invoked after 71 seconds]
>
> enter monitor Thu Aug 9 06:27:47 AST 2012
> exit monitor Thu Aug 9 06:27:47 AST 2012
>
> enter monitor Thu Aug 9 06:27:52 AST 2012
> exit monitor Thu Aug 9 06:27:52 AST 2012

There's also code preventing more than n (by default 4)
operations running in parallel on a single node. That could be
one explanation of larger intervals between monitors.

Thanks,

Dejan

> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


jsmith at argotec

Aug 13, 2012, 9:52 AM

Post #7 of 9 (399 views)
Permalink
Re: IPaddr2 RA update - clusterIP -> cluster-match? [In reply to]

Dejan,

Regarding your comment to another post:

> Do you think you have anything to contribute? There's currently
> an effort going on to update the IPaddr2:
> https://github.com/ClusterLabs/resource-agents/pull/97.

Is there any plans to update it to use cluster-match instead of clusterip for clones? I read through briefly but didn't see any mention...

I'm not up to the task unfortunately but would like to see improvements in that area!

Jake

_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


dejanmm at fastmail

Aug 14, 2012, 3:08 AM

Post #8 of 9 (397 views)
Permalink
Re: IPaddr2 RA update - clusterIP -> cluster-match? [In reply to]

Hi,

On Mon, Aug 13, 2012 at 12:52:41PM -0400, Jake Smith wrote:
> Dejan,
>
> Regarding your comment to another post:
>
> > Do you think you have anything to contribute? There's currently
> > an effort going on to update the IPaddr2:
> > https://github.com/ClusterLabs/resource-agents/pull/97.
>
> Is there any plans to update it to use cluster-match instead of clusterip for clones? I read through briefly but didn't see any mention...
>
> I'm not up to the task unfortunately but would like to see improvements in that area!

I didn't hear anything about it. But you can add a proposal at
github and somebody may be interested enough to implement it.

Cheers,

Dejan

> Jake
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


parshvi.17 at gmail

Aug 15, 2012, 10:31 PM

Post #9 of 9 (382 views)
Permalink
Re: Critical: Monitor operation of IPaddr2 timing out, taking more than 60s. Fails to recover. [In reply to]

Dejan Muhamedagic <dejanmm@...> writes:

Thanks Dejan for your inputs.

> > The are some more findings:
> > 1) The monitor is not timing out in all environments. I have been through
some
> > of the forum mails, and came across people talking about "heavy load on the
> > system" wrt the timeout issue.
> > 2) Could somebody explain, what exactly are we referring to when we say
"heavy
> > load" ? Also, how does it affect an operations execution ?
>
> Heavy load, as in many processes contending for system
> resources such as CPU or disk.

The server has 8 processors (4 cores). Analyzing the io stats, there isn't any
contention on CPU. As for the disk, there appears to be high io wait at times.
But this is on the shared storage, on local disk the no. of read/writes/sec is
very low.
>
> > 3) THE OPERATION MONITOR IS TIMING OUT ON OTHER RESOURCES TOO( ALONG WITH
> > IPADDR2).
>
> That seems to indicate that indeed there's a load which your
> computer cannot sustain. BTW, why uppercase?

Since, there appears to be sustainable load, could you suggest some solution ?
I considered the issue(point 3) critical, hence highlighted with upper case.

>
> > 4) None of these operations were timing out in a local environment.
> >
> > I added some logging in IPaddr2 resource agent script.
> > In func. ip_monitor(),I have printed the date at enter monitor and at exit
> > monitor func.
> > This is what I observed for :
> > interval=5s
> > timeout=60s
> >
> > enter monitor Thu Aug 9 06:26:28 AST 2012
> > exit monitor Thu Aug 9 06:26:28 AST 2012
> >
> > enter monitor Thu Aug 9 06:26:36 AST 2012
> > exit monitor Thu Aug 9 06:26:36 AST 2012
> >
> > [The next monitor was invoked after 71 seconds]
> >
> > enter monitor Thu Aug 9 06:27:47 AST 2012
> > exit monitor Thu Aug 9 06:27:47 AST 2012
> >
> > enter monitor Thu Aug 9 06:27:52 AST 2012
> > exit monitor Thu Aug 9 06:27:52 AST 2012
>
> There's also code preventing more than n (by default 4)
> operations running in parallel on a single node. That could be
> one explanation of larger intervals between monitors.

I suppose the parameter you're referring to is max-children of lrm.
We have set that to 34 in our case, since we were hitting max-child count. The
number of resources configured in pacemaker is 19.

>
> Thanks,
>
> Dejan






_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Linux-HA pacemaker RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.