
dg at doodle
Apr 8, 2012, 6:03 AM
Post #14 of 18
Re: ocf:heartbeat:apache resource agent and timeouts
On 05.04.2012 17:14, Dejan Muhamedagic wrote:
> Hmm, the process running the monitor operation should be removed
> (killed) by lrmd on timeout. If that doesn't happen, then you
> just hit a jackpot bug!

Ok, that's crucial information I've been missing, and thus I
misinterpreted my test results. Back to square one...

TEST 1: *Unpatched* Apache resource agent with this configuration:

    root@node:/etc/ha.d# crm configure show
    node $id="aa9dea56-ae1e-42a9-a37b-f7c9f5dc5860" node1
    node $id="aec6cf09-e141-415d-8957-a7b94e09df7f" node2
    primitive apache ocf:heartbeat:apache \
            params statusurl="http://localhost/server-status" \
            op monitor interval="15s" timeout="5s" \
            meta is-managed="false"
    clone apacheClone apache
    property $id="cib-bootstrap-options" \
            dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
            cluster-infrastructure="Heartbeat" \
            stonith-enabled="false" \
            no-quorum-policy="ignore" \
            last-lrm-refresh="1333886776"

crm_mon shows

    Clone Set: apacheClone [apache]
        apache:0 (ocf::heartbeat:apache): Started node2 (unmanaged)
        apache:1 (ocf::heartbeat:apache): Started node1 (unmanaged)

Thus all is well. Now I do

    $ iptables -I INPUT -p tcp --dport 80 -i lo -j DROP

After a few seconds, crm_mon shows

    Clone Set: apacheClone [apache]
        apache:0 (ocf::heartbeat:apache): Started node2 (unmanaged)
        apache:1 (ocf::heartbeat:apache): Started node1 (unmanaged) FAILED

    Failed actions:
        apache:0_monitor_15000 (node=node1, call=9, rc=-2, status=Timed Out): unknown exec error

Using ps aux, I can see that the monitor and wget are started every
15s, run up to the timeout, and are then killed, just as you said.
So far so good.

Now I remove the iptables rule:

    $ iptables -F

But no matter how long I wait, Pacemaker *doesn't* notice that Apache
is back! Even though the monitor is definitely executed (I can see the
request in Apache's log file).

Also, crm_mon keeps saying

    Failed actions:
        apache:0_monitor_15000 (node=node1, call=9, rc=-2, status=Timed Out): unknown exec error

The counters don't change (!).

If I manually do

    $ crm resource cleanup apacheClone

then everything is fine again.

TEST 2: *Patched* Apache resource agent with the same configuration.

    root@node:/usr/lib/ocf/resource.d/heartbeat# diff apache apache.orig
    66c66
    < WGETOPTS="-O- -q -L --no-proxy -T 3 -t 1 --bind-address=127.0.0.1"
    ---
    > WGETOPTS="-O- -q -L --no-proxy --bind-address=127.0.0.1"

So all I did was add two options (-T 3 -t 1) to wget's command line.

Again, crm_mon shows that all is well. Again I do

    $ iptables -I INPUT -p tcp --dport 80 -i lo -j DROP

Now crm_mon shows

    Clone Set: apacheClone [apache]
        apache:0 (ocf::heartbeat:apache): Started node2 (unmanaged)
        apache:1 (ocf::heartbeat:apache): Started node1 (unmanaged) FAILED

    Failed actions:
        apache:0_monitor_15000 (node=node1, call=13, rc=1, status=complete): unknown error

NOTE: The "Failed actions" are different from the test before!

Now I remove the iptables rule:

    $ iptables -F

After a few seconds, the clone set is back to working state.

Thus, what I'm seeing here: It does make a difference to Pacemaker
whether the monitor operation returns failure or times out.

Monitor times out:
* apache:0_monitor_15000 (node=node1, call=9, rc=-2, status=Timed Out): unknown exec error
* Monitor operation and wget both get killed when the timeout happens
  (just as they should)
* Monitor operation keeps getting executed (and presumably returns
  success), but this is ignored (!) by Pacemaker

Monitor returns failure (due to wget's timeout):
* apache:0_monitor_15000 (node=node1, call=13, rc=1, status=complete): unknown error
* Monitor operation and wget don't need to be killed, because wget
  times out and the monitor completes before the whole monitor
  operation times out
* Monitor operation keeps getting executed, and on first success
  Pacemaker notices and puts apache back into working state

The big question here is: Is this a bug in Pacemaker or by design?

> Hmm, I though we were past this... and I still don't see the
> patch :)

I'm still not sure what the actual problem is. Currently I feel like
it's a bug in Pacemaker, and my "fix" for the apache resource agent is
just fighting symptoms.

Sorry for the confusion - this Heartbeat/Pacemaker thing is very hard
to understand.

Best regards,
David
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
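
For reference, the difference between the two tests comes down to whether wget gives up on its own before the 5s monitor timeout. Below is a minimal shell sketch of that check, assuming whole-second values. The function names fits_op_timeout and probe_status are hypothetical and not part of the apache resource agent. The estimate is rough: wget's -T applies separately to the DNS, connect, and read phases, so tries times timeout is only an approximation of the worst case. The rc values in the messages are the ones observed in the tests above, not a general guarantee.

```shell
#!/bin/sh
# Hypothetical helper, not part of the apache resource agent.
# Returns 0 if tries * per-try timeout stays strictly below the
# operation timeout (all values in whole seconds).
fits_op_timeout() {
    per_try=$1
    tries=$2
    op_timeout=$3
    [ $((per_try * tries)) -lt "$op_timeout" ]
}

# Hypothetical probe using the wget options from the patched agent.
# Exit status is wget's: 0 on success, non-zero on failure or timeout.
# Defined for illustration only; it is not called below.
probe_status() {
    wget -O- -q -L --no-proxy -T 3 -t 1 --bind-address=127.0.0.1 "$1" >/dev/null
}

# Values from the tests: wget -T 3 -t 1, op monitor timeout="5s".
if fits_op_timeout 3 1 5; then
    echo "probe fails on its own before the op timeout (rc=1, status=complete)"
else
    echo "lrmd kills the monitor first (rc=-2, status=Timed Out)"
fi
```

With the unpatched agent there is no -T/-t bound at all, so the second branch is the one that applies in TEST 1.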