
dejanmm at fastmail
Jun 29, 2010, 8:50 AM
Post #2 of 2
(349 views)
Permalink
|
|
Re: PATCH VirtualDomain RA fix spurious stop failures
[In reply to]
|
|
Hi, On Mon, Jun 28, 2010 at 11:25:35AM +0200, Lars Ellenberg wrote: > > This fixes various problems leading to spurious stop failures: > > Don't fail on stop just because one virsh domstate got "no state". > Don't fail a "forced stop" of an already stopped resource. > Don't timeout in stop before escalating to "forced stop". > > Combination of the first two bugs frequently led to "failed stop", > where the stop in fact succeeded just fine. > > > diff -r 8cb5ba3e1d97 heartbeat/VirtualDomain > --- a/heartbeat/VirtualDomain Fri Jun 25 19:54:24 2010 +0200 > +++ b/heartbeat/VirtualDomain Mon Jun 28 11:20:42 2010 +0200 > @@ -24,6 +24,11 @@ > : ${OCF_RESKEY_hypervisor=${OCF_RESKEY_hypervisor_default}} > ####################################################################### > > +## I'd very much suggest to make this RA use bash, > +## and then use magic $SECONDS. > +## But for now: > +NOW=$(date +%s) > + Don't see why would that make such a difference. At any rate, we can't switch shells forward in the existing agents. > usage() { > echo "usage: $0 {start|stop|status|monitor|migrate_to|migrate_from|meta-data|validate-all}" > } > @@ -149,9 +154,11 @@ > } > > VirtualDomain_Status() { > + local try=0 > rc=$OCF_ERR_GENERIC > status="no state" > while [ "$status" = "no state" ]; do > + try=$(($try + 1 )) > status="`virsh $VIRSH_OPTIONS domstate $DOMAIN_NAME`" > case "$status" in > "shut off") > @@ -175,7 +182,7 @@ > # whenever virsh can't reliably obtain the domain > # state. > status="no state" > - if [ "$__OCF_ACTION" = "stop" ]; then > + if [ "$__OCF_ACTION" = "stop" ] && [ $try -ge 3 ]; then > # During the stop operation, we want to bail out > # quickly, so as to be able to force-stop (destroy) > # the domain if necessary. > @@ -221,6 +228,7 @@ > local i > local status > local shutdown_timeout > + local out ex > > VirtualDomain_Status > status=$? > @@ -233,9 +241,9 @@ > virsh $VIRSH_OPTIONS shutdown ${DOMAIN_NAME} > # The "shutdown_timeout" we use here is the operation > # timeout specified in the CIB, minus 5 seconds > - shutdown_timeout=$((($OCF_RESKEY_CRM_meta_timeout/1000)-5)) > - # Loop on status for $shutdown_timeout seconds > - for i in `seq $shutdown_timeout`; do > + shutdown_timeout=$(( $NOW + ($OCF_RESKEY_CRM_meta_timeout/1000) -5 )) > + # Loop on status until we reach $shutdown_timeout > + while [ $NOW -lt $shutdown_timeout ]; do > VirtualDomain_Status > status=$? > case $status in > @@ -256,6 +264,7 @@ > # resort to forced stop (destroy). > break; > esac > + NOW=$(date +%s) > done > fi > ;; > @@ -264,11 +273,24 @@ > return $OCF_SUCCESS > esac > # OK. Now if the above graceful shutdown hasn't worked, kill > - # off the domain with destroy. If that too does not work, give > - # up and have the LRM time us out. > + # off the domain with destroy. If that too does not work, > + # have the LRM time us out. > ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}." > - virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} || return $OCF_ERR_GENERIC > + out=$(virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1) ; ex=$? Split into two lines? (easy to miss the second statement) > + echo >&2 "$out" > + # unconditionally clean up. > VirtualDomain_Cleanup_Statefile > + case $ex$out in > + *"error: Requested operation is not valid: domain is not running"*) > + : ;; # unexpected path to the intended outcome, all is well > + [!0]*) > + return $OCF_ERR_GENERIC ;; > + 0*) > + while [ $status != $OCF_NOT_RUNNING ]; do > + VirtualDomain_Status > + status=$? > + done ;; > + esac > return $OCF_SUCCESS > } The patch looks fine to me. Though the whole matter seems to be rather complicated. Florian, would you like to comment? Cheers, Dejan _______________________________________________________ Linux-HA-Dev: Linux-HA-Dev [at] lists http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
|