
sussox at gmail
Jul 3, 2009, 12:30 AM
Weird HA-behavior with XEN/HA/DRBD/LVM
I have an HA setup with Xen, heartbeat, LVM and DRBD, built according to this howto: http://www.asplund.nu/xencluster/xen-cluster-howto.html

I'm using two PowerEdge R710s, ha1 and ha2, running Ubuntu 8.04.

I can do a manual live migration of the domUs without any problems. If I shut down heartbeat on ha1 with init.d/heartbeat, the domUs are migrated to ha2 successfully. And if I pull the plug on ha1, the domUs start on ha2 after a while (as they should).

However! When I run "reboot" on ha1, the domUs begin to migrate but then crash on ha2. I'm pasting ha-debug and xend.log below.

Any ideas why it keeps doing this? All I can think of is that some process is being killed too fast while the domUs are being migrated, but I don't know what to look for.

Also, I ran a test a couple of weeks ago with the same setup, but on one R710 and an older Shuttle, and there was no problem then. I have redone the howto twice, with the same result.

Cheers!
/Sussox

ha-debug:
Code:
heartbeat[28295]: 2009/06/30_13:46:12 info: Received shutdown notice from 'ha1.vbm.se'.
heartbeat[28295]: 2009/06/30_13:46:12 info: Resources being acquired from ha1.vbm.se.
heartbeat[28295]: 2009/06/30_13:46:12 debug: StartNextRemoteRscReq(): child count 1
heartbeat[29445]: 2009/06/30_13:46:12 info: acquire local HA resources (standby).
ResourceManager[29472]: 2009/06/30_13:46:12 info: Acquiring resource group: ha2.vbm.se xendomainsHA2
heartbeat[29446]: 2009/06/30_13:46:13 info: Local Resource acquisition completed.
heartbeat[28295]: 2009/06/30_13:46:13 debug: StartNextRemoteRscReq(): child count 2
heartbeat[28295]: 2009/06/30_13:46:13 debug: StartNextRemoteRscReq(): child count 1
ResourceManager[29472]: 2009/06/30_13:46:13 info: Running /etc/ha.d/resource.d/xendomainsHA2 start
ResourceManager[29472]: 2009/06/30_13:46:13 debug: Starting /etc/ha.d/resource.d/xendomainsHA2 start
ResourceManager[29472]: 2009/06/30_13:46:13 debug: /etc/ha.d/resource.d/xendomainsHA2 start done. RC=0
heartbeat[29445]: 2009/06/30_13:46:13 info: local HA resource acquisition completed (standby).
heartbeat[28295]: 2009/06/30_13:46:13 info: Standby resource acquisition done [foreign].
heartbeat[29559]: 2009/06/30_13:46:13 debug: notify_world: setting SIGCHLD Handler to SIG_DFL
harc[29559]: 2009/06/30_13:46:13 info: Running /etc/ha.d/rc.d/status status
mach_down[29573]: 2009/06/30_13:46:13 info: Taking over resource group xendomainsHA1
ResourceManager[29597]: 2009/06/30_13:46:13 info: Acquiring resource group: ha1.vbm.se xendomainsHA1
ResourceManager[29597]: 2009/06/30_13:46:13 info: Running /etc/ha.d/resource.d/xendomainsHA1 start
ResourceManager[29597]: 2009/06/30_13:46:13 debug: Starting /etc/ha.d/resource.d/xendomainsHA1 start
Starting auto Xen domains: hejsan(skip) * [done]
ResourceManager[29597]: 2009/06/30_13:46:13 debug: /etc/ha.d/resource.d/xendomainsHA1 start done. RC=0
mach_down[29573]: 2009/06/30_13:46:13 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down[29573]: 2009/06/30_13:46:13 info: mach_down takeover complete for node ha1.vbm.se.
heartbeat[28295]: 2009/06/30_13:46:13 info: mach_down takeover complete.
heartbeat[29696]: 2009/06/30_13:46:13 debug: notify_world: setting SIGCHLD Handler to SIG_DFL
harc[29696]: 2009/06/30_13:46:13 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
ip-request-resp[29696]: 2009/06/30_13:46:13 received ip-request-resp xendomainsHA2 OK yes
ResourceManager[29715]: 2009/06/30_13:46:13 info: Acquiring resource group: ha2.vbm.se xendomainsHA2
ResourceManager[29715]: 2009/06/30_13:46:13 info: Running /etc/ha.d/resource.d/xendomainsHA2 start
ResourceManager[29715]: 2009/06/30_13:46:13 debug: Starting /etc/ha.d/resource.d/xendomainsHA2 start
ResourceManager[29715]: 2009/06/30_13:46:13 debug: /etc/ha.d/resource.d/xendomainsHA2 start done. RC=0
heartbeat[28295]: 2009/06/30_13:46:24 WARN: node ha1.vbm.se: is dead
heartbeat[28295]: 2009/06/30_13:46:24 info: Dead node ha1.vbm.se gave up resources.
heartbeat[28295]: 2009/06/30_13:46:24 info: Link ha1.vbm.se:eth0 dead.

xend.log
Code:
[2009-06-30 13:46:11 5499] DEBUG (XendCheckpoint:210) restore:shadow=0x0, _static_max=0x18000000, _static_min=0x0,
[2009-06-30 13:46:11 5499] DEBUG (balloon:151) Balloon: 398436 KiB free; need 393216; done.
[2009-06-30 13:46:11 5499] DEBUG (XendCheckpoint:227) [xc_restore]: /usr/lib/xen/bin/xc_restore 4 7 1 2 0 0 0
[2009-06-30 13:46:11 5499] INFO (XendCheckpoint:365) xc_domain_restore start: p2m_size = 18800
[2009-06-30 13:46:11 5499] INFO (XendCheckpoint:365) Reloading memory pages: 0%
[2009-06-30 13:46:14 5499] INFO (XendCheckpoint:365) ERROR Internal error: Error when reading page (type was 0)
[2009-06-30 13:46:14 5499] INFO (XendCheckpoint:365) Restore exit with rc=1
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1913) XendDomainInfo.destroy: domid=7
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1930) XendDomainInfo.destroyDomain(7)
[2009-06-30 13:46:14 5499] ERROR (XendDomainInfo:1942) XendDomainInfo.destroy: xc.domain_destroy failed.
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/xen/xend/XendDomainInfo.py", line 1937, in destroyDomain
    xc.domain_destroy(self.domid)
Error: (3, 'No such process')
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1553) No device model
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1555) Releasing devices
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1561) Removing vif/0
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:590) XendDomainInfo.destroyDevice: deviceClass = vif, device = vif/0
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1561) Removing vbd/51713
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:590) XendDomainInfo.destroyDevice: deviceClass = vbd, device = vbd/51713
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1561) Removing vbd/51714
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:590) XendDomainInfo.destroyDevice: deviceClass = vbd, device = vbd/51714
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1561) Removing console/0
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:590) XendDomainInfo.destroyDevice: deviceClass = console, device = console/0
[2009-06-30 13:46:14 5499] ERROR (XendDomain:1136) Restore failed
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/xen/xend/XendDomain.py", line 1134, in domain_restore_fd
    return XendCheckpoint.restore(self, fd, paused=paused)
  File "/usr/lib/python2.5/site-packages/xen/xend/XendCheckpoint.py", line 231, in restore
    forkHelper(cmd, fd, handler.handler, True)
  File "/usr/lib/python2.5/site-packages/xen/xend/XendCheckpoint.py", line 353, in forkHelper
    raise XendError("%s failed" % string.join(cmd))
XendError: /usr/lib/xen/bin/xc_restore 4 7 1 2 0 0 0 failed

What xendomainsHA2 does:

#!/bin/bash
#
# /etc/init.d/xendomains
# Start / stop domains automatically when domain 0 boots / shuts down.
#
# chkconfig: 345 99 00
# description: Start / stop Xen domains.
#
# This script offers fairly basic functionality.
# It should work on Redhat
# but also on LSB-compliant SuSE releases and on Debian with the LSB package
# installed. (LSB is the Linux Standard Base)
#
# Based on the example in the "Designing High Quality Integrated Linux
# Applications HOWTO" by Avi Alkalay
# <http://www.tldp.org/HOWTO/HighQuality-Apps-HOWTO/>
#
### BEGIN INIT INFO
# Provides:          xendomains
# Required-Start:    $syslog $remote_fs xend
# Should-Start:
# Required-Stop:     $syslog $remote_fs xend
# Should-Stop:
# Default-Start:     3 4 5
# Default-Stop:      0 1 2 6
# Short-Description: Start/stop secondary xen domains
# Description:       Start / stop domains automatically when domain 0
#                    boots / shuts down.
### END INIT INFO

# Correct exit code would probably be 5, but it's enough
# if xend complains if we're not running as privileged domain
if ! [ -e /proc/xen/privcmd ]; then
    exit 0
fi

LOCKFILE=/var/lock/xendomainsHA2
XENDOM_CONFIG=/etc/default/xendomainsHA2

test -r $XENDOM_CONFIG || { echo "$XENDOM_CONFIG not existing";
    if [ "$1" = "stop" ]; then exit 0;
    else exit 6; fi; }

. $XENDOM_CONFIG

# Use the SUSE rc_ init script functions;
# emulate them on LSB, RH and other systems
if test -e /etc/rc.status; then
    # SUSE rc script library
    . /etc/rc.status
else
    _cmd=$1
    declare -a _SMSG
    if test "${_cmd}" = "status"; then
        _SMSG=(running dead dead unused unknown)
        _RC_UNUSED=3
    else
        _SMSG=(done failed failed missed failed skipped unused failed failed)
        _RC_UNUSED=6
    fi
    if test -e /etc/init.d/functions; then
        # REDHAT
        . /etc/init.d/functions
        echo_rc()
        {
            #echo -n " [${_SMSG[${_RC_RV}]}] "
            if test ${_RC_RV} = 0; then
                success " [${_SMSG[${_RC_RV}]}] "
            else
                failure " [${_SMSG[${_RC_RV}]}] "
            fi
        }
    elif test -e /lib/lsb/init-functions; then
        # LSB
        . /lib/lsb/init-functions
        if alias log_success_msg >/dev/null 2>/dev/null; then
            echo_rc()
            {
                echo " [${_SMSG[${_RC_RV}]}] "
            }
        else
            echo_rc()
            {
                if test ${_RC_RV} = 0; then
                    log_success_msg " [${_SMSG[${_RC_RV}]}] "
                else
                    log_failure_msg " [${_SMSG[${_RC_RV}]}] "
                fi
            }
        fi
    else
        # emulate it
        echo_rc()
        {
            echo " [${_SMSG[${_RC_RV}]}] "
        }
    fi
    rc_reset() { _RC_RV=0; }
    rc_failed()
    {
        if test -z "$1"; then
            _RC_RV=1;
        elif test "$1" != "0"; then
            _RC_RV=$1;
        fi
        return ${_RC_RV}
    }
    rc_check()
    {
        return rc_failed $?
    }
    rc_status()
    {
        rc_failed $?
        if test "$1" = "-r"; then _RC_RV=0; shift; fi
        if test "$1" = "-s"; then rc_failed 5; echo_rc; rc_failed 3; shift; fi
        if test "$1" = "-u"; then rc_failed ${_RC_UNUSED}; echo_rc; rc_failed 3; shift; fi
        if test "$1" = "-v"; then echo_rc; shift; fi
        if test "$1" = "-r"; then _RC_RV=0; shift; fi
        return ${_RC_RV}
    }
    rc_exit() { exit ${_RC_RV}; }
    rc_active()
    {
        if test -z "$RUNLEVEL"; then read RUNLEVEL REST < <(/sbin/runlevel); fi
        if test -e /etc/init.d/S[0-9][0-9]${1}; then return 0; fi
        return 1
    }
fi

if ! which usleep >&/dev/null
then
    usleep()
    {
        if [ -n "$1" ]
        then
            sleep $(( $1 / 1000000 ))
        fi
    }
fi

# Reset status of this service
rc_reset

##
# Returns 0 (success) if the given parameter names a directory, and that
# directory is not empty.
#
contains_something()
{
    if [ -d "$1" ] && [ `/bin/ls $1 | wc -l` -gt 0 ]
    then
        return 0
    else
        return 1
    fi
}

# read name from xen config file
rdname()
{
    NM=$(xm create --quiet --dryrun --defconfig "$1" |
         sed -n 's/^.*(name \(.*\))$/\1/p')
}

rdnames()
{
    NAMES=
    if ! contains_something "$XENDOMAINS_AUTO"
    then
        return
    fi
    for dom in $XENDOMAINS_AUTO/*; do
        rdname $dom
        if test -z $NAMES; then
            NAMES=$NM;
        else
            NAMES="$NAMES|$NM"
        fi
    done
}

parseln()
{
    name=`echo "$1" | cut -d\  -f1`
    name=${name%% *}
    rest=`echo "$1" | cut -d\  -f2-`
    read id mem cpu vcpu state tm < <(echo "$rest")
}

is_running()
{
    rdname $1
    RC=1
    while read LN; do
        parseln "$LN"
        if test "$id" = "0"; then continue; fi
        case $name in
            ($NM)
                RC=0
                ;;
        esac
    done < <(xm list | grep -v '^Name')
    return $RC
}

start()
{
    if [ -f $LOCKFILE ]; then
        echo -n "xendomains already running (lockfile exists)"
        return;
    fi

    saved_domains=" "
    if [ "$XENDOMAINS_RESTORE" = "true" ] && contains_something "$XENDOMAINS_SAVE"
    then
        mkdir -p $(dirname "$LOCKFILE")
        touch $LOCKFILE
        echo -n "Restoring Xen domains:"
        saved_domains=`ls $XENDOMAINS_SAVE`
        for dom in $XENDOMAINS_SAVE/*; do
            echo -n " ${dom##*/}"
            xm restore $dom
            if [ $? -ne 0 ]; then
                rc_failed $?
                echo -n '!'
            else
                # mv $dom ${dom%/*}/.${dom##*/}
                rm $dom
            fi
        done
        echo .
    fi

    if contains_something "$XENDOMAINS_AUTO"
    then
        touch $LOCKFILE
        echo -n "Starting auto Xen domains:"
        # We expect config scripts for auto starting domains to be in
        # XENDOMAINS_AUTO - they could just be symlinks to files elsewhere

        # Create all domains with config files in XENDOMAINS_AUTO.
        # TODO: We should record which domain name belongs
        # so we have the option to selectively shut down / migrate later
        # If a domain statefile from $XENDOMAINS_SAVE matches a domain name
        # in $XENDOMAINS_AUTO, do not try to start that domain; if it didn't
        # restore correctly it requires administrative attention.
        for dom in $XENDOMAINS_AUTO/*; do
            echo -n " ${dom##*/}"
            shortdom=$(echo $dom | sed -n 's/^.*\/\(.*\)$/\1/p')
            echo $saved_domains | grep -w $shortdom > /dev/null
            if [ $? -eq 0 ] || is_running $dom; then
                echo -n "(skip)"
            else
                xm create --quiet --defconfig $dom
                if [ $? -ne 0 ]; then
                    rc_failed $?
                    echo -n '!'
                else
                    usleep $XENDOMAINS_CREATE_USLEEP
                fi
            fi
        done
    fi
}

all_zombies()
{
    while read LN; do
        parseln "$LN"
        if test $id = 0; then continue; fi
        if test "$state" != "-b---d" -a "$state" != "-----d"; then
            return 1;
        fi
    done < <(xm list | grep -v '^Name')
    return 0
}

# Wait for max $XENDOMAINS_STOP_MAXWAIT for xm $1 to finish;
# if it has not exited by that time kill it, so the init script will
# succeed within a finite amount of time; if $2 is nonnull, it will
# kill the command as well as soon as no domain (except for zombies)
# are left (used for shutdown --all).
watchdog_xm()
{
    if test -z "$XENDOMAINS_STOP_MAXWAIT" -o "$XENDOMAINS_STOP_MAXWAIT" = "0"; then
        exit
    fi
    usleep 20000
    for no in `seq 0 $XENDOMAINS_STOP_MAXWAIT`; do
        # exit if xm save/migrate/shutdown is finished
        PSAX=`ps axlw | grep "xm $1" | grep -v grep`
        if test -z "$PSAX"; then exit; fi
        echo -n "."; sleep 1
        # go to kill immediately if there's only zombies left
        if all_zombies && test -n "$2"; then break; fi
    done
    sleep 1
    read PSF PSUID PSPID PSPPID < <(echo "$PSAX")
    # kill xm $1
    kill $PSPID >/dev/null 2>&1
}

stop()
{
    # Collect list of domains to shut down
    if test "$XENDOMAINS_AUTO_ONLY" = "true"; then
        rdnames
    fi
    echo -n "Shutting down Xen domains:"
    while read LN; do
        parseln "$LN"
        if test $id = 0; then continue; fi
        echo -n " $name"
        if test "$XENDOMAINS_AUTO_ONLY" = "true"; then
            case $name in
                ($NAMES)
                    # nothing
                    ;;
                (*)
                    echo -n "(skip)"
                    continue
                    ;;
            esac
        fi
        # XENDOMAINS_SYSRQ should be something like just "s"
        # or "s e i u" or even "s e s i u o"
        # for the latter, you should set XENDOMAINS_USLEEP to 1200000 or so
        if test -n "$XENDOMAINS_SYSRQ"; then
            for sysrq in $XENDOMAINS_SYSRQ; do
                echo -n "(SR-$sysrq)"
                xm sysrq $id $sysrq
                if test $? -ne 0; then
                    rc_failed $?
                    echo -n '!'
                fi
                # usleep just ignores empty arg
                usleep $XENDOMAINS_USLEEP
            done
        fi
        if test "$state" = "-b---d" -o "$state" = "-----d"; then
            echo -n "(zomb)"
            continue
        fi
        if test -n "$XENDOMAINS_MIGRATE"; then
            echo -n "(migr)"
            watchdog_xm migrate &
            WDOG_PID=$!
            xm migrate $id $XENDOMAINS_MIGRATE
            if test $? -ne 0; then
                rc_failed $?
                echo -n '!'
                kill $WDOG_PID >/dev/null 2>&1
            else
                kill $WDOG_PID >/dev/null 2>&1
                continue
            fi
        fi
        if test -n "$XENDOMAINS_SAVE"; then
            echo -n "(save)"
            watchdog_xm save &
            WDOG_PID=$!
            mkdir -p "$XENDOMAINS_SAVE"
            xm save $id $XENDOMAINS_SAVE/$name
            if test $? -ne 0; then
                rc_failed $?
                echo -n '!'
                kill $WDOG_PID >/dev/null 2>&1
            else
                kill $WDOG_PID >/dev/null 2>&1
                continue
            fi
        fi
        if test -n "$XENDOMAINS_SHUTDOWN"; then
            # XENDOMAINS_SHUTDOWN should be "--halt --wait"
            echo -n "(shut)"
            watchdog_xm shutdown &
            WDOG_PID=$!
            xm shutdown $id $XENDOMAINS_SHUTDOWN
            if test $? -ne 0; then
                rc_failed $?
                echo -n '!'
            fi
            kill $WDOG_PID >/dev/null 2>&1
        fi
    done < <(xm list | grep -v '^Name')

    # NB. this shuts down ALL Xen domains (politely), not just the ones in
    # AUTODIR/*
    # This is because it's easier to do ;-) but arguably if this script is run
    # on system shutdown then it's also the right thing to do.
    if ! all_zombies && test -n "$XENDOMAINS_SHUTDOWN_ALL"; then
        # XENDOMAINS_SHUTDOWN_ALL should be "--all --halt --wait"
        echo -n " SHUTDOWN_ALL "
        watchdog_xm shutdown 1 &
        WDOG_PID=$!
        xm shutdown $XENDOMAINS_SHUTDOWN_ALL
        if test $? -ne 0; then
            rc_failed $?
            echo -n '!'
        fi
        kill $WDOG_PID >/dev/null 2>&1
    fi

    # Unconditionally delete lock file
    rm -f $LOCKFILE
}

check_domain_up()
{
    while read LN; do
        parseln "$LN"
        if test $id = 0; then continue; fi
        case $name in
            ($1)
                return 0
                ;;
        esac
    done < <(xm list | grep -v "^Name")
    return 1
}

check_all_auto_domains_up()
{
    if ! contains_something "$XENDOMAINS_AUTO"
    then
        return 0
    fi
    missing=
    for nm in $XENDOMAINS_AUTO/*; do
        rdname $nm
        found=0
        if check_domain_up "$NM"; then
            echo -n " $name"
        else
            missing="$missing $NM"
        fi
    done
    if test -n "$missing"; then
        echo -n " MISS AUTO:$missing"
        return 1
    fi
    return 0
}

check_all_saved_domains_up()
{
    if ! contains_something "$XENDOMAINS_SAVE"
    then
        return 0
    fi
    missing=`/bin/ls $XENDOMAINS_SAVE`
    echo -n " MISS SAVED: " $missing
    return 1
}

# This does NOT necessarily restart all running domains: instead it
# stops all running domains and then boots all the domains specified in
# AUTODIR. If other domains have been started manually then they will
# not get restarted.
# Commented out to avoid confusion!

restart()
{
    stop
    start
}

reload()
{
    restart
}

case "$1" in
    start)
        start
        rc_status
        if test -f $LOCKFILE; then rc_status -v; fi
        ;;

    stop)
        stop
        rc_status -v
        ;;

    restart)
        restart
        ;;
    reload)
        reload
        ;;
    force-reload)
        reload
        ;;

    status)
        echo -n "Checking for xendomains:"
        if test ! -f $LOCKFILE; then
            rc_failed 3
        else
            check_all_auto_domains_up
            rc_status
            check_all_saved_domains_up
            rc_status
        fi
        rc_status -v
        ;;

    *)
        echo "Usage: $0 {start|stop|restart|reload|status}"
        rc_failed 3
        rc_status -v
        ;;
esac

rc_exit

--
View this message in context: http://www.nabble.com/Weird-HA-behavior-with-XEN-HA-DRBD-LVM-tp24318862p24318862.html
Sent from the Linux-HA mailing list archive at Nabble.com.

_______________________________________________
Linux-HA mailing list
Linux-HA[at]lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
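
One way to follow up on the suspicion in the post that something is stopped too early during "reboot" is to look at the order in which the shutdown scripts run. The following is only a sketch, not a confirmed diagnosis. It assumes a SysV-style layout in which a runlevel directory such as /etc/rc6.d holds links named K<NN><service> that are run in sorted order. The function names kill_order and stops_before are made up for illustration and are not part of heartbeat or Xen.

```shell
#!/bin/bash
# Sketch: inspect the stop order of SysV "K" links in a runlevel directory.
# Assumption: links are named K<NN><service> and are executed in sorted order.

# kill_order DIR
# Print the service names of all K links in DIR, one per line, in stop order.
kill_order() {
    local dir=$1 link name
    for link in "$dir"/K[0-9][0-9]*; do
        # Skip the unexpanded pattern when nothing matches.
        if [ ! -e "$link" ] && [ ! -L "$link" ]; then
            continue
        fi
        name=${link##*/}              # strip the directory part
        echo "${name#K[0-9][0-9]}"    # strip "K" and the two-digit priority
    done
}

# stops_before DIR FIRST SECOND
# Exit status 0 if FIRST is stopped before SECOND, 1 if SECOND comes first,
# 2 if neither service has a K link in DIR.
stops_before() {
    local dir=$1 first=$2 second=$3 name
    for name in $(kill_order "$dir"); do
        if [ "$name" = "$first" ]; then
            return 0
        fi
        if [ "$name" = "$second" ]; then
            return 1
        fi
    done
    return 2
}
```

For example, `kill_order /etc/rc6.d` lists the services in the order they are stopped on reboot, and `stops_before /etc/rc6.d heartbeat xend` succeeds if heartbeat is stopped before xend. The service names heartbeat and xend are assumptions here; the links may carry different names on a given system.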