maik.brauer at mbs-systems
Sep 11, 2012, 3:46 PM
Post #11 of 26
Re: Dom0 crashed when rebooting whilst DomU are running
On Sep 10, 2012, at 5:10 PM, Ian Campbell wrote:
> On Mon, 2012-09-10 at 16:00 +0100, Maik Brauer wrote:
>> On Sep 10, 2012, at 10:39 AM, Ian Campbell wrote:
>>> On Sat, 2012-09-08 at 15:50 +0100, Maik Brauer wrote:
>>>> On Sep 4, 2012, at 10:11 AM, Ian Campbell wrote:
>>>>> Could you not top post please, it makes it rather hard to follow the
>>>>> flow of the conversation.
>>>>> On Mon, 2012-09-03 at 18:10 +0100, Casey DeLorme wrote:
>>>>>> As stated, you can alias shutdown to do exactly what you need, it can
>>>>>> be as simple as a series of hard-coded operations to a complex custom
>>>>>> shell script that parses your domains and closes each with feedback.
>>>>> Xen ships the "xendomains" initscript which can halt guest on shutdown
>>>>> as well as automatically start specific guests on boot. It can also be
>>>>> configured to suspend/resume them or (I think) migrate them away.
>>>>> For diagnosing the crash itself more details will be required than were
>>>>> provided in the original post. Please see
>>>>> http://wiki.xen.org/wiki/Reporting_Bugs_against_Xen for some guidance.
>>>>> At a minimum we would need a capture (serial console or photo) of the
>>>>> crash backtrace.
>>>> I found out that it hangs during re-boot of dom0 when having more
>>>> Network interfaces involved, like:
>>>> vif = [ 'mac=06:46:AB:CC:11:01, ip=<myIPadress>', '', '',
>>>> 'mac=06:04:AB:BB:11:03, bridge=VLAN20, script=vif-bridge', '',
>>>> 'mac=06:04:AB:BB:11:05, bridge=VLAN40, script=vif-bridge' ]
>>> 6 interfaces total, 3 of which have a random mac on each reboot and all
>>> get put on the default bridge?
>> No, not really. The bridge is different for each interface.
> You have three lots of '' which will all go onto the same bridge AFAICT
> (whichever one is determined to be the default)
That is right. As long as I do not specify a different bridge or script for an entry, the default is used for ''.
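
To make the point about the empty entries concrete, here is a minimal sketch that groups the vif entries above by the bridge they name. It assumes that an entry without a bridge= key goes to the default bridge; DEFAULT_BRIDGE, parse_vif and group_by_bridge are hypothetical names used only for illustration, not part of any Xen tool.

```python
# Sketch only: group vif entries by the bridge they name.
# Assumption: an entry without "bridge=" lands on the default bridge.
from collections import defaultdict

DEFAULT_BRIDGE = "<default>"  # placeholder label, not a real bridge name


def parse_vif(spec):
    """Turn 'key=value, key=value' into a dict; '' gives an empty dict."""
    fields = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def group_by_bridge(vifs):
    """Map bridge name -> list of vif indexes that would be attached to it."""
    groups = defaultdict(list)
    for index, spec in enumerate(vifs):
        bridge = parse_vif(spec).get("bridge", DEFAULT_BRIDGE)
        groups[bridge].append(index)
    return dict(groups)


# The vif list from the guest config quoted above.
vif = [
    "mac=06:46:AB:CC:11:01, ip=<myIPadress>", "", "",
    "mac=06:04:AB:BB:11:03, bridge=VLAN20, script=vif-bridge", "",
    "mac=06:04:AB:BB:11:05, bridge=VLAN40, script=vif-bridge",
]

print(group_by_bridge(vif))
```

Under that assumption, only interfaces 3 and 5 get their own bridge, which matches the vif2.3 and vif2.5 names in the kernel log further down.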
>>> If it is a hang then you might have some luck using the magic sysrq keys
>>> to print lists of blocked tasks. I'm not sure in Squeeze but you might
>>> need to enable this as described in Documentation/sysrq.txt in the Linux
>>> kernel source.
>>> Blocked tasks are listed with SysRQ-'w'. If you have serial console then
>>> 't' will list all tasks, but that list can be quite long so it is useless
>>> without a serial console.
>> List is empty. SysRQ -w and SysRQ-t shows nothing at all.
> You might need to increase the log verbosity with SysRQ-9 first?
I did, and now I get more information. But because of the amount of data that scrolls past on the console screen, I am not able
to record it properly. Can you advise what to do here?
>> There is nothing running anymore.
>> It shows periodically: INFO: task xenwatch:12 blocked for more than 120 seconds
> What is the very last thing printed before this?
There is nothing before. Just that message pops up periodically.
>> Seems that the xenwatch is blocking the reboot here, is that assumption correct? But strange enough that I can't
>> see any process anymore with the SysRQ -t or SysRQ -w
> The xenwatch thread ought to count as a process for at least the
> purposes of SysRQ-t if not -w.
Could be, but the output scrolls past on the screen so quickly that I am not able to read it line by line.
Please advise a procedure to record it.
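
Once the console output can be captured as text (for example over a serial console, which I do not have set up yet), something like the following sketch could cut it down to the hung-task warnings. The function name blocked_tasks and the sample lines are made up for illustration; only the "INFO: task ... blocked for more than ... seconds" wording comes from what I see on the console.

```python
# Sketch only: count hung-task warnings per task in captured console output.
import re

HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def blocked_tasks(lines):
    """Return {(task_name, pid): number_of_warnings} for the given lines."""
    seen = {}
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            key = (match.group("name"), int(match.group("pid")))
            seen[key] = seen.get(key, 0) + 1
    return seen


# Illustrative input; real input would be the lines of the captured log.
sample = [
    "INFO: task xenwatch:12 blocked for more than 120 seconds",
    "some unrelated console line",
    "INFO: task xenwatch:12 blocked for more than 120 seconds",
]

print(blocked_tasks(sample))
```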
>>>> In the logfile /var/log/messages you can find these as the last lines:
>>>> Sep 8 15:44:28 rootsrv01 shutdown: shutting down for system reboot
>>>> Sep 8 15:44:31 rootsrv01 kernel: [ 73.716246] VLAN20: port 1(vif2.3) entering forwarding state
>>>> Sep 8 15:44:31 rootsrv01 kernel: [ 74.500111] VLAN40: port 1(vif2.5) entering forwarding state
>>>> Sep 8 15:44:34 rootsrv01 kernel: [ 77.317431] VLAN20: port 1(vif2.3) entering disabled state
>>>> Sep 8 15:44:34 rootsrv01 kernel: [ 77.317490] VLAN20: port 1(vif2.3) entering disabled state
>>>> Sep 8 15:44:36 rootsrv01 kernel: [ 79.368685] VLAN40: port 1(vif2.5) entering disabled state
>>>> Sep 8 15:44:36 rootsrv01 kernel: [ 79.369156] VLAN40: port 1(vif2.5) entering disabled state
>>>> Sep 8 15:44:37 rootsrv01 kernel: Kernel logging (proc) stopped.
>>>> Sep 8 15:44:37 rootsrv01 rsyslogd: [origin software="rsyslogd" swVersion="4.6.4" x-pid="890" x-info="http://www.rsyslog.com"] exiting on signal 15.
>>>> In the /var/log/daemon.log you can find this message:
>>>> Sep 8 15:44:37 rootsrv01 acpid: exiting
>>>> Sep 8 15:44:37 rootsrv01 rpc.statd: Caught signal 15, un-registering and exiting
>>> All the above (both message and daemon.log) look like normal parts of
>>> shutting down to me.
>>>> Sep 8 15:44:37 rootsrv01 udevd-work: '/etc/xen/scripts/vif-setup offline type_if=vif' unexpected exit with status 0x000f
>>> This might be worth following up on.
>> When putting a "sleep 5" in stop section of the /etc/init.d/xendomains:
>> case "$1" in
>> if test -f $LOCKFILE; then rc_status -v; fi
>> rc_status -v
>> sleep 5
>> then the system shuts down as expected and is rebooting properly.
>> In the daemon.log file I couldn't find the error: Sep 8 15:44:37 rootsrv01 udevd-work: '/etc/xen/scripts/vif-setup offline type_if=vif' unexpected exit with status 0x000f
>> anymore. It seems that it disappeared after putting a delay inside. Could it be a race condition here during shutdown, with the udev-daemon??
> It could be a race with the guests actually shutting down vs the rest of
> the initscripts running.
> Really the initscript ought to wait, the default at least with the
> script shipped with xen is to do so, by using shutdown --wait. Can you
> confirm whether or not this is happening for you?
At least I can see that shutdown --wait is in the scripts, so it seems that the init script is waiting.
But independent of that, something must still be in use, and that blocks the reboot process.
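
As an alternative to a fixed delay, the waiting could poll until no guest is left, with a timeout. The sketch below only illustrates that idea; wait_for_guests and the list_guests callable are hypothetical, and in practice list_guests would have to wrap whatever the toolstack offers for listing running domains, which is not shown here.

```python
# Sketch only: poll until no guests remain, instead of a fixed delay.
import time


def wait_for_guests(list_guests, timeout=60.0, interval=1.0):
    """Return True once list_guests() is empty, False if the timeout expires.

    list_guests is a caller-supplied callable (hypothetical here) that
    returns the names of the guests that are still running.
    """
    deadline = time.monotonic() + timeout
    while True:
        if not list_guests():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Illustrative stand-in: two guests that disappear one after the other.
pending = [["guest1", "guest2"], ["guest2"], []]


def fake_list_guests():
    return pending.pop(0) if pending else []


print(wait_for_guests(fake_list_guests, timeout=5.0, interval=0.01))
```

If this returned False, the guests that are still listed would be the first place to look for whatever is still in use.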
> Possibly someone is trying to talk to xenstore after xenstored has
> exited -- I expect that would cause the sorts of blocked for 120
> messages you are seeing.
Could be, but we need to find out what is blocking the shutdown. I do not know what else I can do to measure and collect
data for the investigation. Please let me know what else I can do. You can easily reproduce this issue when using more than 3 network devices.
I have now installed this on several machines at home, and I have the same issue on all of them when using more than 2-3 network interfaces.
> Xen-users mailing list
> Xen-users [at] lists