Mailing List Archive: DRBD: Users

Tales of woe - possible Debian Squeeze kernel bugs, possible DRBD bugs, possible Xen bugs...

adam.wilbraham at technophobia

Jan 11, 2012, 8:56 AM

Post #1 of 6

I've spent the past couple of days trying to get a pair of servers
into a stable state and thought I would get my issues down in a post
& possible bug report whilst they are still in my head. I'm probably
going to miss some bits of information out, but I'll try to get down
what I remember. It's been a busy couple of days & the combinations
I've tried might mean that none of this is useful for debugging: I
didn't log as much as I should have during troubleshooting, because
I was blocking colleagues and therefore working against the clock.

I started out with a pair of HP DL360 G6's which were built
approximately a year ago, running Debian Squeeze (before it became
stable) with Xen 4.0 on top all from Apt and with DRBD 8.3.10 built
from source (I believe this was stable at the time). The pair of
servers have only been used as internal development hosts and were
never patched up when Debian went stable, so the kernel version was a
little out of date
(xen-linux-system-2.6.32-5-xen-amd64_2.6.32-30_amd64.deb) as were
other packages. Over the last year the pair have been stable almost
all of the time, but we did have a couple of incidents where both
servers rebooted in tandem. Because they weren't business critical,
resolving this was never a priority.

Anyway, we moved the servers to a new location over the weekend and
almost from the point of power up this parallel reboot issue reared
its head. If the servers were sat there idling they would be fine, but
the minute I started to boot up domUs I ran the risk of it happening.
Generally, the more domUs running, the more likely a reboot became,
and it seemed most likely to happen when starting another domU rather
than while just running the VMs that were already online. I eventually
narrowed this down: if I unplugged the network cable being used for
DRBD replication, the server would spew output to the screen and
reboot instantly, with the other one in the pair going about a second
later.

My first thought was that I was massively out of date on patches, so
I ran apt-get update & apt-get dist-upgrade. This brought a new
kernel with it
(xen-linux-system-2.6.32-5-xen-amd64_2.6.32-38_amd64.deb), and at the
same time I think I decided it was probably wise to go to the latest
DRBD (8.4.1), so I built the module and tools and off I went. This
brought an end to the random reboots, but it also brought new
problems, which seemed to suggest that a domU running the Debian
Squeeze kernel could no longer properly access the paravirtualised
disks exposed to it by Xen. On startup, the kernel within the domU
would report that processes had been blocked for 120 seconds, and the
boot never completed. At the same time, I was seeing kernel oops
messages on the console with a large hex string being printed out.
There may be some relevant stuff in my kern.log, actually; I'll have
to have a fish around. For reference, domUs which I had running with
Etch and Lenny kernels didn't exhibit this problem and booted fine.

I did some looking around and found various references to bugs in the
current Squeeze kernel, and suggestions to try the one from proposed
updates (2.6.32-40). Unfortunately, this didn't make any difference to
my problem.

As this was so far all looking kernel related, I went looking for a
newer prebuilt kernel to try. First I pulled
linux-image-2.6.39-bpo.2-amd64_2.6.39-3~bpo60+1_amd64.deb from Squeeze
backports; however, this doesn't have blkback support, so it's no good
as a dom0 kernel. I then grabbed
linux-image-3.1.0-1-amd64_3.1.6-1_amd64.deb from Wheezy / Testing and
found that this works absolutely fine for me. Because it is built
with gcc-4.6, I'm not in a position to build DRBD from source without
another chunk of work, so as the quickest option I reverted to the
in-kernel version (8.3.11) and grabbed a matching tools deb (albeit
from Ubuntu), and lo and behold I appear to have reached a point of
stability.

So to sum up - this pair is now running Squeeze (including
all proposed updates) + Wheezy kernel +
drbd8-utils_2%3a8.3.7-2.1_amd64.deb from Ubuntu, and I finally seem to
have reached a working state again.


--
Adam Wilbraham
Senior Systems Administrator

Technophobia Ltd, Velocity House, 3 Solly Street, Sheffield, S1 4DE

t: +44 (0)114 2212123
e: adam.wilbraham [at] technophobia
w: http://www.technophobia.com/
http://twitter.com/WeTechnophobia

Part of the Capita Group: www.capita.co.uk

Registered in England and Wales Company No. 3063669
VAT registration No. 618 1841 40
ISO 9001:2000 Accredited Company No. 21227
ISO 14001:2004 Accredited Company No. E997
ISO 27001:2005 (BS7799) Accredited Company No. IS 508906
Investor in People Certified No. 101507

The contents of this email are confidential to the addressee
and are intended solely for the recipients use. If you are not
the addressee, you have received this email in error.
Any disclosure, copying, distribution or action taken in
reliance on it is prohibited and may be unlawful.

Any opinions expressed in this email are those of the author
personally and not Technophobia Limited who do not accept
responsibility for the contents of the message.

All email communications, in and out of Technophobia,
are recorded for monitoring purposes.
_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


linux at alteeve

Jan 11, 2012, 9:15 AM

Post #2 of 6

On 01/11/2012 11:56 AM, Adam Wilbraham wrote:
> Anyway, we moved the servers to a new location over the weekend and
> almost from the point of power up this parallel reboot issue reared
> its head. If the servers were sat there idling they would be fine, but
> the minute I started to boot up domUs I ran the risk of it happening.
> Generally, the more domUs running, the more likely a reboot became,
> and it seemed most likely to happen when starting another domU rather
> than while just running the VMs that were already online. I eventually
> narrowed this down: if I unplugged the network cable being used for
> DRBD replication, the server would spew output to the screen and
> reboot instantly, with the other one in the pair going about a second
> later.

If you have fencing configured, as you should, then you could have been
seeing a dual-fence problem. Basically, both nodes send off their kill
commands before either of them dies. I believe this is a known issue with
some iLO-based fencing, but I can't quote a source. Generally, the test is
to put a 5-second delay into the fence script on one node. If that node
then reliably dies first and the other node lives, you found the issue.

You *are* using fencing, right? ;)

--
Digimer
E-Mail: digimer [at] alteeve
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"omg my singularity battery is dead again.
stupid hawking radiation." - epitron


florian at hastexo

Jan 11, 2012, 1:14 PM

Post #3 of 6

On Wed, Jan 11, 2012 at 5:56 PM, Adam Wilbraham
<adam.wilbraham [at] technophobia> wrote:
> Anyway, we moved the servers to a new location over the weekend and
> almost from the point of power up this parallel reboot issue reared
> its head. If the servers were sat there idling they would be fine, but
> the minute I started to boot up domUs I ran the risk of it happening.
> Generally, the more domUs running, the more likely a reboot became,
> and it seemed most likely to happen when starting another domU rather
> than while just running the VMs that were already online. I eventually
> narrowed this down: if I unplugged the network cable being used for
> DRBD replication, the server would spew output to the screen and
> reboot instantly, with the other one in the pair going about a second
> later.

You didn't have the chance to hook up a serial terminal and capture
the log messages that way, I suppose?

Any idea whether you were getting a kernel panic, or an oops?

And, just checking, you did follow
http://www.drbd.org/users-guide-8.3/s-xen-drbd-mod-params.html?

Cheers,
Florian

--
Need help with High Availability?
http://www.hastexo.com/now


adam.wilbraham at technophobia

Jan 12, 2012, 2:51 AM

Post #4 of 6

We don't have fencing configured; this pair has no Pacemaker or
anything like that - it's purely manual failover. IIRC we suspected
that the fencing handlers may have been causing the very occasional
reboots we had seen, so we disabled the reboot calls in the fencing
config. The handlers currently look like this:

    pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh;";
    pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh;";
    local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh;";

... but we don't see any notifications or anything logged to suggest
fencing has been called.


On 11 January 2012 17:15, Digimer <linux [at] alteeve> wrote:
> If you have fencing configured, as you should, then you could have been
> seeing a dual-fence problem. Basically, both nodes send off their kill
> commands before either of them dies. I believe this is a known issue with
> some iLO-based fencing, but I can't quote a source. Generally, the test is
> to put a 5-second delay into the fence script on one node. If that node
> then reliably dies first and the other node lives, you found the issue.
>
> You *are* using fencing, right? ;)

--
Adam Wilbraham


adam.wilbraham at technophobia

Jan 12, 2012, 2:56 AM

Post #5 of 6

Unfortunately not. I have another pair of servers built in the same
way that are not currently used, so I'm hoping that when I get through
my backlog I can spend some further time debugging on those. Regarding
the disable_sendpage=1 module option, I don't believe that is enabled,
actually. I'll try it on the other pair and see if it makes any
difference.
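
For reference, a minimal sketch of how that option is typically set,
following the module-parameter section of the DRBD 8.3 user's guide
linked in the quoted message below. The file name
/etc/modprobe.d/drbd.conf is an assumption based on that guide and may
differ per distribution; the option only takes effect the next time
the drbd module is loaded.

```
# /etc/modprobe.d/drbd.conf  (assumed location, adjust for your distribution)
# Load the drbd module with sendpage disabled, as recommended for Xen
# dom0 kernels in the DRBD 8.3 user's guide.
options drbd disable_sendpage=1
```

On a running node the current value can usually be read from
/sys/module/drbd/parameters/disable_sendpage, assuming the module
exposes the parameter there.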

On 11 January 2012 21:14, Florian Haas <florian [at] hastexo> wrote:
> You didn't have the chance to hook up a serial terminal and capture
> the log messages that way, I suppose?
>
> Any idea whether you were getting a kernel panic, or an oops?
>
> And, just checking, you did follow
> http://www.drbd.org/users-guide-8.3/s-xen-drbd-mod-params.html?

--
Adam Wilbraham


florian at hastexo

Jan 12, 2012, 3:28 AM

Post #6 of 6

On Thu, Jan 12, 2012 at 11:51 AM, Adam Wilbraham
<adam.wilbraham [at] technophobia> wrote:
> We don't have fencing configured; this pair has no Pacemaker or
> anything like that - it's purely manual failover. IIRC we suspected
> that the fencing handlers may have been causing the very occasional
> reboots we had seen, so we disabled the reboot calls in the fencing
> config. The handlers currently look like this:
>
>     pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh;";
>     pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh;";
>     local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh;";

Sure this last bit is intentional? The local-io-error handler may
never even be invoked (you'd have to have disk { on-io-error
call-local-io-error; } for it to ever fire), but most people prefer
to detach on I/O error.
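
A minimal sketch of the two alternatives, assuming a resource named r0
(a hypothetical name; this is only a fragment, and the rest of the
resource definition is left out):

```
resource r0 {
  disk {
    # Detach the backing device on a lower-level I/O error and carry
    # on in diskless mode; the local-io-error handler is not invoked.
    on-io-error detach;

    # Alternative: invoke the local-io-error handler from the
    # handlers section instead of detaching.
    # on-io-error call-local-io-error;
  }
  # remaining resource settings unchanged
}
```

Only one on-io-error policy applies at a time, which is why the second
line is shown commented out.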

Cheers,
Florian
