Mailing List Archive: DRBD: Users

Testing local-io-error handler -- blkid hangs and ties up drbd device


chrisd1100 at gmail

Apr 12, 2012, 5:21 AM

Post #1 of 7 (459 views)
Testing local-io-error handler -- blkid hangs and ties up drbd device

Hello,

Someone please chime in if my method of simulating IO errors is too
complicated and there is an easier way.

I've been trying to simulate IO errors with drbd 8.4.1 by creating a
device-mapper device with dmsetup. I create it on top of a 1GB LVM volume
that was initialized with internal metadata and synced:

dmsetup create bad_disk << EOF
0 1000 linear /dev/vg0/vol575 0
1000 1 error
1001 2096151 linear /dev/vg0/vol575 1001
EOF
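
A minimal sketch of how such a table could be generated instead of typed by
hand. The function name error_table and its argument order are made up for
illustration; all sizes are in 512-byte sectors, as dmsetup expects.

```shell
#!/bin/sh
# Print a device-mapper table that maps a device linearly except for one
# failing sector. All values are in 512-byte sectors.
#   $1 = backing device, $2 = total size in sectors, $3 = sector to fail
error_table() {
    dev=$1
    total=$2
    bad=$3
    after=$((bad + 1))
    # Leading healthy region (omitted when the bad sector is sector 0).
    [ "$bad" -gt 0 ] && echo "0 $bad linear $dev 0"
    echo "$bad 1 error"
    # Trailing healthy region (omitted when the bad sector is the last one).
    [ "$after" -lt "$total" ] && echo "$after $((total - after)) linear $dev $after"
    return 0
}

# Prints the same three lines as the table above for a 2097152-sector volume.
error_table /dev/vg0/vol575 2097152 1000
```

On a real system the size could come from "blockdev --getsz /dev/vg0/vol575"
and the output could be piped into "dmsetup create bad_disk"; "dmsetup remove
bad_disk" tears the mapping down again once drbd no longer holds it open.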

I can now successfully start the drbd device backed by the bad_disk
device-mapper device, and it shows Connected and UpToDate. When I change
its role to Primary, dmesg shows the IO error that I set at sector 1000,
and my custom local-io-error script is called successfully. The drbd
device is also set to a disk state of Diskless.

From that moment on, every other operation attempted on the device hangs.
Somewhere during or shortly after the io-error handler, something ties up
the device and nothing I do can free it. The first problem I can see in
dmesg is this:

INFO: task blkid:1945 blocked for more than 120 seconds.

It might not be drbd; LVM is involved, and so is my manually created
device-mapper device on top of it. I wanted to throw this out there in
case anyone has tried the same thing and hit the same error, or in case
I'm doing something obviously wrong.

Thanks,

Chris


lars.ellenberg at linbit

Apr 12, 2012, 5:36 AM

Post #2 of 7 (444 views)
Re: Testing local-io-error handler -- blkid hangs and ties up drbd device

What is your io error handler trying to do?

It is run synchronously from a drbd kernel thread, which may also be
needed to process further IO requests on that drbd.

If you trigger synchronous IO on that drbd from the handler,
you deadlock on yourself.

You may not even be aware of it: if you do any lvm commands,
they will scan all devices (not filtered), and by doing so,
may trigger IO there.

If you do not expect to use DRBD as a PV,
please reject drbd from your filter in lvm.conf.
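
For illustration, a filter along these lines would reject every DRBD device
and accept everything else. This is only a sketch: merge it with whatever
filter rules the existing lvm.conf already has rather than replacing them.

```
# /etc/lvm/lvm.conf (fragment)
devices {
    # Reject all /dev/drbd* devices, accept everything else.
    filter = [ "r|^/dev/drbd.*|", "a|.*|" ]
}
```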

If that does not help already, and you try to do anything "interesting"
from that handler, consider backgrounding it.
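
A minimal sketch of what backgrounding could look like in a shell handler.
The log file and the function names are hypothetical, and the DRBD_RESOURCE
and DRBD_MINOR environment variables are assumed to be set by drbdadm when
it runs the handler (the script falls back to "unknown" if they are not).

```shell
#!/bin/sh
# Hypothetical local-io-error handler: returns immediately and leaves the
# slow work (for example a remote DB insert) to a background job.

LOG_FILE="${LOG_FILE:-${TMPDIR:-/tmp}/drbd-local-io-error.log}"

# Placeholder for the real notification; it must not touch the DRBD device.
record_io_error() {
    printf '%s local-io-error resource=%s minor=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        "${DRBD_RESOURCE:-unknown}" "${DRBD_MINOR:-unknown}" >> "$LOG_FILE"
}

# Run the notification in a background subshell with stdio detached, so the
# handler itself does not wait for it.
notify_in_background() {
    ( record_io_error ) </dev/null >/dev/null 2>&1 &
}

notify_in_background
```

The point of the design is only that the handler's exit does not depend on
the notification finishing; whatever runs in the background should still
avoid doing IO on the affected drbd device.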

Better yet, tell us what you actually want to achieve,
and we may be able to suggest a solution.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


chrisd1100 at gmail

Apr 12, 2012, 6:14 AM

Post #3 of 7 (452 views)
Re: Testing local-io-error handler -- blkid hangs and ties up drbd device

Thanks for the quick reply,

My test handler currently isn't doing anything interesting: it just
echoes 'hello world' to a file located on a different drive than the LVM
volume. The echo seems to complete successfully, since the file is
written.

The end goal for the handler is simply to insert a row into a remote DB.
Other than that, the default io-error behavior of detaching is exactly
what I want to happen.

I just tried filtering out drbd in lvm.conf, and that doesn't seem to be
the issue. After another try I ran a quick ps auxf, and this showed up:

root   340  0.0  0.0  21392  1284 ?  Ss  12:59  0:00 udevd --daemon
root   415  0.0  0.0  21384   896 ?  S   12:59  0:00  \_ udevd --daemon
root  1775  0.0  0.0   8448   724 ?  D   13:04  0:00  |   \_ /sbin/blkid -o udev -p /dev/drbd575
So it seems like udev is initiating the blkid call. Could it be doing this
before drbd has finished executing the handler?
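
A small sketch for spotting tasks like that blkid, which sit in
uninterruptible sleep (state D). The function name blocked_tasks is made up
for illustration; it only filters ps-style output given on stdin.

```shell
#!/bin/sh
# Keep the header row plus every row whose STAT column starts with "D"
# (uninterruptible sleep). Expects "pid stat ..." columns on stdin.
blocked_tasks() {
    awk 'NR == 1 || $2 ~ /^D/'
}
```

With procps it could be fed from "ps -eo pid,stat,wchan:32,args |
blocked_tasks"; the wchan column gives a hint about where in the kernel each
task is waiting.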

Thanks,

Chris






lars.ellenberg at linbit

Apr 12, 2012, 6:24 AM

Post #4 of 7 (430 views)
Re: Testing local-io-error handler -- blkid hangs and ties up drbd device

On Thu, Apr 12, 2012 at 09:14:38AM -0400, Chris Dickson wrote:
> So it seems like udev is initiating the blkid call, could it be doing this
> before drbd has finished executing the handler?

If the handler finished (drbd prints "... helper command .... exit code ..."
to the kernel log), there is no reason for anything to hang.

DRBD is supposed to retry failed local requests on the peer, and if that
is not possible (no connection, or no good remote disk either), either
freeze IO (if so configured) or report IO errors back up the stack.
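
As a rough illustration of "if so configured", the fragment below shows
where these policies might live in an 8.4-style drbd.conf. The resource name
r575 is taken from this thread and the handler path is hypothetical. The
option names and their sections are given as understood from the 8.4
drbd.conf(5) man page, so treat them as assumptions and check them against
your version.

```
resource r575 {
    disk {
        # What to do when the backing device reports an IO error.
        on-io-error call-local-io-error;    # alternatives: detach, pass_on
    }
    options {
        # What to do when neither the local nor the remote disk is usable.
        on-no-data-accessible io-error;     # or suspend-io to freeze IO
    }
    handlers {
        local-io-error "/usr/local/sbin/notify-io-error.sh";  # hypothetical
    }
    # volumes, hosts, and network settings omitted
}
```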

"Supposed to just work".

You may be better off downgrading to 8.3.latest; I know we fixed some issues
in the retry logic on the way to 8.4.not-yet-but-"soon"-to-be-released.2

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com


chrisd1100 at gmail

Apr 12, 2012, 6:40 AM

Post #5 of 7 (437 views)
Re: Testing local-io-error handler -- blkid hangs and ties up drbd device

Thanks Lars, dmesg indeed reported the exit code of 0:

[ 332.733554] block drbd575: role( Secondary -> Primary )
[ 332.772827] block drbd575: disk( UpToDate -> Failed )
[ 332.772840] block drbd575: Local IO failed in __req_mod. Detaching...
[ 332.772925] block drbd575: helper command: /sbin/drbdadm local-io-error minor-575
[ 332.790163] block drbd575: helper command: /sbin/drbdadm local-io-error minor-575 exit code 0 (0x0)
[ 332.790189] block drbd575: disk( Failed -> Diskless )
[ 332.803862] block drbd575: receiver updated UUIDs to effective data uuid: 2B81D15C3E0ADD80
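
A small sketch for pulling the helper results out of a kernel log like the
one above. The function name helper_results is made up for illustration, and
it assumes each log entry is on a single line, as dmesg prints it.

```shell
#!/bin/sh
# From kernel log lines on stdin, print "<exit code> <helper command>" for
# every DRBD helper invocation that has finished. Lines that only announce
# the start of a helper (no "exit code") are skipped.
helper_results() {
    sed -n 's/.*helper command: \(.*\) exit code \([0-9][0-9]*\).*/\2 \1/p'
}
```

Typical use would be "dmesg | helper_results"; a handler that was started
but never shows up in the output has not returned yet.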

The peer node is also locked up; all operations report:

r575: State change failed: (-10) State change was refused by peer node

One question about 8.3.latest: one of the reasons I wanted to use 8.4 was
its support for more minor numbers. I don't necessarily need more than 256
devices on one machine, but with the way my numbering system works it is
nice to be able to assign minor numbers greater than 255. Is there a quick
hack somewhere in the source to raise this limit, or is this a more complex
change made for 8.4?

Also, the prefer-remote read-balancing method is something I was
interested in, but it is not strictly necessary.

Thanks,

Chris



chrisd1100 at gmail

Apr 12, 2012, 8:18 AM

Post #6 of 7 (432 views)
Re: Testing local-io-error handler -- blkid hangs and ties up drbd device

A little more info:

If I set the node with the good disk to Primary and then write 100MB to
the drbd volume, the node with the bad disk calls my handler
successfully, detaches, and does not hang. It seems to hang only when I
change the role of the node with the bad disk to Primary.




chrisd1100 at gmail

Apr 12, 2012, 12:51 PM

Post #7 of 7 (447 views)
Re: Testing local-io-error handler -- blkid hangs and ties up drbd device

I can confirm that the issue is not present in either drbd 8.3.13rc1 or
8.4.1 stable. It must have been caused by code introduced between the
8.4.1 release and the current master.

Chris

