
Mailing List Archive: DRBD: Users

recovering from "Local IO failed. Detaching..."

 

 



gianluca.cecchi at gmail

Sep 10, 2009, 6:58 AM

Post #1 of 16 (2276 views)
recovering from "Local IO failed. Detaching..."

I am running Fedora 11 x86_64 with kernel 2.6.30.5-43.fc11.x86_64 and
drbd-8.3.3rc1, compiled from source with "make rpm", so I now have:
[root [at] virtfedbi ]# rpm -qa drbd*
drbd-8.3.3rc1-3.x86_64
drbd-km-2.6.30.5_43.fc11.x86_64-8.3.3rc1-3.x86_64

The configuration is Primary/Primary

I get these messages on one node:
Sep 8 17:32:34 virtfedbis kernel: block drbd0: disk( UpToDate -> Failed )
Sep 8 17:32:34 virtfedbis kernel: block drbd0: Local IO failed. Detaching...
Sep 8 17:32:34 virtfedbis kernel: block drbd0: disk( Failed -> Diskless )
Sep 8 17:32:34 virtfedbis kernel: block drbd0: Notified peer that my disk is broken.

Now the "service drbd status" command on this node gives:
drbd driver loaded OK; device status:
version: 8.3.3rc1 (api:88/proto:86-91)
GIT-hash: 026d60bb0e6a7d5758c6c3e6245f38f6d8b921aa build by
root [at] virtfedbis, 2009-09-08 16:21:30
m:res cs ro ds p mounted fstype
0:r0 Connected Primary/Primary Diskless/UpToDate C

Two questions:

a) Apart from this DRBD message, I don't see any I/O error in the messages
log. How can I check whether an I/O error actually occurred?

b) What are the proper commands to recover, or at least to try to recover,
assuming the disk is OK?
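
One way to approach question a) is to filter the kernel log for block-layer
error lines, which are separate from DRBD's own "Local IO failed" message.
The following is only a minimal sketch: the function name find_io_errors is
hypothetical, and the two patterns are assumptions based on the usual wording
of kernel messages of this era, not a complete list.

```shell
#!/bin/sh
# find_io_errors (hypothetical helper): read kernel log text on stdin and
# print only the lines that look like lower-level block I/O errors.
# DRBD's own "Local IO failed" line deliberately does not match, so any
# output points at a layer below DRBD.
find_io_errors() {
    grep -E 'I/O error|end_request'
}

# Possible usage (commented out so the snippet has no side effects):
#   dmesg | find_io_errors
#   find_io_errors < /var/log/messages
```

If the function prints nothing, the kernel log holds no matching error line;
that alone does not prove the device is healthy.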

The disk is a hardware RAID on an HP blade, and I don't see any hardware error
in the information provided by iLO either.
Does DRBD support some kind of queuing via drbd.conf, or does it inherit
queuing from the SCSI layer, or something else?

The only messages I get before this event are from a few minutes earlier, when
the peer's DRBD daemon started and the sync happened:

Sep 8 17:29:35 virtfedbis kernel: block drbd0: Handshake successful: Agreed
network protocol version 91
Sep 8 17:29:35 virtfedbis kernel: block drbd0: Peer authenticated using 20
bytes of 'sha1' HMAC
Sep 8 17:29:35 virtfedbis kernel: block drbd0: conn( WFConnection ->
WFReportParams )
Sep 8 17:29:35 virtfedbis kernel: block drbd0: Starting asender thread
(from drbd0_receiver [9977])
Sep 8 17:29:35 virtfedbis kernel: block drbd0: data-integrity-alg:
<not-used>
Sep 8 17:29:35 virtfedbis kernel: block drbd0: drbd_sync_handshake:
Sep 8 17:29:35 virtfedbis kernel: block drbd0: self
FFEDAA5E725D8157:0DB564243F5AA9A3:377245292BBD1112:F6DD5DF112448173 bits:0
flags:0
Sep 8 17:29:35 virtfedbis kernel: block drbd0: peer
0DB564243F5AA9A2:0000000000000000:377245292BBD1113:F6DD5DF112448173 bits:0
flags:0
Sep 8 17:29:35 virtfedbis kernel: block drbd0: uuid_compare()=1 by rule 70
Sep 8 17:29:35 virtfedbis kernel: block drbd0: peer( Unknown -> Secondary )
conn( WFReportParams -> WFBitMapS )
Sep 8 17:29:35 virtfedbis kernel: block drbd0: peer( Secondary -> Primary )

Sep 8 17:29:35 virtfedbis kernel: block drbd0: conn( WFBitMapS ->
SyncSource ) pdsk( Outdated -> Inconsistent )
Sep 8 17:29:35 virtfedbis kernel: block drbd0: Began resync as SyncSource
(will sync 0 KB [0 bits set]).
Sep 8 17:29:35 virtfedbis kernel: block drbd0: Resync done (total 1 sec;
paused 0 sec; 0 K/sec)
Sep 8 17:29:35 virtfedbis kernel: block drbd0: conn( SyncSource ->
Connected ) pdsk( Inconsistent -> UpToDate )
Sep 8 17:29:40 virtfedbis kernel: block drbd0: md_sync_timer expired!
Worker calls drbd_md_sync().

The dmesg command gives similar output; its latest rows are:

block drbd0: drbd_sync_handshake:
block drbd0: self
FFEDAA5E725D8157:0DB564243F5AA9A3:377245292BBD1112:F6DD5DF112448173 bits:0
flags:0
block drbd0: peer
0DB564243F5AA9A2:0000000000000000:377245292BBD1113:F6DD5DF112448173 bits:0
flags:0
block drbd0: uuid_compare()=1 by rule 70
block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS
)
block drbd0: peer( Secondary -> Primary )
block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent
)
block drbd0: Began resync as SyncSource (will sync 0 KB [0 bits set]).
block drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate
)
dlm: connecting to 1
block drbd0: md_sync_timer expired! Worker calls drbd_md_sync().
block drbd0: disk( UpToDate -> Failed )
block drbd0: Local IO failed. Detaching...
block drbd0: disk( Failed -> Diskless )
block drbd0: Notified peer that my disk is broken.
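
To follow only the local disk state through output like the above, the
"disk( A -> B )" transitions can be pulled out of the log. This is a rough
sketch under assumptions: the function name drbd_disk_transitions is
hypothetical, it relies on grep's -o option, and it expects unwrapped log
lines (as dmesg prints them, not as wrapped by a mail client).

```shell
#!/bin/sh
# drbd_disk_transitions (hypothetical helper): read DRBD kernel messages on
# stdin and print each local disk state change, one per line.
# Peer disk changes are logged as "pdsk( ... )" and are not matched.
drbd_disk_transitions() {
    grep -o 'disk( [A-Za-z]* -> [A-Za-z]* )'
}

# Possible usage (commented out so the snippet has no side effects):
#   dmesg | drbd_disk_transitions
```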

Thanks,
Gianluca


gianluca.cecchi at gmail

Sep 10, 2009, 8:44 AM

Post #2 of 16 (2182 views)
Re: recovering from "Local IO failed. Detaching..." [In reply to]

I tried the "drbdadm attach r0" command, but I get:

Sep 10 17:38:45 virtfedbis kernel: block drbd0: disk( Diskless -> Attaching
)
Sep 10 17:38:45 virtfedbis kernel: block drbd0: Found 6 transactions (244
active extents) in activity log.
Sep 10 17:38:45 virtfedbis kernel: block drbd0: Method to ensure write
ordering: barrier
Sep 10 17:38:45 virtfedbis kernel: block drbd0: max_segment_size ( = BIO
size ) = 32768
Sep 10 17:38:45 virtfedbis kernel: block drbd0: recounting of set bits took
additional 2 jiffies
Sep 10 17:38:45 virtfedbis kernel: block drbd0: 0 KB (0 bits) marked
out-of-sync by on disk bit-map.
Sep 10 17:38:45 virtfedbis kernel: block drbd0: Marked additional 920 MB as
out-of-sync based on AL.
Sep 10 17:38:45 virtfedbis kernel: end_request: I/O error, dev cciss/c0d0,
sector 0
Sep 10 17:38:45 virtfedbis kernel: block drbd0: meta data flush failed with
status -95, disabling md-flushes
Sep 10 17:38:45 virtfedbis kernel: block drbd0: disk( Attaching ->
Negotiating )
Sep 10 17:38:45 virtfedbis kernel: block drbd0: drbd_sync_handshake:
Sep 10 17:38:45 virtfedbis kernel: block drbd0: self
FFEDAA5E725D8157:13925DF660B57F5D:0DB564243F5AA9A3:377245292BBD1112
bits:235520 flags:0
Sep 10 17:38:45 virtfedbis kernel: block drbd0: peer
A0332E51B243BEE1:FFEDAA5E725D8157:13925DF660B57F5D:0DB564243F5AA9A3
bits:105320 flags:0
Sep 10 17:38:45 virtfedbis kernel: block drbd0: uuid_compare()=-1 by rule 50
Sep 10 17:38:45 virtfedbis kernel: block drbd0: conn( Connected -> WFBitMapT
) disk( Negotiating -> Outdated )
Sep 10 17:38:45 virtfedbis kernel: block drbd0: conn( WFBitMapT ->
WFSyncUUID )
Sep 10 17:38:45 virtfedbis kernel: block drbd0: helper command:
/sbin/drbdadm before-resync-target minor-0
Sep 10 17:38:45 virtfedbis kernel: block drbd0: helper command:
/sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Sep 10 17:38:45 virtfedbis kernel: block drbd0: conn( WFSyncUUID ->
SyncTarget ) disk( Outdated -> Inconsistent )
Sep 10 17:38:45 virtfedbis kernel: block drbd0: Began resync as SyncTarget
(will sync 1363360 KB [340840 bits set]).
Sep 10 17:38:48 virtfedbis kernel: block drbd0: Resync aborted.
Sep 10 17:38:48 virtfedbis kernel: block drbd0: conn( SyncTarget ->
Connected ) disk( Inconsistent -> Failed )
Sep 10 17:38:48 virtfedbis kernel: block drbd0: Local IO failed.
Detaching...
Sep 10 17:38:48 virtfedbis kernel: block drbd0: Can not write mirrored data
block to local disk.
Sep 10 17:38:48 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
called, but extent not found
Sep 10 17:38:48 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
called, but extent not found
Sep 10 17:38:48 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
called, but extent not found
Sep 10 17:38:48 virtfedbis kernel: block drbd0: Can not write mirrored data
block to local disk.
Sep 10 17:38:48 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
called, but extent not found
Sep 10 17:38:48 virtfedbis kernel: block drbd0: Can not write mirrored data
block to local disk.
Sep 10 17:38:48 virtfedbis kernel: block drbd0: Can not write mirrored data
block to local disk.
Sep 10 17:38:48 virtfedbis kernel: block drbd0: Can not write mirrored data
block to local disk.
Sep 10 17:38:48 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
called, but extent not found
Sep 10 17:38:48 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
called, but extent not found
Sep 10 17:38:48 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
called, but extent not found
Sep 10 17:38:48 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
called, but extent not found
Sep 10 17:38:48 virtfedbis kernel: block drbd0: disk( Failed -> Diskless )
Sep 10 17:38:48 virtfedbis kernel: block drbd0: Notified peer that my disk
is broken.
Sep 10 17:38:48 virtfedbis kernel: block drbd0: Can not write resync data to
local disk.
Sep 10 17:38:48 virtfedbis kernel: block drbd0: Can not write resync data to
local disk.
Sep 10 17:38:48 virtfedbis kernel: block drbd0: Can not write resync data to
local disk.
Sep 10 17:38:48 virtfedbis kernel: block drbd0: Can not write resync data to
local disk.
Sep 10 17:38:48 virtfedbis kernel: block drbd0: Can not write resync data to
local disk.
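
The sizes in this log follow from the bit counts: each out-of-sync bit
appears to stand for 4 KB of data, since 340840 bits are reported as
1363360 KB. A small sketch of that arithmetic; the function name
drbd_bits_to_kb is hypothetical, and the 4 KB per bit factor is an assumption
taken from the figures in the log itself.

```shell
#!/bin/sh
# drbd_bits_to_kb (hypothetical helper): convert a DRBD out-of-sync bit
# count to KB, assuming one bitmap bit covers 4 KB.
drbd_bits_to_kb() {
    echo $(( $1 * 4 ))
}

drbd_bits_to_kb 340840   # prints 1363360, the "will sync" size above
drbd_bits_to_kb 235520   # prints 942080, i.e. the 920 MB marked based on AL
```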

My r0 resource uses /dev/cciss/c0d0p3.
A dd read of it shows no problems (I interrupted it after a while):
[root [at] virtfedbi ~]# time dd if=/dev/cciss/c0d0p3 of=/dev/null bs=1024k
^C4230+0 records in
4229+0 records out
4434427904 bytes (4.4 GB) copied, 44.5441 s, 99.6 MB/s


real 0m44.546s
user 0m0.003s
sys 0m4.575s

What exactly does "sector 0" mean in this line?
Sep 10 17:38:45 virtfedbis kernel: end_request: I/O error, dev cciss/c0d0,
sector 0
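
A related hint is the "status -95" in the "meta data flush failed" line of
the log above: kernel status codes of this form are negated Linux errno
values, and errno 95 is EOPNOTSUPP. A plausible reading (an assumption, not
something the log states outright) is that the failing request was a flush or
barrier request carrying no data, which would explain why no real sector is
reported. The sketch below decodes only the two values relevant here; the
function name drbd_status_name is hypothetical.

```shell
#!/bin/sh
# drbd_status_name (hypothetical helper): map a (possibly negative) kernel
# status code to its Linux errno name. Only two values are covered.
drbd_status_name() {
    case "${1#-}" in            # strip a leading minus sign, if any
        5)  echo "EIO (I/O error)" ;;
        95) echo "EOPNOTSUPP (operation not supported)" ;;
        *)  echo "unknown (see errno.h)" ;;
    esac
}

drbd_status_name -95   # prints: EOPNOTSUPP (operation not supported)
```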

Thanks in advance,
Gianluca


lars.ellenberg at linbit

Sep 10, 2009, 9:20 AM

Post #3 of 16 (2194 views)
Re: recovering from "Local IO failed. Detaching..." [In reply to]

On Thu, Sep 10, 2009 at 05:44:42PM +0200, Gianluca Cecchi wrote:
> tried the "drbdadm attach r0" command but I get:
>
> Sep 10 17:38:45 virtfedbis kernel: block drbd0: disk( Diskless -> Attaching
> )
> Sep 10 17:38:45 virtfedbis kernel: block drbd0: Found 6 transactions (244
> active extents) in activity log.
> Sep 10 17:38:45 virtfedbis kernel: block drbd0: Method to ensure write
> ordering: barrier
> Sep 10 17:38:45 virtfedbis kernel: block drbd0: max_segment_size ( = BIO
> size ) = 32768
> Sep 10 17:38:45 virtfedbis kernel: block drbd0: recounting of set bits took
> additional 2 jiffies
> Sep 10 17:38:45 virtfedbis kernel: block drbd0: 0 KB (0 bits) marked
> out-of-sync by on disk bit-map.
> Sep 10 17:38:45 virtfedbis kernel: block drbd0: Marked additional 920 MB as
> out-of-sync based on AL.
> Sep 10 17:38:45 virtfedbis kernel: end_request: I/O error, dev cciss/c0d0,
> sector 0

this is interesting.
"should not happen",
and happens below drbd.
so something actually is not as it should be at the cciss level,
or at some other involved hardware level.

> Sep 10 17:38:45 virtfedbis kernel: block drbd0: meta data flush failed with
> status -95, disabling md-flushes
> Sep 10 17:38:45 virtfedbis kernel: block drbd0: disk( Attaching ->
> Negotiating )
> Sep 10 17:38:45 virtfedbis kernel: block drbd0: drbd_sync_handshake:
> Sep 10 17:38:45 virtfedbis kernel: block drbd0: self
> FFEDAA5E725D8157:13925DF660B57F5D:0DB564243F5AA9A3:377245292BBD1112
> bits:235520 flags:0
> Sep 10 17:38:45 virtfedbis kernel: block drbd0: peer
> A0332E51B243BEE1:FFEDAA5E725D8157:13925DF660B57F5D:0DB564243F5AA9A3
> bits:105320 flags:0
> Sep 10 17:38:45 virtfedbis kernel: block drbd0: uuid_compare()=-1 by rule 50
> Sep 10 17:38:45 virtfedbis kernel: block drbd0: conn( Connected -> WFBitMapT
> ) disk( Negotiating -> Outdated )
> Sep 10 17:38:45 virtfedbis kernel: block drbd0: conn( WFBitMapT ->
> WFSyncUUID )
> Sep 10 17:38:45 virtfedbis kernel: block drbd0: helper command:
> /sbin/drbdadm before-resync-target minor-0
> Sep 10 17:38:45 virtfedbis kernel: block drbd0: helper command:
> /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
> Sep 10 17:38:45 virtfedbis kernel: block drbd0: conn( WFSyncUUID ->
> SyncTarget ) disk( Outdated -> Inconsistent )
> Sep 10 17:38:45 virtfedbis kernel: block drbd0: Began resync as SyncTarget
> (will sync 1363360 KB [340840 bits set]).
> Sep 10 17:38:48 virtfedbis kernel: block drbd0: Resync aborted.
> Sep 10 17:38:48 virtfedbis kernel: block drbd0: conn( SyncTarget ->
> Connected ) disk( Inconsistent -> Failed )
> Sep 10 17:38:48 virtfedbis kernel: block drbd0: Local IO failed.
> Detaching...

detaches "without visible reason"...

I've seen similar symptoms before, and it could be worked around by
disabling offloading settings on the NICs used for the replication ;)
I know, that interaction sounds a bit far-fetched, but those are the
facts.

# to view offload settings
ethtool -k eth7
# to switch them all off:
ethtool -K eth7 rx off tx off sg off tso off


--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


gianluca.cecchi at gmail

Sep 10, 2009, 9:28 AM

Post #4 of 16 (2189 views)
Re: recovering from "Local IO failed. Detaching..." [In reply to]

On Thu, Sep 10, 2009 at 6:20 PM, Lars Ellenberg
<lars.ellenberg [at] linbit> wrote:

> [snip]
>
> I've seen similar symptoms before, and it could be worked around by
> disabling offloading settings on the NICs used for the replication ;)
> I know, that interaction sounds a bit far-fetched, but those are the
> facts.
>
> # to view offload settings
> ethtool -k eth7
> # to switch them all off:
> ethtool -K eth7 rx off tx off sg off tso off
>
>
>
[root [at] virtfedbi ~]# ethtool -k eth3
Offload parameters for eth3:
Cannot get device flags: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off

[root [at] virtfedbi ~]# ethtool -K eth3 rx off tx off sg off tso off

[root [at] virtfedbi ~]# ethtool -k eth3
Offload parameters for eth3:
Cannot get device flags: Operation not supported
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off
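
Note that generic-segmentation-offload is still reported as "on" after the
command above. To spot such leftovers quickly, the "ethtool -k" output can be
filtered for features that remain enabled. A minimal sketch, assuming the
"name: on|off" output format shown here; the function name offloads_still_on
is hypothetical.

```shell
#!/bin/sh
# offloads_still_on (hypothetical helper): read "ethtool -k" output on stdin
# and print the name of every offload feature whose value is exactly "on".
offloads_still_on() {
    awk -F': ' '$2 == "on" { print $1 }'
}

# Possible usage (commented out so the snippet has no side effects):
#   ethtool -k eth3 | offloads_still_on
```

If a feature such as generic-segmentation-offload is listed, it could
presumably be switched off as well with "ethtool -K eth3 gso off"; whether
that matters for this problem is not established in this thread.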

If I try the attach without applying the same settings to eth3 on the other
peer, I get:

Sep 10 18:24:34 virtfedbis kernel: block drbd0: disk( Diskless -> Attaching
)
Sep 10 18:24:34 virtfedbis kernel: block drbd0: Found 6 transactions (244
active extents) in activity log.
Sep 10 18:24:34 virtfedbis kernel: block drbd0: Method to ensure write
ordering: barrier
Sep 10 18:24:34 virtfedbis kernel: block drbd0: max_segment_size ( = BIO
size ) = 32768
Sep 10 18:24:34 virtfedbis kernel: block drbd0: recounting of set bits took
additional 1 jiffies
Sep 10 18:24:34 virtfedbis kernel: block drbd0: 920 MB (235520 bits) marked
out-of-sync by on disk bit-map.
Sep 10 18:24:34 virtfedbis kernel: block drbd0: Marked additional 0 KB as
out-of-sync based on AL.
Sep 10 18:24:34 virtfedbis kernel: end_request: I/O error, dev cciss/c0d0,
sector 0
Sep 10 18:24:34 virtfedbis kernel: block drbd0: meta data flush failed with
status -95, disabling md-flushes
Sep 10 18:24:34 virtfedbis kernel: block drbd0: disk( Attaching ->
Negotiating )
Sep 10 18:24:34 virtfedbis kernel: block drbd0: drbd_sync_handshake:
Sep 10 18:24:34 virtfedbis kernel: block drbd0: self
D5C42445B9F5C227:0000000000000000:0DB564243F5AA9A3:377245292BBD1112
bits:235520 flags:0
Sep 10 18:24:34 virtfedbis kernel: block drbd0: peer
A0332E51B243BEE1:D5C42445B9F5C227:FFEDAA5E725D8157:13925DF660B57F5D
bits:309189 flags:0
Sep 10 18:24:34 virtfedbis kernel: block drbd0: uuid_compare()=-1 by rule 50
Sep 10 18:24:34 virtfedbis kernel: block drbd0: Becoming sync target due to
disk states.
Sep 10 18:24:34 virtfedbis kernel: block drbd0: conn( Connected -> WFBitMapT
) disk( Negotiating -> Outdated )
Sep 10 18:24:34 virtfedbis kernel: block drbd0: conn( WFBitMapT ->
WFSyncUUID )
Sep 10 18:24:34 virtfedbis kernel: block drbd0: helper command:
/sbin/drbdadm before-resync-target minor-0
Sep 10 18:24:34 virtfedbis kernel: block drbd0: helper command:
/sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Sep 10 18:24:34 virtfedbis kernel: block drbd0: conn( WFSyncUUID ->
SyncTarget ) disk( Outdated -> Inconsistent )
Sep 10 18:24:34 virtfedbis kernel: block drbd0: Began resync as SyncTarget
(will sync 1236756 KB [309189 bits set]).
Sep 10 18:24:34 virtfedbis kernel: block drbd0: Resync aborted.
Sep 10 18:24:34 virtfedbis kernel: block drbd0: conn( SyncTarget ->
Connected ) disk( Inconsistent -> Failed )
Sep 10 18:24:34 virtfedbis kernel: block drbd0: Local IO failed.
Detaching...
Sep 10 18:24:34 virtfedbis kernel: block drbd0: disk( Failed -> Diskless )
Sep 10 18:24:34 virtfedbis kernel: block drbd0: Notified peer that my disk
is broken.

Even after applying the same settings on the other peer, I get:

Sep 10 18:26:06 virtfedbis kernel: block drbd0: disk( Diskless -> Attaching
)
Sep 10 18:26:06 virtfedbis kernel: block drbd0: Found 6 transactions (244
active extents) in activity log.
Sep 10 18:26:06 virtfedbis kernel: block drbd0: Method to ensure write
ordering: barrier
Sep 10 18:26:06 virtfedbis kernel: block drbd0: max_segment_size ( = BIO
size ) = 32768
Sep 10 18:26:06 virtfedbis kernel: block drbd0: recounting of set bits took
additional 1 jiffies
Sep 10 18:26:06 virtfedbis kernel: block drbd0: 920 MB (235520 bits) marked
out-of-sync by on disk bit-map.
Sep 10 18:26:06 virtfedbis kernel: block drbd0: Marked additional 0 KB as
out-of-sync based on AL.
Sep 10 18:26:06 virtfedbis kernel: end_request: I/O error, dev cciss/c0d0,
sector 0
Sep 10 18:26:06 virtfedbis kernel: block drbd0: meta data flush failed with
status -95, disabling md-flushes
Sep 10 18:26:06 virtfedbis kernel: block drbd0: disk( Attaching ->
Negotiating )
Sep 10 18:26:06 virtfedbis kernel: block drbd0: drbd_sync_handshake:
Sep 10 18:26:06 virtfedbis kernel: block drbd0: self
FAFACA8496A4ED9D:0000000000000000:0DB564243F5AA9A3:377245292BBD1112
bits:235520 flags:0
Sep 10 18:26:06 virtfedbis kernel: block drbd0: peer
A0332E51B243BEE1:FAFACA8496A4ED9D:D5C42445B9F5C227:FFEDAA5E725D8157
bits:310129 flags:0
Sep 10 18:26:06 virtfedbis kernel: block drbd0: uuid_compare()=-1 by rule 50
Sep 10 18:26:06 virtfedbis kernel: block drbd0: Becoming sync target due to
disk states.
Sep 10 18:26:06 virtfedbis kernel: block drbd0: conn( Connected -> WFBitMapT
) disk( Negotiating -> Outdated )
Sep 10 18:26:06 virtfedbis kernel: block drbd0: conn( WFBitMapT ->
WFSyncUUID )
Sep 10 18:26:06 virtfedbis kernel: block drbd0: helper command:
/sbin/drbdadm before-resync-target minor-0
Sep 10 18:26:06 virtfedbis kernel: block drbd0: helper command:
/sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Sep 10 18:26:06 virtfedbis kernel: block drbd0: conn( WFSyncUUID ->
SyncTarget ) disk( Outdated -> Inconsistent )
Sep 10 18:26:06 virtfedbis kernel: block drbd0: Began resync as SyncTarget
(will sync 1240516 KB [310129 bits set]).
Sep 10 18:26:07 virtfedbis kernel: block drbd0: Resync aborted.
Sep 10 18:26:07 virtfedbis kernel: block drbd0: conn( SyncTarget ->
Connected ) disk( Inconsistent -> Failed )
Sep 10 18:26:07 virtfedbis kernel: block drbd0: Local IO failed.
Detaching...
Sep 10 18:26:07 virtfedbis kernel: block drbd0: 1121 messages suppressed in
/root/drbd-8.3.3rc1/dist/BUILD/drbd-8.3.3rc1/drbd/drbd_receiver.c:1573.
Sep 10 18:26:07 virtfedbis kernel: block drbd0: Can not write resync data to
local disk.
Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
called, but extent not found
Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
called, but extent not found
Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
called, but extent not found
Sep 10 18:26:07 virtfedbis kernel: block drbd0: Can not write resync data to
local disk.
Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
called, but extent not found
Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
called, but extent not found
Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
called, but extent not found
Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
called, but extent not found
Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
called, but extent not found
Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
called, but extent not found
Sep 10 18:26:07 virtfedbis kernel: block drbd0: Can not write resync data to
local disk.
Sep 10 18:26:07 virtfedbis kernel: block drbd0: Can not write resync data to
local disk.
Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
called, but extent not found
Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
called, but extent not found
Sep 10 18:26:07 virtfedbis kernel: block drbd0: disk( Failed -> Diskless )
Sep 10 18:26:07 virtfedbis kernel: block drbd0: Notified peer that my disk
is broken.
Sep 10 18:26:07 virtfedbis kernel: block drbd0: Can not write resync data to
local disk.


lars.ellenberg at linbit

Sep 10, 2009, 9:47 AM

Post #5 of 16 (2180 views)
Re: recovering from "Local IO failed. Detaching..." [In reply to]

On Thu, Sep 10, 2009 at 06:28:23PM +0200, Gianluca Cecchi wrote:
> On Thu, Sep 10, 2009 at 6:20 PM, Lars Ellenberg
> <lars.ellenberg [at] linbit>wrote:
>
> > [snip]
> >
> > I've seen similar symptoms before, and it could be worked around by
> > disabling offloading settings on the NICs used for the replication ;)
> > I know, that interaction sounds a bit far-fetched, but those are the
> > facts.
> >
> > # to view offload settings
> > ethtool -k eth7
> > # to switch them all off:
> > ethtool -K eth7 rx off tx off sg off tso off
> >
> >
> >
> [root [at] virtfedbi ~]# ethtool -k eth3
> Offload parameters for eth3:
> Cannot get device flags: Operation not supported
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp-segmentation-offload: on
> udp-fragmentation-offload: off
> generic-segmentation-offload: on
> generic-receive-offload: off
> large-receive-offload: off
>
> [root [at] virtfedbi ~]# ethtool -K eth3 rx off tx off sg off tso off
>
> [root [at] virtfedbi ~]# ethtool -k eth3
> Offload parameters for eth3:
> Cannot get device flags: Operation not supported
> rx-checksumming: off
> tx-checksumming: off
> scatter-gather: off
> tcp-segmentation-offload: off
> udp-fragmentation-offload: off
> generic-segmentation-offload: on
> generic-receive-offload: off
> large-receive-offload: off
>
> If I try the attach without doing the same settings on other peer eth3 I
> get:
>
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: disk( Diskless -> Attaching
> )
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: Found 6 transactions (244
> active extents) in activity log.
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: Method to ensure write
> ordering: barrier
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: max_segment_size ( = BIO
> size ) = 32768
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: recounting of set bits took
> additional 1 jiffies
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: 920 MB (235520 bits) marked
> out-of-sync by on disk bit-map.
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: Marked additional 0 KB as
> out-of-sync based on AL.
> Sep 10 18:24:34 virtfedbis kernel: end_request: I/O error, dev cciss/c0d0,
> sector 0
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: meta data flush failed with
> status -95, disabling md-flushes
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: disk( Attaching ->
> Negotiating )
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: drbd_sync_handshake:
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: self
> D5C42445B9F5C227:0000000000000000:0DB564243F5AA9A3:377245292BBD1112
> bits:235520 flags:0
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: peer
> A0332E51B243BEE1:D5C42445B9F5C227:FFEDAA5E725D8157:13925DF660B57F5D
> bits:309189 flags:0
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: uuid_compare()=-1 by rule 50
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: Becoming sync target due to
> disk states.
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: conn( Connected -> WFBitMapT
> ) disk( Negotiating -> Outdated )
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: conn( WFBitMapT ->
> WFSyncUUID )
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: helper command:
> /sbin/drbdadm before-resync-target minor-0
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: helper command:
> /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: conn( WFSyncUUID ->
> SyncTarget ) disk( Outdated -> Inconsistent )
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: Began resync as SyncTarget
> (will sync 1236756 KB [309189 bits set]).
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: Resync aborted.
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: conn( SyncTarget ->
> Connected ) disk( Inconsistent -> Failed )
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: Local IO failed.
> Detaching...
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: disk( Failed -> Diskless )
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: Notified peer that my disk
> is broken.
>
> Even after setting same on other peer I get:
>
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: disk( Diskless -> Attaching
> )
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: Found 6 transactions (244
> active extents) in activity log.
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: Method to ensure write
> ordering: barrier
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: max_segment_size ( = BIO
> size ) = 32768
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: recounting of set bits took
> additional 1 jiffies
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: 920 MB (235520 bits) marked
> out-of-sync by on disk bit-map.
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: Marked additional 0 KB as
> out-of-sync based on AL.
> Sep 10 18:26:06 virtfedbis kernel: end_request: I/O error, dev cciss/c0d0,
> sector 0
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: meta data flush failed with
> status -95, disabling md-flushes
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: disk( Attaching ->
> Negotiating )
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: drbd_sync_handshake:
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: self
> FAFACA8496A4ED9D:0000000000000000:0DB564243F5AA9A3:377245292BBD1112
> bits:235520 flags:0
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: peer
> A0332E51B243BEE1:FAFACA8496A4ED9D:D5C42445B9F5C227:FFEDAA5E725D8157
> bits:310129 flags:0
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: uuid_compare()=-1 by rule 50
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: Becoming sync target due to
> disk states.
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: conn( Connected -> WFBitMapT
> ) disk( Negotiating -> Outdated )
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: conn( WFBitMapT ->
> WFSyncUUID )
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: helper command:
> /sbin/drbdadm before-resync-target minor-0
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: helper command:
> /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: conn( WFSyncUUID ->
> SyncTarget ) disk( Outdated -> Inconsistent )
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: Began resync as SyncTarget
> (will sync 1240516 KB [310129 bits set]).
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: Resync aborted.
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: conn( SyncTarget ->
> Connected ) disk( Inconsistent -> Failed )
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: Local IO failed.
> Detaching...
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: 1121 messages suppressed in
> /root/drbd-8.3.3rc1/dist/BUILD/drbd-8.3.3rc1/drbd/drbd_receiver.c:1573.
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: Can not write resync data to
> local disk.
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
> called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
> called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
> called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: Can not write resync data to
> local disk.
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
> called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
> called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
> called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
> called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
> called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
> called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: Can not write resync data to
> local disk.
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: Can not write resync data to
> local disk.
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
> called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io()
> called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: disk( Failed -> Diskless )
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: Notified peer that my disk
> is broken.
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: Can not write resync data to
> local disk.


too bad.
then something is wrong with your hardware, or your setup.
or your kernel.
or, of course, maybe only something is wrong with drbd (in your setup on
your hardware ;-])

care to try
no-disk-flushes;
no-md-flushes;
no-disk-barrier;
?
if that does not help:
8.3.2?
8.3.3rc2?
various other drbd versions? kernels?
different lower level device? (not cciss? other cciss drive/partition?)
etc.
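
For reference, in DRBD 8.3 the three options named above belong in the
"disk" section of the resource in drbd.conf. The fragment below is only a
sketch: the resource name r0 is taken from this thread, and everything else
in the resource is assumed to stay as it already is.

```
resource r0 {
  disk {
    no-disk-barrier;
    no-disk-flushes;
    no-md-flushes;
  }
  # existing net, syncer and "on <host>" sections stay unchanged
}
```

After editing the file on both nodes, the change would normally be applied
with "drbdadm adjust r0". Disabling flushes and barriers trades write-ordering
guarantees for compatibility, so it is best treated as a diagnostic step
unless the controller has a battery-backed write cache.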

if all else fails: contact linbit, we do sell support.

we even sell "drbd health checks", which somewhat boils down to a
one-time engagement - though for those you may need to wait for a
suitable (for linbit) time-slot.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com



lars.ellenberg at linbit

Sep 10, 2009, 10:11 AM

Post #6 of 16 (2178 views)
Re: recovering from "Local IO failed. Detaching..." [In reply to]

On Thu, Sep 10, 2009 at 06:47:24PM +0200, Lars Ellenberg wrote:
> then something is wrong with your hardware, or your setup.
> or your kernel.
> or, of course, maybe only something is wrong with drbd (in your setup on
> your hardware ;-])
>
> care to try
> no-disk-flushes;
> no-md-flushes;
> no-disk-barrier;
> ?

hmmm.
"interesting"

I think only adding "no-md-flushes" should help.
if that does,
please use drbd-8.3.3rc2,
then add below patch,
and __leave off__ the no-md-flushes option again.
so we can confirm that the fallback and retry without barriers
does finally work as expected.

thanks.

usually, non-working barriers are detected early by some other means,
but if the timing on your box is unlucky, drbd may end up in this
function before the other code path has determined that barriers don't work.
and the fallback error path in there apparently has been broken for a
long time :(

diff --git a/drbd/drbd_actlog.c b/drbd/drbd_actlog.c
index 708b689..cb2aa43 100644
--- a/drbd/drbd_actlog.c
+++ b/drbd/drbd_actlog.c
@@ -80,8 +80,6 @@ STATIC int _drbd_md_sync_page_io(struct drbd_conf *mdev,
 	int ok;

 	md_io.mdev = mdev;
-	init_completion(&md_io.event);
-	md_io.error = 0;

 	if (rw == WRITE && !test_bit(MD_NO_BARRIER, &mdev->flags))
 		rw |= (1<<BIO_RW_BARRIER);
@@ -107,6 +105,10 @@ STATIC int _drbd_md_sync_page_io(struct drbd_conf *mdev,

 	trace_drbd_bio(mdev, "Md", bio, 0, NULL);

+	/* on retry, this is re-init */
+	init_completion(&md_io.event);
+	md_io.error = 0;
+
 	if (FAULT_ACTIVE(mdev, (rw & WRITE) ? DRBD_FAULT_MD_WR : DRBD_FAULT_MD_RD))
 		bio_endio(bio, -EIO);
 	else


> if that does not help:
> 8.3.2?
> 8.3.3rc2?
> various other drbd versions? kernels?
> different lower level device? (not cciss? other cciss drive/partition?)
> etc.
>
> if all else fails: contact linbit, we do sell support.
>
> we even sell "drbd health checks", which somewhat boils down to a
> one-time engagement - though for those you may need to wait for a
> suitable (for linbit) time-slot.



lars.ellenberg at linbit

Sep 10, 2009, 2:27 PM

Post #7 of 16 (2176 views)
Permalink
Re: recovering from "Local IO failed. Detaching..." [In reply to]

On Thu, Sep 10, 2009 at 07:11:02PM +0200, Lars Ellenberg wrote:
> On Thu, Sep 10, 2009 at 06:47:24PM +0200, Lars Ellenberg wrote:
> > then something is wrong with your hardware, or your setup.
> > or your kernel.
> > or, of course, maybe only something is wrong with drbd (in your setup on
> > your hardware ;-])
> >
> > care to try
> > no-disk-flushes;
> > no-md-flushes;
> > no-disk-barrier;
> > ?


> then add below patch,

...

> and the fallback error path in there apparently has been broken for a
> long time :(

nonsense.

was a long day ...

it was right all along.

this patch just makes it more explicit,
so it won't hurt, anyways.

but the code as it was before is working correctly.

so I guess you are back to those options below ;)

> > if that does not help:
> > 8.3.2?
> > 8.3.3rc2?
> > various other drbd versions? kernels?
> > different lower level device? (not cciss? other cciss drive/partition?)
> > etc.
> >
> > if all else fails: contact linbit, we do sell support.
> >
> > we even sell "drbd health checks", which somewhat boils down to a
> > one-time engagement - though for those you may need to wait for a
> > suitable (for linbit) time-slot.



gianluca.cecchi at gmail

Sep 10, 2009, 2:31 PM

Post #8 of 16 (2176 views)
Permalink
Re: recovering from "Local IO failed. Detaching..." [In reply to]

On Thu, Sep 10, 2009 at 11:27 PM, Lars Ellenberg
<lars.ellenberg [at] linbit>wrote:

> On Thu, Sep 10, 2009 at 07:11:02PM +0200, Lars Ellenberg wrote:
> > On Thu, Sep 10, 2009 at 06:47:24PM +0200, Lars Ellenberg wrote:
> > > then something is wrong with your hardware, or your setup.
> > > or your kernel.
> > > or, of course, maybe only something is wrong with drbd (in your setup
> on
> > > your hardware ;-])
> > >
> > > care to try
> > > no-disk-flushes;
> > > no-md-flushes;
> > > no-disk-barrier;
> > > ?
>
>
> > then add below patch,
>
> ...
>
> > and the fallback error path in there apparently has been broken for a
> > long time :(
>
> nonsense.
>
> was a long day ...
>
> it was right all along.
>
> this patch just makes it more explicit,
> so it won't hurt, anyways.
>
> but the code as it was before is working correctly.
>
> so I guess you are back to those options below ;)
>
> > > if that does not help:
> > > 8.3.2?
> > > 8.3.3rc2?
> > > various other drbd versions? kernels?
> > > different lower level device? (not cciss? other cciss drive/partition?)
> > > etc.
> > >
> > > if all else fails: contact linbit, we do sell support.
> > >
> > > we even sell "drbd health checks", which somewhat boils down to a
> > > one-time engagement - though for those you may need to wait for a
> > > suitable (for linbit) time-slot.
>
>
No problem!

I have been testing for about 2 months with 8.2, 8.3.2 and 8.3.3rc1 on this
hw config without this kind of problem.
The OS is F11 and the kernel was always based on 2.6.29.
The main thing that changed a few days ago is that F11 moved to kernel 2.6.30
and I updated to it (but I still have the 2.6.29 based one installed).
So this could be one of the causes.

I can test several possibilities tomorrow at the office.
Which order do you prefer, considering the variables outlined:

- kernel 2.6.29 vs 2.6.30
- drbd 8.3.3rc1 vs 8.3.3rc2
- applying the proposed patch (a step no longer needed)
- changing drbd.conf as suggested

In this state, would a "service drbd stop" on the diskless peer succeed? And
on the other one? Would I come back to the same diskless state after a
reboot/drbd reload?

If you give me some ordered steps, I'm available to follow them in order to
provide the most useful information for you.


gianluca.cecchi at gmail

Sep 11, 2009, 1:59 AM

Post #9 of 16 (2165 views)
Permalink
Re: recovering from "Local IO failed. Detaching..." [In reply to]

Ok,
it seemed to me that the simplest, most advisable and most useful step, also
for testing drbd, was to update drbd to 8.3.3rc2 while keeping the same drbd
conf and the 2.6.30 kernel (F11 only recently moved to the 2.6.30 stream, as
I wrote before...)
And that WAS the right approach (at least for now).

After starting the Primary/UpToDate node, it was in this state
...
Starting DRBD resources: [ d(r0) s(r0) n(r0) ]...
[root [at] virtfe x86_64]# service drbd status
drbd driver loaded OK; device status:
version: 8.3.3rc2 (api:88/proto:86-91)
GIT-hash: 04b2f175d7076ef2e0dd7d5ba6f6843357a041ed build by root [at] virtfedbi,
2009-09-11 10:06:20
m:res cs ro ds p mounted fstype
0:r0 WFConnection Primary/Unknown UpToDate/Outdated C

Doing a "service drbd start" on the peer, I get this on the other one
Sep 11 10:44:03 virtfed kernel: block drbd0: Handshake successful: Agreed
network protocol version 91
Sep 11 10:44:03 virtfed kernel: block drbd0: Peer authenticated using 20
bytes of 'sha1' HMAC
Sep 11 10:44:03 virtfed kernel: block drbd0: conn( WFConnection ->
WFReportParams )
Sep 11 10:44:03 virtfed kernel: block drbd0: Starting asender thread (from
drbd0_receiver [11115])
Sep 11 10:44:03 virtfed kernel: block drbd0: data-integrity-alg: <not-used>
Sep 11 10:44:03 virtfed kernel: block drbd0: drbd_sync_handshake:
Sep 11 10:44:03 virtfed kernel: block drbd0: self
A0332E51B243BEE1:7C12A37C6FB9B1CB:DB97F5F6C5FBB26C:FAFACA8496A4ED9D
bits:79098 flags:0
Sep 11 10:44:03 virtfed kernel: block drbd0: peer
7C12A37C6FB9B1CA:0000000000000000:0DB564243F5AA9A3:377245292BBD1112
bits:235520 flags:2
Sep 11 10:44:03 virtfed kernel: block drbd0: uuid_compare()=1 by rule 70
Sep 11 10:44:03 virtfed kernel: block drbd0: Becoming sync source due to
disk states.
Sep 11 10:44:03 virtfed kernel: block drbd0: peer( Unknown -> Secondary )
conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent )
Sep 11 10:44:03 virtfed kernel: block drbd0: peer( Secondary -> Primary )
Sep 11 10:44:03 virtfed kernel: block drbd0: conn( WFBitMapS -> SyncSource )

Sep 11 10:44:03 virtfed kernel: block drbd0: Began resync as SyncSource
(will sync 1258472 KB [314618 bits set]).
Sep 11 10:44:23 virtfed kernel: block drbd0: Resync done (total 19 sec;
paused 0 sec; 66232 K/sec)
Sep 11 10:44:23 virtfed kernel: block drbd0: conn( SyncSource -> Connected )
pdsk( Inconsistent -> UpToDate )

There are no messages from the peer node, because I started it in single user
mode and manually started the network and the sshd daemon... and then drbd, so
the messages file was not populated, but I think dmesg gives the same information:

drbd: initialized. Version: 8.3.3rc2 (api:88/proto:86-91)
drbd: GIT-hash: 04b2f175d7076ef2e0dd7d5ba6f6843357a041ed build by
root [at] virtfedbi, 2009-09-11 10:06:20
drbd: registered as block device major 147
drbd: minor_table @ 0xffff880826f09b00
block drbd0: Starting worker thread (from cqueue [1833])
block drbd0: disk( Diskless -> Attaching )
block drbd0: Found 6 transactions (244 active extents) in activity log.
block drbd0: Method to ensure write ordering: barrier
block drbd0: max_segment_size ( = BIO size ) = 32768
block drbd0: drbd_bm_resize called with capacity == 109317376
block drbd0: resync bitmap: bits=13664672 words=213511
block drbd0: size = 52 GB (54658688 KB)
block drbd0: recounting of set bits took additional 1 jiffies
block drbd0: 920 MB (235520 bits) marked out-of-sync by on disk bit-map.
block drbd0: Marked additional 0 KB as out-of-sync based on AL.
end_request: I/O error, dev cciss/c0d0, sector 0
block drbd0: meta data flush failed with status -95, disabling md-flushes
block drbd0: disk( Attaching -> Inconsistent )
block drbd0: conn( StandAlone -> Unconnected )
block drbd0: Starting receiver thread (from drbd0_worker [1835])
block drbd0: receiver (re)started
block drbd0: conn( Unconnected -> WFConnection )
block drbd0: Handshake successful: Agreed network protocol version 91
block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
block drbd0: conn( WFConnection -> WFReportParams )
block drbd0: Starting asender thread (from drbd0_receiver [1861])
block drbd0: data-integrity-alg: <not-used>
block drbd0: drbd_sync_handshake:
block drbd0: self
7C12A37C6FB9B1CA:0000000000000000:0DB564243F5AA9A3:377245292BBD1112
bits:235520 flags:0
block drbd0: peer
A0332E51B243BEE1:7C12A37C6FB9B1CB:DB97F5F6C5FBB26C:FAFACA8496A4ED9D
bits:79098 flags:0
block drbd0: uuid_compare()=-1 by rule 50
block drbd0: Becoming sync target due to disk states.
block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT )
pdsk( DUnknown -> UpToDate )
block drbd0: role( Secondary -> Primary )
block drbd0: conn( WFBitMapT -> WFSyncUUID )
block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit
code 0 (0x0)
block drbd0: conn( WFSyncUUID -> SyncTarget )
block drbd0: Began resync as SyncTarget (will sync 1258472 KB [314618 bits
set]).
block drbd0: Resync done (total 19 sec; paused 0 sec; 66232 K/sec)
block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate
)
block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit
code 0 (0x0)
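
A side note on reading the numbers in these logs: they are consistent with each bitmap bit covering 4 KB of the device (an assumption checked only against the figures printed above, e.g. "bits=13664672" against "54658688 KB"). A small sketch of that arithmetic, where the function name is made up for illustration:

```python
# Assumption: one resync-bitmap bit covers 4 KB, which matches the
# "bits=13664672" / "size = ... (54658688 KB)" pair in the dmesg output.
KB_PER_BIT = 4


def bits_to_kb(bits: int) -> int:
    """Convert a count of set bitmap bits into KB to be resynced."""
    return bits * KB_PER_BIT


# "Began resync as SyncTarget (will sync 1258472 KB [314618 bits set])"
print(bits_to_kb(314618))
# "920 MB (235520 bits) marked out-of-sync by on disk bit-map"
print(bits_to_kb(235520) // 1024)
# "resync bitmap: bits=13664672" vs "size = 52 GB (54658688 KB)"
print(bits_to_kb(13664672))
```

The same conversion also fits the later resync lines in this thread (345374 bits and 308176 bits).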

Now the state is correct:

[root [at] virtfedbi ~]# service drbd status
drbd driver loaded OK; device status:
version: 8.3.3rc2 (api:88/proto:86-91)
GIT-hash: 04b2f175d7076ef2e0dd7d5ba6f6843357a041ed build by root [at] virtfedbi,
2009-09-11 10:06:20
m:res cs ro ds p mounted fstype
0:r0 Connected Primary/Primary UpToDate/UpToDate C

During the sync phase (some seconds):
[root [at] virtfedbi ~]# cat /proc/drbd
version: 8.3.3rc2 (api:88/proto:86-91)
GIT-hash: 04b2f175d7076ef2e0dd7d5ba6f6843357a041ed build by root [at] virtfedbi,
2009-09-11 10:06:20
0: cs:SyncTarget ro:Primary/Primary ds:Inconsistent/UpToDate C r----
ns:0 nr:1186920 dw:1186824 dr:56 al:0 bm:383 lo:4 pe:2236 ua:3 ap:0 ep:1
wo:b oos:71648
[=================>..] sync'ed: 94.5% (71648/1258472)K
finish: 0:00:01 speed: 64,736 (65,932) K/sec
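
For anyone who wants to watch for the Diskless state automatically, the status line of /proc/drbd can be picked apart with a few lines of script. This is only a sketch based on the 8.3-style output pasted above; the function names and the notion of "healthy" used here are assumptions, not anything drbd itself provides:

```python
import re

# Matches the per-device line, e.g.
# " 0: cs:SyncTarget ro:Primary/Primary ds:Inconsistent/UpToDate C r----"
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)\s+ro:(?P<ro>\S+)\s+ds:(?P<ds>\S+)"
)


def parse_proc_drbd(text):
    """Return one dict per device found in /proc/drbd style text."""
    devices = []
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if match is None:
            continue  # version, GIT-hash, counter and progress lines
        local_role, peer_role = match.group("ro").split("/")
        local_disk, peer_disk = match.group("ds").split("/")
        devices.append({
            "minor": int(match.group("minor")),
            "cs": match.group("cs"),
            "local_role": local_role,
            "peer_role": peer_role,
            "local_disk": local_disk,
            "peer_disk": peer_disk,
        })
    return devices


def is_healthy(device):
    """Hypothetical check: connected and both disks UpToDate."""
    return (device["cs"] == "Connected"
            and device["local_disk"] == "UpToDate"
            and device["peer_disk"] == "UpToDate")


SAMPLE = """version: 8.3.3rc2 (api:88/proto:86-91)
 0: cs:SyncTarget ro:Primary/Primary ds:Inconsistent/UpToDate C r----
    ns:0 nr:1186920 dw:1186824 dr:56 al:0 bm:383 lo:4 pe:2236 ua:3 ap:0 ep:1
"""

for dev in parse_proc_drbd(SAMPLE):
    print(dev["minor"], dev["cs"], dev["local_disk"], is_healthy(dev))
```

On a live node the text would come from reading /proc/drbd instead of the SAMPLE string.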

Note that I rebooted both nodes, so during the start of the peer and the
sync the network interfaces were in their original state:

[root [at] virtfedbi ~]# ethtool -k eth3
Offload parameters for eth3:
Cannot get device flags: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off

Thanks for the answers and support!
Gianluca


gianluca.cecchi at gmail

Sep 15, 2009, 3:06 AM

Post #10 of 16 (2093 views)
Permalink
Re: recovering from "Local IO failed. Detaching..." [In reply to]

I'm back again.

After a short while the bad diskless situation came back with 8.3.3rc2 as well.
What is the first change or debugging action to try?
In particular, I don't understand whether this line in dmesg

end_request: I/O error, dev cciss/c0d0, sector 0

is reported by drbd, by the scsi layer, or by something else...
In theory drbd should not try to read sector 0 at all, as it uses c0d0p3 and
c0d0p4.....
Or does this mean that it is not able to read the partition table....?
Thanks for the help
Gianluca

On the problematic node I have:

drbd: initialized. Version: 8.3.3rc2 (api:88/proto:86-91)
drbd: GIT-hash: 04b2f175d7076ef2e0dd7d5ba6f6843357a041ed build by
root [at] virtfedbi, 2009-09-11 10:06:20
drbd: registered as block device major 147
drbd: minor_table @ 0xffff880829c04700
block drbd0: Starting worker thread (from cqueue [1779])
block drbd0: disk( Diskless -> Attaching )
block drbd0: Found 6 transactions (244 active extents) in activity log.
block drbd0: Method to ensure write ordering: barrier
block drbd0: max_segment_size ( = BIO size ) = 32768
block drbd0: drbd_bm_resize called with capacity == 109317376
block drbd0: resync bitmap: bits=13664672 words=213511
block drbd0: size = 52 GB (54658688 KB)
block drbd0: recounting of set bits took additional 2 jiffies
block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
block drbd0: Marked additional 920 MB as out-of-sync based on AL.
end_request: I/O error, dev cciss/c0d0, sector 0
block drbd0: meta data flush failed with status -95, disabling md-flushes
block drbd0: disk( Attaching -> Consistent )
block drbd0: conn( StandAlone -> Unconnected )
block drbd0: Starting receiver thread (from drbd0_worker [1781])
block drbd0: receiver (re)started
block drbd0: conn( Unconnected -> WFConnection )
block drbd0: Handshake successful: Agreed network protocol version 91
block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
block drbd0: conn( WFConnection -> WFReportParams )
block drbd0: Starting asender thread (from drbd0_receiver [1806])
block drbd0: data-integrity-alg: <not-used>
block drbd0: drbd_sync_handshake:
block drbd0: self
71E806A9BE572C28:0000000000000000:9EB0CCB7634CBCDC:A0332E51B243BEE1
bits:235520 flags:0
block drbd0: peer
81BAD3F384A6F3C7:71E806A9BE572C29:9EB0CCB7634CBCDD:A0332E51B243BEE1
bits:109854 flags:0
block drbd0: uuid_compare()=-1 by rule 50
block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT )
disk( Consistent -> Outdated ) pdsk( DUnknown -> UpToDate )
block drbd0: role( Secondary -> Primary )
block drbd0: conn( WFBitMapT -> WFSyncUUID )
block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit
code 0 (0x0)
block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent
)
block drbd0: Began resync as SyncTarget (will sync 1381496 KB [345374 bits
set]).
block drbd0: Resync aborted.
block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> Failed )
block drbd0: Local IO failed. Detaching...
block drbd0: Can not write resync data to local disk.
block drbd0: Can not write resync data to local disk.
block drbd0: drbd_rs_complete_io() called, but extent not found
block drbd0: drbd_rs_complete_io() called, but extent not found
block drbd0: drbd_rs_complete_io() called, but extent not found
block drbd0: drbd_rs_complete_io() called, but extent not found
block drbd0: drbd_rs_complete_io() called, but extent not found
block drbd0: drbd_rs_complete_io() called, but extent not found
block drbd0: Can not write resync data to local disk.
block drbd0: Can not write resync data to local disk.
block drbd0: Can not write resync data to local disk.
block drbd0: drbd_rs_complete_io() called, but extent not found
block drbd0: drbd_rs_complete_io() called, but extent not found
block drbd0: drbd_rs_complete_io() called, but extent not found
block drbd0: disk( Failed -> Diskless )
block drbd0: Notified peer that my disk is broken.
block drbd0: Can not write mirrored data block to local disk.
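
To make the sequence above easier to follow, the local disk state changes can be pulled out of the kernel messages with a small script. This is a rough sketch that only assumes the "disk( Old -> New )" message format visible in this dmesg output; the function names are made up for illustration:

```python
import re

# "disk( Old -> New )" is the local disk; "pdsk( ... )" is the peer's disk
# and is not matched. \s* also covers lines wrapped before the closing ")".
DISK_TRANSITION_RE = re.compile(r"\bdisk\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def disk_transitions(log_text):
    """Return the local disk state changes as (old, new) pairs, in order."""
    return DISK_TRANSITION_RE.findall(log_text)


def detached_after_failure(log_text):
    """True if the log shows the Failed -> Diskless step (a detach)."""
    return ("Failed", "Diskless") in disk_transitions(log_text)


SAMPLE = """block drbd0: disk( Diskless -> Attaching )
block drbd0: disk( Attaching -> Consistent )
block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT )
disk( Consistent -> Outdated ) pdsk( DUnknown -> UpToDate )
block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent
)
block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> Failed )
block drbd0: Local IO failed. Detaching...
block drbd0: disk( Failed -> Diskless )
"""

for old, new in disk_transitions(SAMPLE):
    print(old, "->", new)
print("detached:", detached_after_failure(SAMPLE))
```

Fed with the lines above, it lists the path Diskless, Attaching, Consistent, Outdated, Inconsistent, Failed, Diskless.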

On the peer node I have:
Sep 15 11:46:29 virtfed kernel: block drbd0: Handshake successful: Agreed
network protocol version 91
Sep 15 11:46:29 virtfed kernel: block drbd0: Peer authenticated using 20
bytes of 'sha1' HMAC
Sep 15 11:46:29 virtfed kernel: block drbd0: conn( WFConnection ->
WFReportParams )
Sep 15 11:46:29 virtfed kernel: block drbd0: Starting asender thread (from
drbd0_receiver [2450])
Sep 15 11:46:29 virtfed kernel: block drbd0: data-integrity-alg: <not-used>
Sep 15 11:46:29 virtfed kernel: block drbd0: drbd_sync_handshake:
Sep 15 11:46:29 virtfed kernel: block drbd0: self
81BAD3F384A6F3C7:71E806A9BE572C29:9EB0CCB7634CBCDD:A0332E51B243BEE1
bits:109854 flags:0
Sep 15 11:46:29 virtfed kernel: block drbd0: peer
71E806A9BE572C28:0000000000000000:9EB0CCB7634CBCDC:A0332E51B243BEE1
bits:235520 flags:2
Sep 15 11:46:29 virtfed kernel: block drbd0: uuid_compare()=1 by rule 70
Sep 15 11:46:29 virtfed kernel: block drbd0: peer( Unknown -> Secondary )
conn( WFReportParams -> WFBitMapS )
Sep 15 11:46:29 virtfed kernel: block drbd0: peer( Secondary -> Primary )
Sep 15 11:46:29 virtfed kernel: block drbd0: conn( WFBitMapS -> SyncSource )
pdsk( Outdated -> Inconsistent )
Sep 15 11:46:29 virtfed kernel: block drbd0: Began resync as SyncSource
(will sync 1381496 KB [345374 bits set]).
Sep 15 11:46:31 virtfed kernel: block drbd0: Got NegAck packet. Peer is in
troubles?
Sep 15 11:46:31 virtfed kernel: block drbd0: Got NegAck packet. Peer is in
troubles?
Sep 15 11:46:31 virtfed kernel: block drbd0: Got NegAck packet. Peer is in
troubles?
Sep 15 11:46:31 virtfed kernel: block drbd0: Got NegAck packet. Peer is in
troubles?
Sep 15 11:46:31 virtfed kernel: block drbd0: Got NegAck packet. Peer is in
troubles?
Sep 15 11:46:32 virtfed kernel: block drbd0: Resync aborted.
Sep 15 11:46:32 virtfed kernel: block drbd0: conn( SyncSource -> Connected )
pdsk( Inconsistent -> Diskless )
Sep 15 11:46:32 virtfed kernel: block drbd0: Not sending RSDataReply,
partner DISKLESS!
Sep 15 11:46:32 virtfed kernel: block drbd0: Not sending RSDataReply,
partner DISKLESS!
Sep 15 11:46:32 virtfed kernel: block drbd0: Not sending RSDataReply,
partner DISKLESS!
Sep 15 11:46:32 virtfed kernel: block drbd0: Not sending RSDataReply,
partner DISKLESS!
Sep 15 11:46:32 virtfed kernel: block drbd0: Not sending RSDataReply,
partner DISKLESS!

My drbd.conf at the moment:
global {
    usage-count yes;
}

common {
    syncer { rate 100M; }
}

resource r0 {
    protocol C;

    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        fence-peer "/usr/local/bin/obliterate-peer.sh";
    }

    startup {
        wfc-timeout 30;
        degr-wfc-timeout 10; # 2 minutes.
        outdated-wfc-timeout 2; # 2 seconds.
        become-primary-on both;
    }

    disk {
        on-io-error detach;
        fencing resource-and-stonith;
    }

    net {
        allow-two-primaries;
        cram-hmac-alg "sha1";
        shared-secret "kvmdrbd";
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
        rr-conflict disconnect;
    }

    syncer {
        al-extents 257;
    }

    on virtfed {
        device /dev/drbd0;
        disk /dev/cciss/c0d0p3;
        address 192.168.16.111:7788;
        flexible-meta-disk /dev/cciss/c0d0p4;
    }

    on virtfedbis {
        device /dev/drbd0;
        disk /dev/cciss/c0d0p3;
        address 192.168.16.112:7788;
        flexible-meta-disk /dev/cciss/c0d0p4;
    }
}


lars.ellenberg at linbit

Sep 15, 2009, 5:50 AM

Post #11 of 16 (2084 views)
Permalink
Re: recovering from "Local IO failed. Detaching..." [In reply to]

On Tue, Sep 15, 2009 at 12:06:28PM +0200, Gianluca Cecchi wrote:
> I'm again here.
>
> After a little the bad diskless situation restarted also with 8.3.3rc2.
> What the first change or action to debug to try?
> In particular, I don't understand if the line in dmesg
>
> end_request: I/O error, dev cciss/c0d0, sector 0
>
> is reported by drbd or scsi layer or what...

generic block layer.

> In theory drbd should not try to read sector 0 at all as It uses c0d0p3 and
> c0d0p4.....

who says it's DRBD that is reading there?
who says that it was a read request?

> or does this mean that it is not able to read the partition table....?

dunno.

> Thanks for help
> Gianluca
>
> On the problematic node I have:
>
> drbd: initialized. Version: 8.3.3rc2 (api:88/proto:86-91)

please try current git, if you can.
http://git.drbd.org/?p=drbd-8.3.git;a=summary
there has been one regression in this area
somewhere between 8.3.2 and 8.3.3rc1,
which now is fixed again.




gianluca.cecchi at gmail

Sep 15, 2009, 6:35 AM

Post #12 of 16 (2075 views)
Permalink
Re: recovering from "Local IO failed. Detaching..." [In reply to]

On Tue, Sep 15, 2009 at 2:50 PM, Lars Ellenberg
<lars.ellenberg [at] linbit>wrote:

> On Tue, Sep 15, 2009 at 12:06:28PM +0200, Gianluca Cecchi wrote:
> [snip]
>
> > In theory drbd should not try to read sector 0 at all as It uses c0d0p3
> and
> > c0d0p4.....
>
> who says its DRBD that is reading there.
> who says that has been a read request.
>

They were only assumptions, as the I/O errors showed up in between the drbd
messages....
I posted the questions precisely because I would like someone to give me some
insight into my doubts.... ;-)

> [snip]
>
>
> please try current git, if you can.
> http://git.drbd.org/?p=drbd-8.3.git;a=summary
> there has been one regression in this area
> somewhere between 8.3.2 and 8.3.3rc1,
> which now is fixed again.
>

I would like to, but I'm behind a proxy.
I searched for how to use git through a proxy and tried some configurations;
I can, for example, clone the wine git repository, but not the drbd one.
Do you serve your git repository through http too?
Attempting this:
git clone http://git.drbd.org/drbd-8.3

I get:
Initialized empty Git repository in /home/drbd/git_150909/drbd-8.3/.git/
fatal: http://git.drbd.org/drbd-8.3/info/refs not found: did you run git
update-server-info on the server?

while for example
git clone http://source.winehq.org/git/wine.git wine
Initialized empty Git repository in /home/wine/git_150909/wine/.git/
Getting alternates list for http://source.winehq.org/git/wine.git
Getting pack list for http://source.winehq.org/git/wine.git
Getting index for pa......
etc.

Sorry: I'm willing to try the git version, but I'm not very experienced with
git itself...

Gianluca


lars.ellenberg at linbit

Sep 16, 2009, 5:20 AM

Post #13 of 16 (2052 views)
Permalink
Re: recovering from "Local IO failed. Detaching..." [In reply to]

On Tue, Sep 15, 2009 at 03:35:46PM +0200, Gianluca Cecchi wrote:
> > please try current git, if you can.
> > http://git.drbd.org/?p=drbd-8.3.git;a=summary
> > there has been one regression in this area
> > somewhere between 8.3.2 and 8.3.3rc1,
> > which now is fixed again.
> >
>
> I would like but I'm behind a proxy.
> I tried some configurations for proxy, searching how to use git through a
> proxy, but I can for example get git for wine, but not for drbd.
> Do you serve your git repository through http too?
> Attempting this:
> git clone http://git.drbd.org/drbd-8.3

git clone http://git.drbd.org/drbd-8.3.git

might work. a bit many git in there, I know,
but we are dealing with redundancy anyways, after all.

though the git:// protocol is faster and preferred.

in case said regression should be the reason for your trouble, of course
you could also go back to 8.3.2 (which does not contain that
regression), or wait for 8.3.3 final.

or, as I suggested earlier, add "no-disk-barrier; no-disk-flushes;
no-md-flushes;" to your disk {} section, which would be a valid
work-around for said regression.
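
As a quick sanity check after editing the config, one could verify that the three options really ended up in the disk {} section. The snippet below is only a sketch: it uses a very naive parser (first "disk { ... }" block, no nesting, "#" comments stripped), and the function names are made up for illustration:

```python
import re

WORKAROUND_OPTIONS = ("no-disk-barrier", "no-disk-flushes", "no-md-flushes")


def disk_section_options(conf_text):
    """Return the statements of the first 'disk { ... }' block, if any."""
    match = re.search(r"\bdisk\s*\{([^}]*)\}", conf_text)
    if match is None:
        return []
    body = re.sub(r"#.*", "", match.group(1))  # drop trailing comments
    return [stmt.strip() for stmt in body.split(";") if stmt.strip()]


def missing_workaround_options(conf_text):
    """List the workaround options that are not yet in the disk section."""
    present = disk_section_options(conf_text)
    return [opt for opt in WORKAROUND_OPTIONS if opt not in present]


# Disk section as it might look with the workaround added (illustration only).
PATCHED = """
disk {
    on-io-error detach;
    fencing resource-and-stonith;
    no-disk-barrier;
    no-disk-flushes;
    no-md-flushes;
}
"""

print(missing_workaround_options(PATCHED))
```

Note that the per-host "disk /dev/cciss/c0d0p3;" statements are not confused with the section, since they are not followed by a brace.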



gianluca.cecchi at gmail

Sep 16, 2009, 7:20 AM

Post #14 of 16 (2049 views)
Permalink
Re: recovering from "Local IO failed. Detaching..." [In reply to]

With drbd from git installed on the peer and after rebooting it, while keeping
the sync source on 8.3.3rc2, synchronization now succeeds.

I have
[root [at] virtfedbi ~]# cat /proc/drbd
version: 8.3.3rc2 (api:88/proto:86-91)
GIT-hash: 0acb7c07a61225ba880fde2a32b8f5f8fa49c8cc build by root [at] virtfedbi,
2009-09-16 16:01:09
0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----
ns:0 nr:1234556 dw:1234556 dr:56 al:0 bm:345 lo:0 pe:0 ua:0 ap:0 ep:1
wo:d oos:0

[root [at] virtfe ~]# cat /proc/drbd
version: 8.3.3rc2 (api:88/proto:86-91)
GIT-hash: 04b2f175d7076ef2e0dd7d5ba6f6843357a041ed build by root [at] virtfedbi,
2009-09-11 10:06:20
0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----
ns:2277548 nr:0 dw:22925304 dr:5310672 al:404 bm:794 lo:0 pe:0 ua:0 ap:0
ep:1 wo:b oos:0

I'm going to update virtfed too.....

Messages in virtfed
Sep 16 16:08:04 virtfed kernel: block drbd0: Handshake successful: Agreed
network protocol version 91
Sep 16 16:08:04 virtfed kernel: block drbd0: Peer authenticated using 20
bytes of 'sha1' HMAC
Sep 16 16:08:04 virtfed kernel: block drbd0: conn( WFConnection ->
WFReportParams )
Sep 16 16:08:04 virtfed kernel: block drbd0: Starting asender thread (from
drbd0_receiver [2450])
Sep 16 16:08:04 virtfed kernel: block drbd0: data-integrity-alg: <not-used>
Sep 16 16:08:04 virtfed kernel: block drbd0: drbd_sync_handshake:
Sep 16 16:08:04 virtfed kernel: block drbd0: self
81BAD3F384A6F3C7:7FC155C9F5183159:13247E4B98A2B256:71E806A9BE572C29
bits:308176 flags:0
Sep 16 16:08:04 virtfed kernel: block drbd0: peer
7FC155C9F5183158:0000000000000000:9EB0CCB7634CBCDC:A0332E51B243BEE1
bits:235520 flags:2
Sep 16 16:08:04 virtfed kernel: block drbd0: uuid_compare()=1 by rule 70
Sep 16 16:08:04 virtfed kernel: block drbd0: Becoming sync source due to
disk states.
Sep 16 16:08:04 virtfed kernel: block drbd0: peer( Unknown -> Secondary )
conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent )
Sep 16 16:08:04 virtfed kernel: block drbd0: peer( Secondary -> Primary )
Sep 16 16:08:04 virtfed kernel: block drbd0: conn( WFBitMapS -> SyncSource )

Sep 16 16:08:04 virtfed kernel: block drbd0: Began resync as SyncSource
(will sync 1232704 KB [308176 bits set]).
Sep 16 16:08:22 virtfed kernel: block drbd0: Resync done (total 18 sec;
paused 0 sec; 68480 K/sec)
Sep 16 16:08:22 virtfed kernel: block drbd0: conn( SyncSource -> Connected )
pdsk( Inconsistent -> UpToDate )

dmesg on virtfedbis:
drbd: initialized. Version: 8.3.3rc2 (api:88/proto:86-91)
drbd: GIT-hash: 0acb7c07a61225ba880fde2a32b8f5f8fa49c8cc build by
root [at] virtfedbi, 2009-09-16 16:01:09
drbd: registered as block device major 147
drbd: minor_table @ 0xffff880824d16800
block drbd0: Starting worker thread (from cqueue [1903])
block drbd0: disk( Diskless -> Attaching )
block drbd0: Found 6 transactions (244 active extents) in activity log.
block drbd0: Method to ensure write ordering: barrier
block drbd0: max_segment_size ( = BIO size ) = 32768
block drbd0: drbd_bm_resize called with capacity == 109317376
block drbd0: resync bitmap: bits=13664672 words=213511
block drbd0: size = 52 GB (54658688 KB)
block drbd0: recounting of set bits took additional 2 jiffies
block drbd0: 920 MB (235520 bits) marked out-of-sync by on disk bit-map.
block drbd0: Marked additional 0 KB as out-of-sync based on AL.
end_request: I/O error, dev cciss/c0d0, sector 0
block drbd0: meta data flush failed with status -95, disabling md-flushes
block drbd0: disk( Attaching -> Inconsistent )
block drbd0: conn( StandAlone -> Unconnected )
block drbd0: Starting receiver thread (from drbd0_worker [1911])
block drbd0: receiver (re)started
block drbd0: conn( Unconnected -> WFConnection )
block drbd0: Handshake successful: Agreed network protocol version 91
block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
block drbd0: conn( WFConnection -> WFReportParams )
block drbd0: Starting asender thread (from drbd0_receiver [1931])
block drbd0: data-integrity-alg: <not-used>
block drbd0: drbd_sync_handshake:
block drbd0: self
7FC155C9F5183158:0000000000000000:9EB0CCB7634CBCDC:A0332E51B243BEE1
bits:235520 flags:0
block drbd0: peer
81BAD3F384A6F3C7:7FC155C9F5183159:13247E4B98A2B256:71E806A9BE572C29
bits:308176 flags:0
block drbd0: uuid_compare()=-1 by rule 50
block drbd0: Becoming sync target due to disk states.
block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT )
pdsk( DUnknown -> UpToDate )
block drbd0: role( Secondary -> Primary )
block drbd0: conn( WFBitMapT -> WFSyncUUID )
block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit
code 0 (0x0)
block drbd0: conn( WFSyncUUID -> SyncTarget )
block drbd0: Began resync as SyncTarget (will sync 1232704 KB [308176 bits
set]).
block drbd0: write: error=-95 s=39967568s
block drbd0: Method to ensure write ordering: flush
end_request: I/O error, dev cciss/c0d0, sector 0
block drbd0: local disk flush failed with status -95
block drbd0: Method to ensure write ordering: drain
block drbd0: Resync done (total 18 sec; paused 0 sec; 68480 K/sec)
block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate
)
block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit
code 0 (0x0)

Do the messages about "local disk flush failed with status -95" suggest
applying all of these changes anyway:
no-disk-barrier;
no-disk-flushes;
no-md-flushes;

Are there any drawbacks to these?
Let's see if it is stable now....

Gianluca



lars.ellenberg at linbit

Sep 16, 2009, 8:23 AM

Post #15 of 16 (2043 views)
Re: recovering from "Local IO failed. Detaching..." [In reply to]

On Wed, Sep 16, 2009 at 04:20:12PM +0200, Gianluca Cecchi wrote:
> With drbd git installed on peer and rebooting it, while maintaining the
> source as 8.3.3rc2 it succeeds in synchronization now.

ok.

> Do the messages regarding "local disk flush failed with status -95" suggest
> that I should apply all of these changes anyway:
> no-disk-barrier;
> no-disk-flushes;
> no-md-flushes;

yes.

> Are there any drawbacks to these?

well, since barriers or flushes apparently are not supported
(95: EOPNOTSUPP), DRBD will disable them after the first failure,
effectively enabling the "no-*" options above, falling back to
"drain" method.

the regression has been in this "detect first barrier failure,
retry and enable fallback" code. I tried to fix a potential endless
loop, thereby introducing that regression by a typo.
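
To make the fallback described above concrete, here is a minimal toy model in Python. It is not DRBD's kernel code; the class name `WriteOrdering`, the `supported` parameter, and the simulated lower layer are all assumptions made for illustration. The method names mirror the "Method to ensure write ordering" log messages in this thread, and the sketch only shows the idea of "on the first -EOPNOTSUPP, step down to the next weaker method and retry", with a loop that is bounded because the last method always succeeds.

```python
import errno

# Ordered from strongest to weakest; names mirror the log messages in this thread.
METHODS = ["barrier", "flush", "drain"]


class WriteOrdering:
    """Toy model of "detect first failure, retry and enable fallback"."""

    def __init__(self, supported):
        # `supported` is a hypothetical set of methods the backing device accepts.
        # "drain" is always available in this model, so the fallback terminates.
        self.supported = set(supported) | {"drain"}
        self.method = METHODS[0]

    def _attempt(self, method):
        # Simulated lower layer: unsupported methods fail with -EOPNOTSUPP.
        return 0 if method in self.supported else -errno.EOPNOTSUPP

    def submit(self):
        """Try the current method; on -EOPNOTSUPP step down and retry."""
        while True:
            status = self._attempt(self.method)
            if status == 0:
                return self.method
            if status != -errno.EOPNOTSUPP:
                raise OSError(-status, "unexpected I/O error")
            # Each retry moves exactly one step down the list, so the loop
            # runs at most len(METHODS) times.
            self.method = METHODS[METHODS.index(self.method) + 1]


if __name__ == "__main__":
    # A device that supports neither barriers nor flushes ends up on "drain",
    # matching the barrier -> flush -> drain sequence seen in the logs.
    print(WriteOrdering(supported=[]).submit())
```

The downgrade is remembered in `self.method`, so later calls start directly at the method that worked, which is the same effect as setting the "no-*" options by hand.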

> Let's see if it is stable now...

keep us posted.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com



gianluca.cecchi at gmail

Sep 16, 2009, 9:34 AM

Post #16 of 16 (2062 views)
Re: recovering from "Local IO failed. Detaching..." [In reply to]

On Wed, Sep 16, 2009 at 5:23 PM, Lars Ellenberg
<lars.ellenberg [at] linbit> wrote:

> [snip]
>
> keep us posted.
>
>
BTW, after the drbd.conf modifications I no longer get the message

Sep 16 16:22:29 virtfedbis kernel: end_request: I/O error, dev cciss/c0d0, sector 0

inside the block:

Sep 16 16:22:29 virtfedbis kernel: block drbd0: Began resync as SyncTarget (will sync 1232704 KB [308176 bits set]).
Sep 16 16:22:29 virtfedbis kernel: block drbd0: write: error=-95 s=39967568s
Sep 16 16:22:29 virtfedbis kernel: block drbd0: Method to ensure write ordering: flush
Sep 16 16:22:29 virtfedbis kernel: end_request: I/O error, dev cciss/c0d0, sector 0
Sep 16 16:22:29 virtfedbis kernel: block drbd0: local disk flush failed with status -95
Sep 16 16:22:29 virtfedbis kernel: block drbd0: Method to ensure write ordering: drain
Sep 16 16:22:29 virtfedbis kernel: block drbd0: Resync done (total 18 sec; paused 0 sec; 68480 K/sec)
Sep 16 16:22:29 virtfedbis kernel: block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )

So it was indeed generated by drbd testing barriers/flushes. Now I have
only:

Sep 16 18:25:09 virtfedbis kernel: block drbd0: Began resync as SyncTarget (will sync 942080 KB [235520 bits set]).
Sep 16 18:25:25 virtfedbis kernel: block drbd0: Resync done (total 15 sec; paused 0 sec; 62804 K/sec)
Sep 16 18:25:25 virtfedbis kernel: block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
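
For anyone who wants to check their own logs for the same pattern, the following is a small Python sketch, not an official DRBD tool. It assumes one syslog entry per line (not wrapped by a mail client), and the function name `summarize` and the two regular expressions are illustrative choices. The errno name is hard-coded from the explanation earlier in this thread (95: EOPNOTSUPP) rather than looked up on the running system.

```python
import re

# Per the thread: status -95 is EOPNOTSUPP (operation not supported).
ERRNO_NAMES = {95: "EOPNOTSUPP"}

ORDERING_RE = re.compile(r"block (drbd\d+): Method to ensure write ordering: (\w+)")
STATUS_RE = re.compile(r"block (drbd\d+): .*(?:error=|status )-(\d+)")


def summarize(log_text):
    """Return (write-ordering methods seen, error names seen), in log order."""
    methods, errors = [], []
    for line in log_text.splitlines():
        ordering = ORDERING_RE.search(line)
        if ordering:
            methods.append(ordering.group(2))
            continue
        status = STATUS_RE.search(line)
        if status:
            code = int(status.group(2))
            # Unknown codes are reported as their bare number.
            errors.append(ERRNO_NAMES.get(code, str(code)))
    return methods, errors


SAMPLE = """\
Sep 16 16:22:29 virtfedbis kernel: block drbd0: write: error=-95 s=39967568s
Sep 16 16:22:29 virtfedbis kernel: block drbd0: Method to ensure write ordering: flush
Sep 16 16:22:29 virtfedbis kernel: block drbd0: local disk flush failed with status -95
Sep 16 16:22:29 virtfedbis kernel: block drbd0: Method to ensure write ordering: drain
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

On the lines from before the drbd.conf change this reports the flush and drain transitions together with two EOPNOTSUPP failures; on the lines from after the change both lists come back empty.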

Gianluca
