
Mailing List Archive: DRBD: Users

Concurrent writes

 

 



parakie at gmail

Apr 15, 2009, 12:06 PM

Post #1 of 14 (2681 views)
Permalink
Concurrent writes

I've been seeing "Concurrent local write" messages under certain workloads
and environments - in particular this has been observed with a SCST target
running on top of DRBD, and benchmarks running on both bare windows
initiators and windows virtuals through ESX initiator. What does this
warning actually entail, and why would it happen?

Thanks,

-Gennadiy


lars.ellenberg at linbit

Apr 16, 2009, 12:31 AM

Post #2 of 14 (2585 views)
Permalink
Re: Concurrent writes [In reply to]

On Wed, Apr 15, 2009 at 03:06:39PM -0400, Gennadiy Nerubayev wrote:
> I've been seeing "Concurrent local write" messages under certain workloads
> and environments - in particular this has been observed with a SCST target
> running on top of DRBD, and benchmarks running on both bare windows
> initiators and windows virtuals through ESX initiator. What does this
> warning actually entail, and why would it happen?

if there is a write request A in flight (submitted, but not yet completed)
to offset a, with size x, and while it is still in flight
another write request B is submitted to offset b with size y,
and these requests overlap,

that is a "concurrent local write".
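
A minimal sketch of the overlap test described above, in Python. The function name and the half-open byte-range convention are illustrative assumptions, not DRBD's actual code:

```python
def overlaps(offset_a, size_a, offset_b, size_b):
    """Return True if [offset_a, offset_a + size_a) and
    [offset_b, offset_b + size_b) share at least one byte.

    Hypothetical helper for illustration only.
    """
    return offset_a < offset_b + size_b and offset_b < offset_a + size_a


# Request B starts inside request A, so the two overlap.
print(overlaps(0, 8192, 4096, 8192))
# Request B starts exactly where A ends, so they only touch.
print(overlaps(0, 4096, 4096, 4096))
```

Two requests that merely touch at a boundary are not treated as overlapping here, since they write disjoint bytes.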

layers below DRBD may reorder writes, so for overlapping requests
the final content depends on which one reaches the disk last.

which means these workloads violate write ordering constraints.

problem:
as DRBD replicates the requests, these writes might get reordered on the
other node as well, so they may end up on the lower level device in a
different order.

as they do overlap, the resulting data on the two replicas
may end up being different.

submitting a new write request overlapping with an in flight write
request is bad practice on any IO subsystem, as it may violate write
ordering, and the result in general is undefined.

with DRBD in particular, it may even cause data divergence of the
replicas, _if_ the layers below DRBD on the two nodes decide to order
these two requests differently. The likelihood of which is difficult to guess.
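
To illustrate how reordering can make replicas diverge, here is a toy simulation in Python. It models each node's disk as a small byte buffer and applies the same two overlapping writes in a different order on each node. The helper name, buffer size and payloads are made up for the example:

```python
def apply_writes(disk_size, writes):
    """Apply (offset, data) writes in the given order to a zeroed buffer."""
    disk = bytearray(disk_size)
    for offset, data in writes:
        disk[offset:offset + len(data)] = data
    return bytes(disk)


write_a = (0, b"AAAA")  # request A: offset 0, 4 bytes
write_b = (2, b"BBBB")  # request B: offset 2, 4 bytes, overlaps A in bytes 2..3

# Node 1 keeps submission order, node 2 reorders the two requests.
node1 = apply_writes(8, [write_a, write_b])
node2 = apply_writes(8, [write_b, write_a])

print(node1 == node2)  # the overlapping bytes differ between the nodes
```

With non-overlapping writes the order would not matter, which is why only overlapping in-flight requests are a problem.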

in short: DRBD detects that the layer using it is broken.

most likely it is simply the windows io stack that is broken,
as the initiators and targets involved simply forward the requests
issued by the windows file system and block device layer.

DRBD cannot help you with that. It simply is the only IO stack paranoid
enough to actually _check_ for that condition and report it,
because DRBD promises to create _exact_, bitwise identical, replicas,
and this condition in general may cause data divergence.

--
: Lars Ellenberg
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


parakie at gmail

Apr 17, 2009, 3:52 PM

Post #3 of 14 (2561 views)
Permalink
Re: Concurrent writes [In reply to]


Thanks for the detailed information. I've been able to confirm that it only
happens when blockio mode is used, with both SCST and IET, on either ESX
(windows or linux virtuals) or Windows as initiator. I've also observed that
it only apparently happens during random io workloads, and not sequential.
What I'm still trying to understand is the following:

1. Whether the write concurrency (obviously without warnings) would still
happen on the bare disk if you removed the DRBD layer, which seems unlikely
given how many use cases involve blockio targets, both with and without
DRBD
2. What's the worst case scenario (lost write? corrupt data? unknown
consistency?) that can result from concurrent writes?
3. How would one be able to verify that #2 happened?
4. I'd think others would see concurrency warnings as well, given how
common this usage scenario is (and google does show a few hits), but
nobody has yet reported an actual problem.

Thanks,

-Gennadiy


lars.ellenberg at linbit

Apr 20, 2009, 3:13 AM

Post #4 of 14 (2523 views)
Permalink
Re: Concurrent writes [In reply to]

On Fri, Apr 17, 2009 at 06:52:00PM -0400, Gennadiy Nerubayev wrote:
> Thanks for the detailed information. I've been able to confirm that it only
> happens when blockio mode is used,

when using fileio, these writes pass through the linux page cache first,
which will "filter out" such "double writes" to the same area.
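
A toy model of why a write-back cache hides such double writes, sketched in Python. This is not the Linux page cache implementation; the class and its fields are hypothetical. A second write to the same page only replaces the cached copy, so the device sees a single write when the cache is flushed:

```python
class ToyPageCache:
    """Keeps only the latest contents per page until flush() is called."""

    def __init__(self):
        self.dirty = {}   # page index -> latest page contents
        self.issued = []  # writes actually handed to the block device

    def write(self, page_index, data):
        # A later write to the same page overwrites the cached copy in memory.
        self.dirty[page_index] = data

    def flush(self):
        # Each dirty page is written out exactly once.
        for page_index in sorted(self.dirty):
            self.issued.append((page_index, self.dirty[page_index]))
        self.dirty.clear()


cache = ToyPageCache()
cache.write(7, b"first version")
cache.write(7, b"second version")  # "double write" to the same area
cache.flush()
print(cache.issued)
```

With blockio there is no such intermediate layer, so both writes reach DRBD as separate requests.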

> with both SCST and IET, on either ESX
> (windows or linux virtuals) or Windows as initiator. I've also observed that
> it only apparently happens during random io workloads, and not sequential.
> What I'm still trying to understand is the following:
>
> 1. Whether the write concurrency (obviously without warnings) would still
> happen to the bare disk if you remove the DRBD layer, which seems unlikely
> due to so many usage cases involving blockio targets, both with and without
> DRBD

I'm pretty much certain that these do happen without DRBD,
and go unnoticed.

> 2. What's the worst case scenario (lost write? corrupt data? unknown
> consistency?) that can result from concurrent writes?

DRBD currently _drops_ the later write.
if a write is detected as a concurrent local write,
this one is neither submitted nor sent, and just "completed",
pretending that it had been successfully written.

we considered _failing_ such writes with EIO,
but decided to complain loudly and pretend success instead.

in short: if DRBD detects a "concurrent local write",
that write is lost.

> 3. How would one be able to verify that #2 happened?

low level compare with known good data.
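
One way such a low level compare could look, as a rough Python sketch. It assumes you have a reference image of the known good data and compares it block by block against the device (or an image of it). The function name and the 4096-byte block size are arbitrary choices for the example:

```python
def differing_blocks(path_a, path_b, block_size=4096):
    """Yield the index of every block whose contents differ between two files.

    Hypothetical helper; path_a / path_b may be image files or block devices.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        index = 0
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if not block_a and not block_b:
                break  # both inputs exhausted
            if block_a != block_b:
                yield index
            index += 1
```

Usage would be along the lines of `list(differing_blocks("known_good.img", "/dev/drbd0"))`, where both paths are placeholders. Reading a device that is still being written to will of course show transient differences.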

> 4. I'd think others would report concurrency warnings as well due to the
> relatively common usage scenario (and google does show a few hits), but
> people have yet to actually report an actual problem..


--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com



lars.ellenberg at linbit

Apr 20, 2009, 3:28 AM

Post #5 of 14 (2527 views)
Permalink
Re: Concurrent writes [In reply to]

On Mon, Apr 20, 2009 at 12:13:58PM +0200, Lars Ellenberg wrote:
> > 2. What's the worst case scenario (lost write? corrupt data? unknown
> > consistency?) that can result from concurrent writes?
>
> DRBD currently _drops_ the later write.
> if a write is detected as a concurrent local write,
> this one is neither submitted nor sent, and just "completed",
> pretending that it had been successfully written.
>
> we considered _failing_ such writes with EIO,
> but decided to complain loudly and pretend success instead.

btw.

please post a few of the original log lines,
they should read something like
<comm>[pid] Concurrent local write detected!
[DISCARD L] new: <sector offset>s +<size in bytes>;
pending: <sector offset>s +<size in bytes>

I'm curious as to what the actual overlap is,
and whether there is any correlation between offsets.


--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com



parakie at gmail

Apr 20, 2009, 4:05 PM

Post #6 of 14 (2522 views)
Permalink
Re: Concurrent writes [In reply to]

On Mon, Apr 20, 2009 at 6:28 AM, Lars Ellenberg
<lars.ellenberg [at] linbit>wrote:

> On Mon, Apr 20, 2009 at 12:13:58PM +0200, Lars Ellenberg wrote:
> >
> > > 2. What's the worst case scenario (lost write? corrupt data? unknown
> > > consistency?) that can result from concurrent writes?
> >
> > DRBD currently _drops_ the later write.
> > if a write is detected as a concurrent local write,
> > this one is neither submitted nor sent, and just "completed",
> > pretending that it had been successfully written.
> >
> > we considered _failing_ such writes with EIO,
> > but decided to complain loudly and pretend success instead.
>

So to clarify, in a typical scenario an initiator should not be issuing a
write request while one for the same (or overlapping) block is not yet
returned by DRBD as successful, however that's what happens? Is it at all
possible that DRBD returns success earlier than it should have (obviously
I'm using protocol C)?

> please post a few of the original log lines,
> they should read something like
> <comm>[pid] Concurrent local write detected!
> [DISCARD L] new: <sector offset>s +<size in bytes>;
> pending: <sector offset>s +<size in bytes>
>
> I'm curious as to what the actual overlap is,
> and whether there is any correlation between offsets.


Here's an example for 8k random writes:

Apr 14 12:24:35 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
detected! [DISCARD L] new: 162328976s +8192; pending: 162328976s +8192
Apr 14 12:24:38 srpt1 kernel: drbd0: scsi_tgt0[17271] Concurrent local write
detected! [DISCARD L] new: 161385248s +8192; pending: 161385248s +8192
Apr 14 12:24:39 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
detected! [DISCARD L] new: 157655888s +8192; pending: 157655888s +8192
Apr 14 12:24:39 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
detected! [DISCARD L] new: 165753872s +8192; pending: 165753872s +8192
Apr 14 12:24:39 srpt1 kernel: drbd0: scsi_tgt0[17271] Concurrent local write
detected! [DISCARD L] new: 166654816s +8192; pending: 166654816s +8192
Apr 14 12:24:40 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
detected! [DISCARD L] new: 158260592s +8192; pending: 158260592s +8192
Apr 14 12:24:40 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
detected! [DISCARD L] new: 163944704s +8192; pending: 163944704s +8192
Apr 14 12:24:49 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
detected! [DISCARD L] new: 169511744s +8192; pending: 169511744s +8192
Apr 14 12:24:51 srpt1 kernel: drbd0: scsi_tgt0[17271] Concurrent local write
detected! [DISCARD L] new: 170614416s +8192; pending: 170614416s +8192
Apr 14 12:24:52 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
detected! [DISCARD L] new: 158642368s +8192; pending: 158642368s +8192

128k:

Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt0[21193] Concurrent local write
detected! [DISCARD L] new: 562689092s +28672; pending: 562689144s +4096
Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
detected! [DISCARD L] new: 562689172s +20480; pending: 562689152s +32768
Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt0[21193] Concurrent local write
detected! [DISCARD L] new: 562689148s +2048; pending: 562689144s +4096
Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
detected! [DISCARD L] new: 562689212s +2048; pending: 562689152s +32768
Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt0[21193] Concurrent local write
detected! [DISCARD L] new: 562689152s +2048; pending: 562689152s +32768
Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
detected! [DISCARD L] new: 562689216s +2048; pending: 562689216s +24576
Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt0[21193] Concurrent local write
detected! [DISCARD L] new: 562689156s +8192; pending: 562689152s +32768
Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
detected! [DISCARD L] new: 562689220s +28672; pending: 562689264s +8192
Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt0[21193] Concurrent local write
detected! [DISCARD L] new: 562689292s +8192; pending: 562689280s +32768
Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
detected! [DISCARD L] new: 562689276s +2048; pending: 562689264s +8192
Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
detected! [DISCARD L] new: 562689280s +2048; pending: 562689280s +32768
Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
detected! [DISCARD L] new: 562689284s +4096; pending: 562689280s +32768

I know it shows two SCST threads, but it's the same thing even if I disable
SCST threading (not to mention having the same thing happen with IET).

As I was about to send this email, I made another discovery: Concurrent local
writes do not happen when the DRBD device is disconnected. As soon as I
reconnect, they reappear; as mentioned above, this is with protocol C.
Do we need a protocol D now? :p

Thanks,

-Gennadiy


lars.ellenberg at linbit

Apr 21, 2009, 1:37 AM

Post #7 of 14 (2523 views)
Permalink
Re: Concurrent writes [In reply to]

On Mon, Apr 20, 2009 at 07:05:31PM -0400, Gennadiy Nerubayev wrote:
> On Mon, Apr 20, 2009 at 6:28 AM, Lars Ellenberg
> <lars.ellenberg [at] linbit>wrote:
>
> > On Mon, Apr 20, 2009 at 12:13:58PM +0200, Lars Ellenberg wrote:
> > >
> > > > 2. What's the worst case scenario (lost write? corrupt data? unknown
> > > > consistency?) that can result from concurrent writes?
> > >
> > > DRBD currently _drops_ the later write.
> > > if a write is detected as a concurrent local write,
> > > this one is neither submitted nor sent, and just "completed",
> > > pretending that it had been successfully written.
> > >
> > > we considered _failing_ such writes with EIO,
> > > but decided to complain loudly and pretend success instead.
> >
>
> So to clarify, in a typical scenario an initiator should not be issuing a
> write request while one for the same (or overlapping) block is not yet
> returned by DRBD as successful, however that's what happens?

yes.

possibly the target "announces" the equivalent of "tagged command
queueing" in iSCSI, and the initiator tries to take advantage of that,
but either target or initiator implement that incorrectly.
not sure how to verify this assumption, maybe using wireshark on the
iSCSI layer (which would also be a way to get to the actual data
of the overlapping requests).

> Is it at all possible that DRBD returns success earlier than it should
> have (obviously I'm using protocol C)?

No.
also DRBD protocol choice does not make a difference in this context.

simplified, the detection of these overlapping requests happens within
DRBD by a list walk. "pending" request objects get unlinked from these
lists before they are completed to upper layers.
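
A simplified sketch of that idea in Python, not DRBD's actual data structures: in-flight writes sit on a list, a new write is checked against every entry, and an entry is unlinked before its completion is reported upward. The class and method names are hypothetical:

```python
class PendingWrites:
    """Toy model of a list of in-flight writes, in units of sectors."""

    def __init__(self):
        self.pending = []  # list of (start_sector, nr_sectors)

    def submit(self, sector, nr_sectors):
        """Return False (write discarded) if it overlaps an in-flight write."""
        for p_sector, p_nr in self.pending:  # the "list walk"
            if sector < p_sector + p_nr and p_sector < sector + nr_sectors:
                return False
        self.pending.append((sector, nr_sectors))
        return True

    def complete(self, sector, nr_sectors):
        """Unlink the request; only after this is completion signalled upward."""
        self.pending.remove((sector, nr_sectors))


queue = PendingWrites()
print(queue.submit(100, 16))  # accepted, now in flight
print(queue.submit(108, 16))  # overlaps the in-flight write, discarded
queue.complete(100, 16)
print(queue.submit(108, 16))  # accepted, nothing overlapping is pending
```

Because a request is unlinked before the upper layer is told it completed, a write submitted after that completion can never be flagged against it.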


> > please post a few of the original log lines,
> > they should read something like
> > <comm>[pid] Concurrent local write detected!
> > [DISCARD L] new: <sector offset>s +<size in bytes>;
> > pending: <sector offset>s +<size in bytes>
> >
> > I'm curious as to what the actual overlap is,
> > and whether there is any correlation between offsets.
>
>
> Here's an example for 8k random writes:
>
> Apr 14 12:24:35 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
> detected! [DISCARD L] new: 162328976s +8192; pending: 162328976s +8192
> Apr 14 12:24:38 srpt1 kernel: drbd0: scsi_tgt0[17271] Concurrent local write
> detected! [DISCARD L] new: 161385248s +8192; pending: 161385248s +8192
> Apr 14 12:24:39 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
> detected! [DISCARD L] new: 157655888s +8192; pending: 157655888s +8192
> Apr 14 12:24:39 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
> detected! [DISCARD L] new: 165753872s +8192; pending: 165753872s +8192
> Apr 14 12:24:39 srpt1 kernel: drbd0: scsi_tgt0[17271] Concurrent local write
> detected! [DISCARD L] new: 166654816s +8192; pending: 166654816s +8192
> Apr 14 12:24:40 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
> detected! [DISCARD L] new: 158260592s +8192; pending: 158260592s +8192
> Apr 14 12:24:40 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
> detected! [DISCARD L] new: 163944704s +8192; pending: 163944704s +8192
> Apr 14 12:24:49 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
> detected! [DISCARD L] new: 169511744s +8192; pending: 169511744s +8192
> Apr 14 12:24:51 srpt1 kernel: drbd0: scsi_tgt0[17271] Concurrent local write
> detected! [DISCARD L] new: 170614416s +8192; pending: 170614416s +8192
> Apr 14 12:24:52 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
> detected! [DISCARD L] new: 158642368s +8192; pending: 158642368s +8192

so new and pending requests are in fact the very same area.
interesting.

> 128k:
>
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt0[21193] Concurrent local write
> detected! [DISCARD L] new: 562689092s +28672; pending: 562689144s +4096
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
> detected! [DISCARD L] new: 562689172s +20480; pending: 562689152s +32768
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt0[21193] Concurrent local write
> detected! [DISCARD L] new: 562689148s +2048; pending: 562689144s +4096
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
> detected! [DISCARD L] new: 562689212s +2048; pending: 562689152s +32768
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt0[21193] Concurrent local write
> detected! [DISCARD L] new: 562689152s +2048; pending: 562689152s +32768
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
> detected! [DISCARD L] new: 562689216s +2048; pending: 562689216s +24576
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt0[21193] Concurrent local write
> detected! [DISCARD L] new: 562689156s +8192; pending: 562689152s +32768
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
> detected! [DISCARD L] new: 562689220s +28672; pending: 562689264s +8192
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt0[21193] Concurrent local write
> detected! [DISCARD L] new: 562689292s +8192; pending: 562689280s +32768
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
> detected! [DISCARD L] new: 562689276s +2048; pending: 562689264s +8192
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
> detected! [DISCARD L] new: 562689280s +2048; pending: 562689280s +32768
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
> detected! [DISCARD L] new: 562689284s +4096; pending: 562689280s +32768

these are overlapping, partially or completely.
some of the "pending" offset/size tuples occur repeatedly.

also interesting.
I'm not sure though what we can make of that information.
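
The overlap in each of the quoted log lines can be worked out with a small parser. The following Python sketch assumes the offset is given in 512-byte sectors and the size in bytes, as the "<sector offset>s +<size in bytes>" format suggests, and that each log entry has been rejoined onto a single line (the archive wraps them). The regular expression and function name are illustrative:

```python
import re

SECTOR_SIZE = 512  # bytes per sector; an assumption, not stated in the log

LOG_RE = re.compile(
    r"new: (?P<new_off>\d+)s \+(?P<new_size>\d+); "
    r"pending: (?P<pend_off>\d+)s \+(?P<pend_size>\d+)"
)


def overlap_sectors(line):
    """Return (first_sector, sector_count) of the overlap, or None."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    new_start = int(m["new_off"])
    new_end = new_start + int(m["new_size"]) // SECTOR_SIZE
    pend_start = int(m["pend_off"])
    pend_end = pend_start + int(m["pend_size"]) // SECTOR_SIZE
    start = max(new_start, pend_start)
    end = min(new_end, pend_end)
    if start >= end:
        return None
    return start, end - start


line = ("drbd0: scsi_tgt0[21193] Concurrent local write detected! "
        "[DISCARD L] new: 562689092s +28672; pending: 562689144s +4096")
print(overlap_sectors(line))
```

For the line above, the new request covers sectors 562689092 to 562689147 and the pending one 562689144 to 562689151, so they share four sectors (2 KiB).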

> I know it shows two SCST threads, but it's the same thing even if I disable
> SCST threading (not to mention having the same thing happen with IET).
>
> As I was about to send this email, I got another discovery: Concurrent local
> writes do not happen when the DRBD device is disconnected.
>
> As soon as I reconnect, they reappear, and this is as mentioned above
> using protocol C.

sorry to disappoint you.
they are not checked for when disconnected ;(

data divergence due to conflicting (overlapping)
writes cannot happen when DRBD is not connected.
so in this case DRBD does not care.

the user is allowed to submit as much garbage to DRBD as it wants to.
DRBD "only" replicates whatever is submitted, and makes sure that
during normal operation, both replicas are bitwise identical.

That is the reason why DRBD complains loudly about conditions which make
this impossible in the general case, and enables "workarounds" so we
can uphold the "bitwise identical" promise, even if that means we have
to drop such conflicting writes.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com



parakie at gmail

Apr 21, 2009, 7:02 AM

Post #8 of 14 (2515 views)
Permalink
Re: Concurrent writes [In reply to]

On Tue, Apr 21, 2009 at 4:37 AM, Lars Ellenberg
<lars.ellenberg [at] linbit>wrote:

> possibly the target "announces" the equivalent of "tagged command
> queueing" in iSCSI, and the initiator tries to take advantage of that,
> but either target or initiator implement that incorrectly.
> not sure how to verify this assumption, maybe using wireshark on the
> iSCSI layer (which would also be a way to get to the actual data
> of the overlapping requests).


I'll try to get some more info about this, but I'm currently completely out
of ideas :(

> data divergence due to conflicting (overlapping)
> writes cannot happen when DRBD is not connected.
> so in this case DRBD does not care.


Gah. But wait, you mentioned in the first email that "submitting a new write
request overlapping with an in flight write request is bad practice on any
IO subsystem, as it may violate write ordering, and the result in general is
undefined". So why don't we care about it in the standalone mode? Why can't
it happen when DRBD is disconnected? And if it can, why doesn't it cause
data corruption? I'm still trying to understand why this is not causing
issues for so many people that are running IET in blockio mode on standalone
targets (including those built on IET such as openfiler), yet when DRBD is
introduced, we run into this situation.

Sorry if it seems like I'm trying to single out DRBD as the culprit, but I
can't quite grasp why this only appears to be a problem on DRBD (paranoia
checking for the condition aside), and that the problem is big enough to
discard writes.

Thanks,

-Gennadiy


lars.ellenberg at linbit

Apr 21, 2009, 11:01 AM

Post #9 of 14 (2520 views)
Permalink
Re: Concurrent writes [In reply to]

On Tue, Apr 21, 2009 at 10:02:33AM -0400, Gennadiy Nerubayev wrote:
> > data divergence due to conflicting (overlapping)
> > writes cannot happen when DRBD is not connected.
> > so in this case DRBD does not care.
>
>
> Gah. But wait, you mentioned in the first email that "submitting a new write
> request overlapping with an in flight write request is bad practice on any
> IO subsystem, as it may violate write ordering, and the result in general is
> undefined". So why don't we care about it in the standalone mode? Why can't
> it happen when DRBD is disconnected?

of course it does happen.
but the possible data divergence due to different reordering on lower
layers cannot happen when we are not even replicating (disconnected).

> And if it can, why doesn't it cause data corruption?

it may, or may not.

my assumption is that it _does_ cause data corruption once in a while,
and no one ever notices.

but while disconnected, it cannot cause the sort of corruption DRBD
cares about primarily: silent divergence of the replicas while DRBD
thinks they should be identical. that is why in the disconnected case,
we have not yet bothered to check for this condition. this is easily
rectified though: we can enable this paranoia check also in
disconnected mode, and voila, there are your kernel alerts again.

DRBD cannot protect you from data corruption.
if you write corrupt data to DRBD, or write data in a manner that may
cause it to end up on disk "unexpected", because of re-ordering of
requests on lower layers, drbd will happily replicate this corruption.
which is by design: DRBD is agnostic to the content of the data it
replicates.

> I'm still trying to understand why this is not causing issues for so
> many people that are running IET in blockio mode on standalone targets
> (including those built on IET such as openfiler), yet when DRBD is
> introduced, we run into this situation.

only DRBD does check these things.
only DRBD drops the latter, "conflicting" write.

> Sorry if it seems like I'm trying to single out DRBD as the culprit, but I
> can't quite grasp why this only appears to be a problem on DRBD (paranoia
> checking for the condition aside), and that the problem is big enough to
> discard writes.

sure, we could work around the brokenness (as circumstantial evidence suggests)
of the windows IO stack in DRBD. that is the beauty of open source.
(feature sponsoring accepted).

it could all be implemented differently.
I just state how it is, and why we did it this way.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com



MRoof at admin

Apr 21, 2009, 11:47 AM

Post #10 of 14 (2522 views)
Permalink
Re: Concurrent writes [In reply to]

This is a really interesting discussion. I use DRBD to replicate
volumes that are exported via blockio with IET and have never gotten
this message. I currently use DRBD 8.0.16 and over the last year with
8.0.x series this message has never appeared. Before deployment all
sorts of io tests were conducted and this message wasn't present then
either.

So, I have an idea for a setting you could change on your iSCSI target
system, and I'm really curious whether the message goes away. In IET we have
"InitialR2T Yes" but the default is "No". Try running that way and
see what happens, as I'm very curious about the results.

-
Morey Roof
Information Services Department
New Mexico Institute of Mining and Technology




-----Original Message-----
From: drbd-user-bounces [at] lists
[mailto:drbd-user-bounces [at] lists] On Behalf Of Lars Ellenberg
Sent: Tuesday, April 21, 2009 12:01 PM
To: drbd-user [at] lists
Subject: Re: [DRBD-user] Concurrent writes

On Tue, Apr 21, 2009 at 10:02:33AM -0400, Gennadiy Nerubayev wrote:
> > data divergence due to conflicting (overlapping) writes cannot
> > happen when DRBD is not connected.
> > so in this case DRBD does not care.
>
>
> Gah. But wait, you mentioned in the first email that "submitting a new
> write request overlapping with an in flight write request is bad
> practice on any IO subsystem, as it may violate write ordering, and
> the result in general is undefined". So why don't we care about it in
> the standalone mode? Why can't it happen when DRBD is disconnected?

of course it does happen.
but the possible data divergence due to different reordering on lower
layers cannot happen when we are not even replicating (disconnected).

> And if it can, why doesn't it cause data corruption?

it may, or may not.

my assumption is that it _does_ cause data corruption once in a while,
and no one ever notices.

but while disconnected, it cannot cause that sort of corruption DRBD
cares about primarily: silent divergence of the replicas while DRBD
thinks they should be identical. that is why in the disconnected case,
we did not bother yet to check for this condition. this is easily
rectified though: we can enable this paranoia check also in
disconnected mode, and voila, there are your kernel alerts again.

DRBD cannot protect you from data corruption.
if you write corrupt data to DRBD, or write data in a manner that may
cause it to end up on disk "unexpected", because of re-ordering of
requests on lower layers, drbd will happily replicate this corruption.
which is by design: DRBD is agnostic to the content of the data it
replicates.

> I'm still trying to understand why this is not causing issues for so
> many people that are running IET in blockio mode on standalone targets
> (including those built on IET such as openfiler), yet when DRBD is
> introduced, we run into this situation.

only DRBD does check these things.
only DRBD drops the latter, "conflicting" write.

> Sorry if it seems like I'm trying to single out DRBD as the culprit,
> but I can't quite grasp why this only appears to be a problem on DRBD
> (paranoia checking for the condition aside), and that the problem is
> big enough to discard writes.

sure, we could work around the brokenness (as circumstantial evidence
suggests) of the windows IO stack in DRBD. that is the beauty of open
source.
(feature sponsoring accepted).

it could all be implemented differently.
I just state how it is, and why we did it this way.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD(r) and LINBIT(r) are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed


parakie at gmail

Apr 21, 2009, 2:18 PM

Post #11 of 14 (2512 views)
Permalink
Re: Concurrent writes [In reply to]

On Tue, Apr 21, 2009 at 2:47 PM, Roof, Morey R. <MRoof [at] admin> wrote:

> This is a really interesting discussion. I use DRBD to replicate
> volumes that are exported via blockio with IET and have never gotten
> this message. I currently use DRBD 8.0.16 and over the last year with
> 8.0.x series this message has never appeared. Before deployment all
> sorts of io tests were conducted and this message wasn't present then
> either.


Thanks for the input. What are the initiators that you're using?

> So, I have an idea of a setting for you to change on your iSCSI target
> system and I'm really curious if the message goes away. In IET we have
> "InitialR2T Yes" but the default is "No". Try running that way and
> see what happens as I'm very curious about the results.
>

Just tried "InitialR2T Yes", and there was no effect; the concurrent write
warnings are still being generated.

As a side note for anyone who would like to test this, on Windows in
particular, here's one of the tools I use to reproduce this:

SQLIO:
http://www.microsoft.com/downloads/details.aspx?familyid=9a8b005b-84e4-4f24-8d65-cb53442d9e19&displaylang=en
Contents of param.txt (this assumes f: is the remote storage, and you have
1 GB of free space available): f:\testfile.dat 4 0x0 1024
Command to run: sqlio -kW -s60 -frandom -o8 -b8 -LS -Fparam.txt
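
For a feel of how often this workload could put two overlapping writes in flight purely by chance, here is a back-of-the-envelope sketch in Python. It rests on assumptions: that the param.txt line means 4 threads and a 1024 MB file, that -o8 and -b8 mean 8 outstanding I/Os per thread and 8 KB blocks, that SQLIO picks block-aligned offsets uniformly at random, and that all outstanding writes really are in flight at the same moment. None of that has been verified against SQLIO itself, and it says nothing about why one setup logs the warning and another does not.

```python
def collision_probability(in_flight, slots):
    """Probability that at least two of `in_flight` uniform random slot picks coincide."""
    p_distinct = 1.0
    for i in range(in_flight):
        p_distinct *= (slots - i) / slots
    return 1.0 - p_distinct


FILE_MB = 1024     # file size from param.txt (assumed to be in MB)
THREADS = 4        # thread count from param.txt
OUTSTANDING = 8    # -o8, assumed to be per thread
BLOCK_KB = 8       # -b8, assumed to be in KB

slots = FILE_MB * 1024 // BLOCK_KB     # distinct block-aligned offsets in the file
in_flight = THREADS * OUTSTANDING      # writes assumed to be in flight together
p = collision_probability(in_flight, slots)

print(f"{slots} slots, {in_flight} writes in flight, P(overlap) ~ {p:.4f}")
```

Under these assumptions the chance is small for any single set of in-flight writes, but a 60 second run replaces that set many thousands of times, so occasional overlaps from a random-write benchmark would not be surprising.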

-Gennadiy


MRoof at admin

Apr 21, 2009, 2:39 PM

Post #12 of 14 (2508 views)
Permalink
Re: Concurrent writes [In reply to]

The initiators I use are: VMware ESX 3.0.x, VMware ESX 3.5.x, Microsoft
iSCSI Initiator 2.08, and Linux (RHEL 5, SUSE 10).

I gave SQLIO a run with your params and I'm not getting the concurrent
write issue.

The hardware I run DRBD on is a pair of HP ProLiant DL380 G4 servers with
12GB of RAM and P600 SAS controllers. The machines are currently
replicating 1.8TB of data. I run DRBD under SuSE 10SP2 and use IET
stock from SuSE but the DRBD is 8.0.16.


-
Morey Roof
Information Services Department
New Mexico Institute of Mining and Technology





parakie at gmail

Apr 21, 2009, 3:09 PM

Post #13 of 14 (2515 views)
Permalink
Re: Concurrent writes [In reply to]

On Tue, Apr 21, 2009 at 5:39 PM, Roof, Morey R. <MRoof [at] admin> wrote:

> The initiators I use are: VMware ESX 3.0.x, VMware ESX 3.5.x, Microsoft
> iSCSI Initiator 2.08, and Linux (RHEL 5, SUSE 10).
>
> I gave SQLIO a run with your params and I'm not getting the concurrent
> write issue.


The bizarreness continues; downgrading DRBD is of no help either :(
It is encouraging in a way, though, since it suggests that this might
finally be isolated. I assume you're checking the logs of the primary node?

> The hardware I run DRBD on is a pair of HP ProLiant DL380 G4 servers with
> 12GB of RAM and P600 SAS controllers. The machines are currently
> replicating 1.8TB of data. I run DRBD under SuSE 10SP2 and use IET
> stock from SuSE but the DRBD is 8.0.16.


What is the kernel version on the target? Version of Windows on the
initiator? Could you share the relevant portions of ietd.conf and drbd.conf?

Thanks,

-Gennadiy


florian.haas at linbit

Oct 5, 2009, 12:16 AM

Post #14 of 14 (1578 views)
Permalink
Re: Concurrent writes [In reply to]

Gennadiy,

I realize it's been a while since this issue was discussed last, but
we're still trying to hunt this down. I realize this is asking a lot,
but do you happen to still have the drbd.conf available from back in
April? It was unfortunately never posted in the thread.

Specifically, I'm curious as to whether you were using dual-Primary mode
at the time (allow-two-primaries).

Thanks!

Cheers,
Florian
