Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Linux: Kernel

[PATCH] Describe race of direct read and fork for unaligned buffers

 

 

First page Previous page 1 2 Next page Last page  View All Linux kernel RSS feed   Index | Next | Previous | View Threaded


jack at suse

Apr 30, 2012, 2:30 AM

Post #1 of 31 (443 views)
Permalink
[PATCH] Describe race of direct read and fork for unaligned buffers

This is a long standing problem (or a surprising feature) in our implementation
of get_user_pages() (used by direct IO). Since several attempts to fix it
failed (e.g.
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg06542.html, or
http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html refused in
http://comments.gmane.org/gmane.linux.kernel.mm/31569) and it's not completely
clear whether we really want to fix it given the costs, let's at least document
it.

CC: mgorman [at] suse
CC: Jeff Moyer <jmoyer [at] redhat>
Signed-off-by: Jan Kara <jack [at] suse>
---

--- a/man2/open.2 2012-04-27 00:07:51.736883092 +0200
+++ b/man2/open.2 2012-04-27 00:29:59.489892980 +0200
@@ -769,7 +769,12 @@
and the file offset must all be multiples of the logical block size
of the file system.
Under Linux 2.6, alignment to 512-byte boundaries
-suffices.
+suffices. However, if the user buffer is not page aligned and direct read
+runs in parallel with a
+.BR fork (2)
+of the reader process, it may happen that the read data is split between
+pages owned by the original process and its child. Thus effectively read
+data is corrupted.
.LP
The
.B O_DIRECT
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


jmoyer at redhat

Apr 30, 2012, 6:41 AM

Post #2 of 31 (442 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

Jan Kara <jack [at] suse> writes:

> This is a long standing problem (or a surprising feature) in our implementation
> of get_user_pages() (used by direct IO). Since several attempts to fix it
> failed (e.g.
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg06542.html, or
> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html refused in
> http://comments.gmane.org/gmane.linux.kernel.mm/31569) and it's not completely
> clear whether we really want to fix it given the costs, let's at least document
> it.
>
> CC: mgorman [at] suse
> CC: Jeff Moyer <jmoyer [at] redhat>
> Signed-off-by: Jan Kara <jack [at] suse>
> ---
>
> --- a/man2/open.2 2012-04-27 00:07:51.736883092 +0200
> +++ b/man2/open.2 2012-04-27 00:29:59.489892980 +0200
> @@ -769,7 +769,12 @@
> and the file offset must all be multiples of the logical block size
> of the file system.
> Under Linux 2.6, alignment to 512-byte boundaries
> -suffices.
> +suffices. However, if the user buffer is not page aligned and direct read
> +runs in parallel with a
> +.BR fork (2)
> +of the reader process, it may happen that the read data is split between
> +pages owned by the original process and its child. Thus effectively read
> +data is corrupted.
> .LP
> The
> .B O_DIRECT

I think this sufficiently distills the problem. Thanks, Jan.

Acked-by: Jeff Moyer <jmoyer [at] redhat>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


mgorman at suse

Apr 30, 2012, 7:30 AM

Post #3 of 31 (439 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

On Mon, Apr 30, 2012 at 11:30:07AM +0200, Jan Kara wrote:
> This is a long standing problem (or a surprising feature) in our implementation
> of get_user_pages() (used by direct IO). Since several attempts to fix it
> failed (e.g.
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg06542.html, or
> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html refused in
> http://comments.gmane.org/gmane.linux.kernel.mm/31569) and it's not completely
> clear whether we really want to fix it given the costs, let's at least document
> it.
>
> CC: mgorman [at] suse
> CC: Jeff Moyer <jmoyer [at] redhat>
> Signed-off-by: Jan Kara <jack [at] suse>

Acked-by: Mel Gorman <mgorman [at] suse>

--
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


mtk.manpages at gmail

Apr 30, 2012, 10:50 PM

Post #4 of 31 (440 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

Jan,

On Mon, Apr 30, 2012 at 9:30 PM, Jan Kara <jack [at] suse> wrote:
> This is a long standing problem (or a surprising feature) in our implementation
> of get_user_pages() (used by direct IO). Since several attempts to fix it
> failed (e.g.
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg06542.html, or
> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html refused in
> http://comments.gmane.org/gmane.linux.kernel.mm/31569) and it's not completely
> clear whether we really want to fix it given the costs, let's at least document
> it.
>
> CC: mgorman [at] suse
> CC: Jeff Moyer <jmoyer [at] redhat>
> Signed-off-by: Jan Kara <jack [at] suse>
> ---
>
> --- a/man2/open.2 2012-04-27 00:07:51.736883092 +0200
> +++ b/man2/open.2 2012-04-27 00:29:59.489892980 +0200
> @@ -769,7 +769,12 @@
> and the file offset must all be multiples of the logical block size
> of the file system.
> Under Linux 2.6, alignment to 512-byte boundaries
> -suffices.
> +suffices. However, if the user buffer is not page aligned and direct read
> +runs in parallel with a
> +.BR fork (2)
> +of the reader process, it may happen that the read data is split between
> +pages owned by the original process and its child. Thus effectively read
> +data is corrupted.
> .LP
> The
> .B O_DIRECT

Thanks. I tweaked the patch slightly, and applied as below.

Cheers,

Michael

--- a/man2/open.2
+++ b/man2/open.2
@@ -49,7 +49,7 @@
.\" FIXME Linux 2.6.33 has O_DSYNC, and a hidden __O_SYNC.
.\" FIXME: Linux 2.6.39 added O_PATH
.\"
-.TH OPEN 2 2012-02-27 "Linux" "Linux Programmer's Manual"
+.TH OPEN 2 2012-05-01 "Linux" "Linux Programmer's Manual"
.SH NAME
open, creat \- open and possibly create a file or device
.SH SYNOPSIS
@@ -768,8 +768,13 @@ operation in
Under Linux 2.4, transfer sizes, and the alignment of the user buffer
and the file offset must all be multiples of the logical block size
of the file system.
-Under Linux 2.6, alignment to 512-byte boundaries
-suffices.
+Under Linux 2.6, alignment to 512-byte boundaries suffices.
+However, if the user buffer is not page-aligned and the direct read
+runs in parallel with a
+.BR fork (2)
+of the reader process, it may happen that the read data is split between
+pages owned by the original process and its child.
+Thus the read data is effectively corrupted.
.LP
The
.B O_DIRECT


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


npiggin at gmail

Apr 30, 2012, 11:49 PM

Post #5 of 31 (439 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

On 30 April 2012 19:30, Jan Kara <jack [at] suse> wrote:
> This is a long standing problem (or a surprising feature) in our implementation
> of get_user_pages() (used by direct IO). Since several attempts to fix it
> failed (e.g.
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg06542.html, or
> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html refused in
> http://comments.gmane.org/gmane.linux.kernel.mm/31569) and it's not completely
> clear whether we really want to fix it given the costs, let's at least document
> it.

In any case, it should be documented even if it is ever fixed in newer
kernels. Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


kosaki.motohiro at gmail

May 1, 2012, 7:31 AM

Post #6 of 31 (432 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

On Mon, Apr 30, 2012 at 5:30 AM, Jan Kara <jack [at] suse> wrote:
> This is a long standing problem (or a surprising feature) in our implementation
> of get_user_pages() (used by direct IO). Since several attempts to fix it
> failed (e.g.
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg06542.html, or
> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html refused in
> http://comments.gmane.org/gmane.linux.kernel.mm/31569) and it's not completely
> clear whether we really want to fix it given the costs, let's at least document
> it.
>
> CC: mgorman [at] suse
> CC: Jeff Moyer <jmoyer [at] redhat>
> Signed-off-by: Jan Kara <jack [at] suse>
> ---
>
> --- a/man2/open.2 2012-04-27 00:07:51.736883092 +0200
> +++ b/man2/open.2 2012-04-27 00:29:59.489892980 +0200
> @@ -769,7 +769,12 @@
> and the file offset must all be multiples of the logical block size
> of the file system.
> Under Linux 2.6, alignment to 512-byte boundaries
> -suffices.
> +suffices. However, if the user buffer is not page aligned and direct read
> +runs in parallel with a
> +.BR fork (2)
> +of the reader process, it may happen that the read data is split between
> +pages owned by the original process and its child. Thus effectively read
> +data is corrupted.
> .LP
> The
> .B O_DIRECT

Hello,

Thank you revisit this. But as far as my remember is correct, this issue is NOT
unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
DIRECT_IO w/ multi thread process should not use fork().
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


kosaki.motohiro at gmail

May 1, 2012, 7:37 AM

Post #7 of 31 (433 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

> Hello,
>
> Thank you revisit this. But as far as my remember is correct, this issue is NOT
> unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
> DIRECT_IO w/ multi thread process should not use fork().

The problem is, fork (and its COW logic) assume new access makes cow break,
But page table protection can't detect a DMA write. Therefore DIO may override
shared page data.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


jmoyer at redhat

May 1, 2012, 8:11 AM

Post #8 of 31 (431 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

KOSAKI Motohiro <kosaki.motohiro [at] gmail> writes:

>> Hello,
>>
>> Thank you revisit this. But as far as my remember is correct, this issue is NOT
>> unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
>> DIRECT_IO w/ multi thread process should not use fork().
>
> The problem is, fork (and its COW logic) assume new access makes cow break,
> But page table protection can't detect a DMA write. Therefore DIO may override
> shared page data.

Hm, I've only seen this with misaligned or multiple sub-page-sized reads
in the same page. AFAIR, aligned, page-sized I/O does not get split.
But, I could be wrong...

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


kosaki.motohiro at gmail

May 1, 2012, 8:34 AM

Post #9 of 31 (433 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

On Tue, May 1, 2012 at 11:11 AM, Jeff Moyer <jmoyer [at] redhat> wrote:
> KOSAKI Motohiro <kosaki.motohiro [at] gmail> writes:
>
>>> Hello,
>>>
>>> Thank you revisit this. But as far as my remember is correct, this issue is NOT
>>> unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
>>> DIRECT_IO w/ multi thread process should not use fork().
>>
>> The problem is, fork (and its COW logic) assume new access makes cow break,
>> But page table protection can't detect a DMA write. Therefore DIO may override
>> shared page data.
>
> Hm, I've only seen this with misaligned or multiple sub-page-sized reads
> in the same page. AFAIR, aligned, page-sized I/O does not get split.
> But, I could be wrong...

If my remember is correct, the reproducer of past thread is misleading.

dma_thread.c in
http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html has
align parameter. But it doesn't only change align. Because of, every
worker thread read 4K (pagesize), then
- when offset is page aligned
-> every page is accessed from only one worker
- when offset is not page aligned
-> every page is accessed from two workers

But I don't remember why two threads are important things. hmm.. I'm
looking into the code a while.
Please don't 100% trust me.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


jmoyer at redhat

May 1, 2012, 8:38 AM

Post #10 of 31 (434 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

KOSAKI Motohiro <kosaki.motohiro [at] gmail> writes:

> On Tue, May 1, 2012 at 11:11 AM, Jeff Moyer <jmoyer [at] redhat> wrote:
>> KOSAKI Motohiro <kosaki.motohiro [at] gmail> writes:
>>
>>>> Hello,
>>>>
>>>> Thank you revisit this. But as far as my remember is correct, this issue is NOT
>>>> unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
>>>> DIRECT_IO w/ multi thread process should not use fork().
>>>
>>> The problem is, fork (and its COW logic) assume new access makes cow break,
>>> But page table protection can't detect a DMA write. Therefore DIO may override
>>> shared page data.
>>
>> Hm, I've only seen this with misaligned or multiple sub-page-sized reads
>> in the same page.  AFAIR, aligned, page-sized I/O does not get split.
>> But, I could be wrong...
>
> If my remember is correct, the reproducer of past thread is misleading.
>
> dma_thread.c in
> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html has
> align parameter. But it doesn't only change align. Because of, every
> worker thread read 4K (pagesize), then
> - when offset is page aligned
> -> every page is accessed from only one worker
> - when offset is not page aligned
> -> every page is accessed from two workers
>
> But I don't remember why two threads are important things. hmm.. I'm
> looking into the code a while.
> Please don't 100% trust me.

I bet Andrea or Larry would remember the details.

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


npiggin at gmail

May 1, 2012, 8:50 AM

Post #11 of 31 (432 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

On 2 May 2012 01:38, Jeff Moyer <jmoyer [at] redhat> wrote:
> KOSAKI Motohiro <kosaki.motohiro [at] gmail> writes:
>
>> On Tue, May 1, 2012 at 11:11 AM, Jeff Moyer <jmoyer [at] redhat> wrote:
>>> KOSAKI Motohiro <kosaki.motohiro [at] gmail> writes:
>>>
>>>>> Hello,
>>>>>
>>>>> Thank you revisit this. But as far as my remember is correct, this issue is NOT
>>>>> unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
>>>>> DIRECT_IO w/ multi thread process should not use fork().
>>>>
>>>> The problem is, fork (and its COW logic) assume new access makes cow break,
>>>> But page table protection can't detect a DMA write. Therefore DIO may override
>>>> shared page data.
>>>
>>> Hm, I've only seen this with misaligned or multiple sub-page-sized reads
>>> in the same page.  AFAIR, aligned, page-sized I/O does not get split.
>>> But, I could be wrong...
>>
>> If my remember is correct, the reproducer of past thread is misleading.
>>
>> dma_thread.c in
>> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html has
>> align parameter. But it doesn't only change align. Because of, every
>> worker thread read 4K (pagesize), then
>>  - when offset is page aligned
>>     -> every page is accessed from only one worker
>>  - when offset is not page aligned
>>     -> every page is accessed from two workers
>>
>> But I don't remember why two threads are important things. hmm.. I'm
>> looking into the code a while.
>> Please don't 100% trust me.
>
> I bet Andrea or Larry would remember the details.

KOSAKI-san is correct, I think.

The race is something like this:

DIO-read
page = get_user_pages()
fork()
COW(page)
touch(page)
DMA(page)
page_cache_release(page);

So whether parent or child touches the page, determines who gets the
actual DMA target, and who gets the copy.

2 threads are not required, but it makes the race easier to code and a
larger window, I suspect.

It can also be hit with a single thread, using AIO.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


kosaki.motohiro at gmail

May 1, 2012, 9:15 AM

Post #12 of 31 (431 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

> +suffices. However, if the user buffer is not page aligned and direct read

One more thing. direct write also makes data corruption. Think
following scenario,

1) P1-T1 uses DIO write (and starting dma)
2) P1-T2 call fork() and makes P2
3) P1-T3 write to the dio target page. and then, cow break occur and
original dio target
pages is now owned by P2.
4) P2 write the dio target page. It now does NOT make cow break. and
now we break
dio target page data.
5) DMA transfer write invalid data to disk.

The detail is described in your refer URLs.


> +runs in parallel with a
> +.BR fork (2)
> +of the reader process, it may happen that the read data is split between
> +pages owned by the original process and its child. Thus effectively read
> +data is corrupted.
> .LP
> The
> .B O_DIRECT
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


mtk.manpages at gmail

May 1, 2012, 10:56 AM

Post #13 of 31 (432 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

On Wed, May 2, 2012 at 4:15 AM, KOSAKI Motohiro
<kosaki.motohiro [at] gmail> wrote:
>> +suffices. However, if the user buffer is not page aligned and direct read
>
> One more thing. direct write also makes data corruption. Think
> following scenario,

In the light of all of the comments, can someone revise the man-pages
patch that Jan sent?

Thanks,

Michael


> 1) P1-T1 uses DIO write (and starting dma)
> 2) P1-T2 call fork() and makes P2
> 3) P1-T3 write to the dio target page. and then, cow break occur and
> original dio target
> pages is now owned by P2.
> 4) P2 write the dio target page. It now does NOT make cow break. and
> now we break
> dio target page data.
> 5) DMA transfer write invalid data to disk.
>
> The detail is described in your refer URLs.
>
>
>> +runs in parallel with a
>> +.BR fork (2)
>> +of the reader process, it may happen that the read data is split between
>> +pages owned by the original process and its child. Thus effectively read
>> +data is corrupted.
>> .LP
>> The
>> .B O_DIRECT



--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


aarcange at redhat

May 1, 2012, 4:51 PM

Post #14 of 31 (434 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

Hi Nick!

On Wed, May 02, 2012 at 01:50:46AM +1000, Nick Piggin wrote:
> KOSAKI-san is correct, I think.
>
> The race is something like this:
>
> DIO-read
> page = get_user_pages()
> fork()
> COW(page)
> touch(page)
> DMA(page)
> page_cache_release(page);

Yes. More in general this race happens every time the kernel wrprotect
a writable anon pte, if get_user_pages had a pin on the page while the
pte is being wrprotected.

fork can't just abort (like KSM does) when it notices mapcount <
page_count.

The only way to avoid this, is that somehow the GUP-pinned page should
remain pointed at all times by the pte of the process that pinned the
page (no matter the cows), and that's not happening.

> So whether parent or child touches the page, determines who gets the
> actual DMA target, and who gets the copy.

Correct, so far there are two reproducers, triggering two different
kind of corruption.

The corruption may appear in different ways:

1) we could lose the direct-io read in the parent (if the forked child
does nothing and just quits), that was the basic case in dma_thread.c,
a dummy fork was run just to mark the pte wrprotected

2) the destination of the direct-io read may also become visible to the
child if the child written to the page before the I/O is complete,
leading to random mm corruption in the child

3) it's a direct-io write, then the child could write random data to
disk by accident without noticing, if the DMA wasn't started yet and
the child got the pinned page mapped in the child pte

We had two working fixes for this and personally I'd prefer to apply
them than to document the bug. The probability that who writes code
that can hit the bug is reading the note in the manpage seems pretty
small, especially in the short/mid term. This lkml thread as reminder
may actually have higher chance of being noticed than the manpage
maybe. Nevertheless documenting it is better than nothing if the fixes
aren't applied :). However I'm afraid after we officially document it
the chances of fixing it becomes zero.

> 2 threads are not required, but it makes the race easier to code and a
> larger window, I suspect.
>
> It can also be hit with a single thread, using AIO.

Yes, it requires running fork in the same process that pinned a page
with GUP, and then writing to a buffer in the same page that is under
the GUP pin before the GUP pin is released.

It's not just direct-io, and not just direct-io read (see point 3).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


npiggin at gmail

May 1, 2012, 5:34 PM

Post #15 of 31 (431 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

On 2 May 2012 03:56, Michael Kerrisk (man-pages) <mtk.manpages [at] gmail> wrote:
> On Wed, May 2, 2012 at 4:15 AM, KOSAKI Motohiro
> <kosaki.motohiro [at] gmail> wrote:
>>> +suffices. However, if the user buffer is not page aligned and direct read
>>
>> One more thing. direct write also makes data corruption. Think
>> following scenario,
>
> In the light of all of the comments, can someone revise the man-pages
> patch that Jan sent?

This does not quite describe the entire situation, but something understandable
to developers:

O_DIRECT IOs should never be run concurrently with fork(2) system call,
when the memory buffer is anonymous memory, or comes from mmap(2)
with MAP_PRIVATE.

Any such IOs, whether submitted with asynchronous IO interface or from
another thread in the process, should be quiesced before fork(2) is called.
Failure to do so can result in data corruption and undefined behavior in
parent and child processes.

This restriction does not apply when the memory buffer for the O_DIRECT
IOs comes from mmap(2) with MAP_SHARED or from shmat(2).



Is that on the right track? I feel it might be necessary to describe this
allowance for MAP_SHARED, because some databases may be doing
such things, and anyway it gives apps a potential way to make this work
if concurrent fork + DIO is very important.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


hughd at google

May 1, 2012, 8:04 PM

Post #16 of 31 (432 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

On Wed, 2 May 2012, Nick Piggin wrote:
> On 2 May 2012 03:56, Michael Kerrisk (man-pages) <mtk.manpages [at] gmail> wrote:
> >
> > In the light of all of the comments, can someone revise the man-pages
> > patch that Jan sent?
>
> This does not quite describe the entire situation, but something understandable
> to developers:
>
> O_DIRECT IOs should never be run concurrently with fork(2) system call,
> when the memory buffer is anonymous memory, or comes from mmap(2)
> with MAP_PRIVATE.
>
> Any such IOs, whether submitted with asynchronous IO interface or from
> another thread in the process, should be quiesced before fork(2) is called.
> Failure to do so can result in data corruption and undefined behavior in
> parent and child processes.
>
> This restriction does not apply when the memory buffer for the O_DIRECT
> IOs comes from mmap(2) with MAP_SHARED or from shmat(2).

Nor does this restriction apply when the memory buffer has been advised
as MADV_DONTFORK with madvise(2), ensuring that it will not be available
to the child after fork(2).

>
>
>
> Is that on the right track? I feel it might be necessary to describe this
> allowance for MAP_SHARED, because some databases may be doing
> such things, and anyway it gives apps a potential way to make this work
> if concurrent fork + DIO is very important.

Looks good, but we do need a reference to MADV_DONTFORK, perhaps as above.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


npiggin at gmail

May 1, 2012, 8:10 PM

Post #17 of 31 (435 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

On 2 May 2012 13:04, Hugh Dickins <hughd [at] google> wrote:
> On Wed, 2 May 2012, Nick Piggin wrote:
>> On 2 May 2012 03:56, Michael Kerrisk (man-pages) <mtk.manpages [at] gmail> wrote:
>> >
>> > In the light of all of the comments, can someone revise the man-pages
>> > patch that Jan sent?
>>
>> This does not quite describe the entire situation, but something understandable
>> to developers:
>>
>> O_DIRECT IOs should never be run concurrently with fork(2) system call,
>> when the memory buffer is anonymous memory, or comes from mmap(2)
>> with MAP_PRIVATE.
>>
>> Any such IOs, whether submitted with asynchronous IO interface or from
>> another thread in the process, should be quiesced before fork(2) is called.
>> Failure to do so can result in data corruption and undefined behavior in
>> parent and child processes.
>>
>> This restriction does not apply when the memory buffer for the O_DIRECT
>> IOs comes from mmap(2) with MAP_SHARED or from shmat(2).
>
> Nor does this restriction apply when the memory buffer has been advised
> as MADV_DONTFORK with madvise(2), ensuring that it will not be available
> to the child after fork(2).

Yes of course, I forgot that was exported too.

>
>>
>>
>>
>> Is that on the right track? I feel it might be necessary to describe this
>> allowance for MAP_SHARED, because some databases may be doing
>> such things, and anyway it gives apps a potential way to make this work
>> if concurrent fork + DIO is very important.
>
> Looks good, but we do need a reference to MADV_DONTFORK, perhaps as above.

Yep, thanks Hugh.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


jack at suse

May 2, 2012, 1:17 AM

Post #18 of 31 (422 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

On Wed 02-05-12 01:50:46, Nick Piggin wrote:
> On 2 May 2012 01:38, Jeff Moyer <jmoyer [at] redhat> wrote:
> > KOSAKI Motohiro <kosaki.motohiro [at] gmail> writes:
> >
> >> On Tue, May 1, 2012 at 11:11 AM, Jeff Moyer <jmoyer [at] redhat> wrote:
> >>> KOSAKI Motohiro <kosaki.motohiro [at] gmail> writes:
> >>>
> >>>>> Hello,
> >>>>>
> >>>>> Thank you revisit this. But as far as my remember is correct, this issue is NOT
> >>>>> unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
> >>>>> DIRECT_IO w/ multi thread process should not use fork().
> >>>>
> >>>> The problem is, fork (and its COW logic) assume new access makes cow break,
> >>>> But page table protection can't detect a DMA write. Therefore DIO may override
> >>>> shared page data.
> >>>
> >>> Hm, I've only seen this with misaligned or multiple sub-page-sized reads
> >>> in the same page. AFAIR, aligned, page-sized I/O does not get split.
> >>> But, I could be wrong...
> >>
> >> If my remember is correct, the reproducer of past thread is misleading.
> >>
> >> dma_thread.c in
> >> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html has
> >> align parameter. But it doesn't only change align. Because of, every
> >> worker thread read 4K (pagesize), then
> >> - when offset is page aligned
> >> -> every page is accessed from only one worker
> >> - when offset is not page aligned
> >> -> every page is accessed from two workers
> >>
> >> But I don't remember why two threads are important things. hmm.. I'm
> >> looking into the code a while.
> >> Please don't 100% trust me.
> >
> > I bet Andrea or Larry would remember the details.
>
> KOSAKI-san is correct, I think.
>
> The race is something like this:
>
> DIO-read
> page = get_user_pages()
> fork()
> COW(page)
> touch(page)
> DMA(page)
> page_cache_release(page);
>
> So whether parent or child touches the page, determines who gets the
> actual DMA target, and who gets the copy.
OK, this is roughly what I understood from original threads as well. So
if our buffer is page aligned and its size is page aligned, you would hit
the corruption only if you do modify the buffer while IO to / from that buffer
is in progress. And that would seem like a really bad programming practice
anyway. So I still believe that having everything page size aligned will
effectively remove the problem although I agree it does not aim at the core
of it.

Honza
--
Jan Kara <jack [at] suse>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


npiggin at gmail

May 2, 2012, 2:09 AM

Post #19 of 31 (418 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

On 2 May 2012 18:17, Jan Kara <jack [at] suse> wrote:
> On Wed 02-05-12 01:50:46, Nick Piggin wrote:

>> KOSAKI-san is correct, I think.
>>
>> The race is something like this:
>>
>> DIO-read
>>     page = get_user_pages()
>>                                                         fork()
>>                                                             COW(page)
>>                                                          touch(page)
>>     DMA(page)
>>     page_cache_release(page);
>>
>> So whether parent or child touches the page, determines who gets the
>> actual DMA target, and who gets the copy.
>  OK, this is roughly what I understood from original threads as well. So
> if our buffer is page aligned and its size is page aligned, you would hit
> the corruption only if you do modify the buffer while IO to / from that buffer
> is in progress. And that would seem like a really bad programming practice
> anyway. So I still believe that having everything page size aligned will
> effectively remove the problem although I agree it does not aim at the core
> of it.

I see what you mean.

I'm not sure, though. For most apps it's bad practice I think. If you get into
realm of sophisticated, performance critical IO/storage managers, it would
not surprise me if such concurrent buffer modifications could be allowed.
We allow exactly such a thing in our pagecache layer. Although probably
those would be using shared mmaps for their buffer cache.

I think it is safest to make a default policy of asking for IOs against private
cow-able mappings to be quiesced before fork, so there are no surprises
or reliance on COW details in the mm. Do you think?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


jack at suse

May 2, 2012, 2:18 AM

Post #20 of 31 (422 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

On Wed 02-05-12 19:09:54, Nick Piggin wrote:
> On 2 May 2012 18:17, Jan Kara <jack [at] suse> wrote:
> > On Wed 02-05-12 01:50:46, Nick Piggin wrote:
>
> >> KOSAKI-san is correct, I think.
> >>
> >> The race is something like this:
> >>
> >> DIO-read
> >> page = get_user_pages()
> >> fork()
> >> COW(page)
> >> touch(page)
> >> DMA(page)
> >> page_cache_release(page);
> >>
> >> So whether parent or child touches the page, determines who gets the
> >> actual DMA target, and who gets the copy.
> > OK, this is roughly what I understood from original threads as well. So
> > if our buffer is page aligned and its size is page aligned, you would hit
> > the corruption only if you do modify the buffer while IO to / from that buffer
> > is in progress. And that would seem like a really bad programming practice
> > anyway. So I still believe that having everything page size aligned will
> > effectively remove the problem although I agree it does not aim at the core
> > of it.
>
> I see what you mean.
>
> I'm not sure, though. For most apps it's bad practice I think. If you get into
> realm of sophisticated, performance critical IO/storage managers, it would
> not surprise me if such concurrent buffer modifications could be allowed.
> We allow exactly such a thing in our pagecache layer. Although probably
> those would be using shared mmaps for their buffer cache.
>
> I think it is safest to make a default policy of asking for IOs against private
> cow-able mappings to be quiesced before fork, so there are no surprises
> or reliance on COW details in the mm. Do you think?
Yes, I agree that (and MADV_DONTFORK) is probably the best thing to have
in documentation. Otherwise it's a bit too hairy...

Honza
--
Jan Kara <jack [at] suse>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


jack at suse

May 2, 2012, 2:20 AM

Post #21 of 31 (418 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

On Tue 01-05-12 20:04:15, Hugh Dickins wrote:
> On Wed, 2 May 2012, Nick Piggin wrote:
> > On 2 May 2012 03:56, Michael Kerrisk (man-pages) <mtk.manpages [at] gmail> wrote:
> > >
> > > In the light of all of the comments, can someone revise the man-pages
> > > patch that Jan sent?
> >
> > This does not quite describe the entire situation, but something understandable
> > to developers:
> >
> > O_DIRECT IOs should never be run concurrently with fork(2) system call,
> > when the memory buffer is anonymous memory, or comes from mmap(2)
> > with MAP_PRIVATE.
> >
> > Any such IOs, whether submitted with asynchronous IO interface or from
> > another thread in the process, should be quiesced before fork(2) is called.
> > Failure to do so can result in data corruption and undefined behavior in
> > parent and child processes.
> >
> > This restriction does not apply when the memory buffer for the O_DIRECT
> > IOs comes from mmap(2) with MAP_SHARED or from shmat(2).
>
> Nor does this restriction apply when the memory buffer has been advised
> as MADV_DONTFORK with madvise(2), ensuring that it will not be available
> to the child after fork(2).
Yes, I think with this addition the text is fine.

Honza
--
Jan Kara <jack [at] suse>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


kosaki.motohiro at gmail

May 2, 2012, 12:14 PM

Post #22 of 31 (416 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

Hello,

>> I see what you mean.
>>
>> I'm not sure, though. For most apps it's bad practice I think. If you get into
>> realm of sophisticated, performance critical IO/storage managers, it would
>> not surprise me if such concurrent buffer modifications could be allowed.
>> We allow exactly such a thing in our pagecache layer. Although probably
>> those would be using shared mmaps for their buffer cache.
>>
>> I think it is safest to make a default policy of asking for IOs against private
>> cow-able mappings to be quiesced before fork, so there are no surprises
>> or reliance on COW details in the mm. Do you think?
> Yes, I agree that (and MADV_DONTFORK) is probably the best thing to have
> in documentation. Otherwise it's a bit too hairy...

I neglected this issue for years because Linus asked who need this and
I couldn't
find real world usecase.

Ah, no, not exactly correct. Fujitsu proprietary database had such
usecase. But they
quickly fixed it. Then I couldn't find alternative usecase.

I'm not sure why you say "hairy". Do you mean you have any use case of this?

Thank you.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


jack at suse

May 2, 2012, 12:23 PM

Post #23 of 31 (415 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

On Wed 02-05-12 15:14:33, KOSAKI Motohiro wrote:
> Hello,
>
> >> I see what you mean.
> >>
> >> I'm not sure, though. For most apps it's bad practice I think. If you get into
> >> realm of sophisticated, performance critical IO/storage managers, it would
> >> not surprise me if such concurrent buffer modifications could be allowed.
> >> We allow exactly such a thing in our pagecache layer. Although probably
> >> those would be using shared mmaps for their buffer cache.
> >>
> >> I think it is safest to make a default policy of asking for IOs against private
> >> cow-able mappings to be quiesced before fork, so there are no surprises
> >> or reliance on COW details in the mm. Do you think?
> > Yes, I agree that (and MADV_DONTFORK) is probably the best thing to have
> > in documentation. Otherwise it's a bit too hairy...
>
> I neglected this issue for years because Linus asked who need this and
> I couldn't
> find real world usecase.
>
> Ah, no, not exactly correct. Fujitsu proprietary database had such
> usecase. But they quickly fixed it. Then I couldn't find alternative usecase.
One of our customers hit this bug recently which is why I started to look
at this. But they also modified their application not to hit the problem.

> I'm not sure why you say "hairy". Do you mean you have any use case of this?
I meant that if we should describe conditions like "if you have page
aligned buffer and you don't write to it while the IO is running, the
problem also won't occur", then it's already too detailed and might
easily change in future kernels...

Honza
--
Jan Kara <jack [at] suse>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


kosaki.motohiro at gmail

May 2, 2012, 12:25 PM

Post #24 of 31 (417 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

On Wed, May 2, 2012 at 3:23 PM, Jan Kara <jack [at] suse> wrote:
> On Wed 02-05-12 15:14:33, KOSAKI Motohiro wrote:
>> Hello,
>>
>> >> I see what you mean.
>> >>
>> >> I'm not sure, though. For most apps it's bad practice I think. If you get into
>> >> realm of sophisticated, performance critical IO/storage managers, it would
>> >> not surprise me if such concurrent buffer modifications could be allowed.
>> >> We allow exactly such a thing in our pagecache layer. Although probably
>> >> those would be using shared mmaps for their buffer cache.
>> >>
>> >> I think it is safest to make a default policy of asking for IOs against private
>> >> cow-able mappings to be quiesced before fork, so there are no surprises
>> >> or reliance on COW details in the mm. Do you think?
>> > Yes, I agree that (and MADV_DONTFORK) is probably the best thing to have
>> > in documentation. Otherwise it's a bit too hairy...
>>
>> I neglected this issue for years because Linus asked who need this and
>> I couldn't
>> find real world usecase.
>>
>> Ah, no, not exactly correct. Fujitsu proprietary database had such
>> usecase. But they quickly fixed it. Then I couldn't find alternative usecase.
> One of our customers hit this bug recently which is why I started to look
> at this. But they also modified their application not to hit the problem.
>
>> I'm not sure why you say "hairy". Do you mean you have any use case of this?
> I meant that if we should describe conditions like "if you have page
> aligned buffer and you don't write to it while the IO is running, the
> problem also won't occur", then it's already too detailed and might
> easily change in future kernels...

ok, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


mtk.manpages at gmail

May 5, 2012, 4:28 AM

Post #25 of 31 (408 views)
Permalink
Re: [PATCH] Describe race of direct read and fork for unaligned buffers [In reply to]

On Thu, May 3, 2012 at 7:25 AM, KOSAKI Motohiro
<kosaki.motohiro [at] gmail> wrote:
> On Wed, May 2, 2012 at 3:23 PM, Jan Kara <jack [at] suse> wrote:
>> On Wed 02-05-12 15:14:33, KOSAKI Motohiro wrote:
>>> Hello,
>>>
>>> >> I see what you mean.
>>> >>
>>> >> I'm not sure, though. For most apps it's bad practice I think. If you get into
>>> >> realm of sophisticated, performance critical IO/storage managers, it would
>>> >> not surprise me if such concurrent buffer modifications could be allowed.
>>> >> We allow exactly such a thing in our pagecache layer. Although probably
>>> >> those would be using shared mmaps for their buffer cache.
>>> >>
>>> >> I think it is safest to make a default policy of asking for IOs against private
>>> >> cow-able mappings to be quiesced before fork, so there are no surprises
>>> >> or reliance on COW details in the mm. Do you think?
>>> > Yes, I agree that (and MADV_DONTFORK) is probably the best thing to have
>>> > in documentation. Otherwise it's a bit too hairy...
>>>
>>> I neglected this issue for years because Linus asked who need this and
>>> I couldn't
>>> find real world usecase.
>>>
>>> Ah, no, not exactly correct. Fujitsu proprietary database had such
>>> usecase. But they quickly fixed it. Then I couldn't find alternative usecase.
>> One of our customers hit this bug recently which is why I started to look
>> at this. But they also modified their application not to hit the problem.
>>
>>> I'm not sure why you say "hairy". Do you mean you have any use case of this?
>> I meant that if we should describe conditions like "if you have page
>> aligned buffer and you don't write to it while the IO is running, the
>> problem also won't occur", then it's already too detailed and might
>> easily change in future kernels...

So, am I correct to assume that right text to add to the page is as below?

Nick, can you clarify what you mean by "quiesced"?

[.[.
O_DIRECT IOs should never be run concurrently with fork(2) system call,
when the memory buffer is anonymous memory, or comes from mmap(2)
with MAP_PRIVATE.

Any such IOs, whether submitted with asynchronous IO interface or from
another thread in the process, should be quiesced before fork(2) is called.
Failure to do so can result in data corruption and undefined behavior in
parent and child processes.

This restriction does not apply when the memory buffer for the O_DIRECT
IOs comes from mmap(2) with MAP_SHARED or from shmat(2).
Nor does this restriction apply when the memory buffer has been advised
as MADV_DONTFORK with madvise(2), ensuring that it will not be available
to the child after fork(2).
]]

Thanks,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First page Previous page 1 2 Next page Last page  View All Linux kernel RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.