
Mailing List Archive: Linux: Kernel

application syncing options (was Re: [PATCH] Memory management livelock)

 

 



david at lang

Oct 3, 2008, 8:52 AM

Post #1 of 7
application syncing options (was Re: [PATCH] Memory management livelock)

On Fri, 3 Oct 2008, Nick Piggin wrote:

>> *What* is, forever? Data integrity syncs should have pages operated on
>> in-order, until we get to the end of the range. Circular writeback could
>> go through again, possibly, but no more than once.
>
> OK, I have been able to reproduce it somewhat. It is not a livelock,
> but what is happening is that direct IO read basically does an fsync
> on the file before performing the IO. The fsync gets stuck behind the
> dd that is dirtying the pages, and ends up following behind it and
> doing all its IO for it.
>
> The following patch avoids the issue for direct IO, by using the range
> syncs rather than trying to sync the whole file.
>
> The underlying problem I guess is unchanged. Is it really a problem,
> though? The way I'd love to solve it is actually by adding another bit
> or two to the pagecache radix tree, that can be used to transiently tag
> the tree for future operations. That way we could record the dirty and
> writeback pages up front, and then only bother with operating on them.
>
> That's *if* it really is a problem. I don't have much pity for someone
> doing buffered IO and direct IO to the same pages of the same file :)

I've seen lots of discussions here about different options in syncing. In
this case a fix is to do an fsync of a range. I've also seen discussions of
how the kernel filesystem code can do ordered writes without having to
wait for them with the use of barriers. Is this capability exported to
userspace? If so, could you point me at documentation for it?

David Lang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


mpatocka at redhat

Oct 5, 2008, 5:04 PM

Post #2 of 7
Re: application syncing options (was Re: [PATCH] Memory management livelock)

On Fri, 3 Oct 2008, david [at] lang wrote:

> On Fri, 3 Oct 2008, Nick Piggin wrote:
>
> > > *What* is, forever? Data integrity syncs should have pages operated on
> > > in-order, until we get to the end of the range. Circular writeback could
> > > go through again, possibly, but no more than once.
> >
> > OK, I have been able to reproduce it somewhat. It is not a livelock,
> > but what is happening is that direct IO read basically does an fsync
> > on the file before performing the IO. The fsync gets stuck behind the
> > dd that is dirtying the pages, and ends up following behind it and
> > doing all its IO for it.
> >
> > The following patch avoids the issue for direct IO, by using the range
> > syncs rather than trying to sync the whole file.
> >
> > The underlying problem I guess is unchanged. Is it really a problem,
> > though? The way I'd love to solve it is actually by adding another bit
> > or two to the pagecache radix tree, that can be used to transiently tag
> > the tree for future operations. That way we could record the dirty and
> > writeback pages up front, and then only bother with operating on them.
> >
> > That's *if* it really is a problem. I don't have much pity for someone
> > doing buffered IO and direct IO to the same pages of the same file :)
>
> I've seen lots of discussions here about different options in syncing. in this
> case a fix is to do a fsync of a range.

It fixes the bug in concurrent direct read+buffered write, but won't fix the
bug with concurrent sync+buffered write.

> I've also seen discussions of how the
> kernel filesystem code can do ordered writes without having to wait for them
> with the use of barriers, is this capability exported to userspace? if so,
> could you point me at documentation for it?

It isn't. And it is good that it isn't --- the more complicated the API, the
more maintenance work.

Mikulas

> David Lang
>


david at lang

Oct 5, 2008, 5:19 PM

Post #3 of 7
Re: application syncing options (was Re: [PATCH] Memory management livelock)

On Sun, 5 Oct 2008, Mikulas Patocka wrote:

> On Fri, 3 Oct 2008, david [at] lang wrote:
>
>> I've also seen discussions of how the
>> kernel filesystem code can do ordered writes without having to wait for them
>> with the use of barriers, is this capability exported to userspace? if so,
>> could you point me at documentation for it?
>
> It isn't. And it is good that it isn't --- the more complicated API, the
> more maintenance work.

I can understand that most software would not want to deal with
complications like this, but for things that have requirements similar to
journaling filesystems (databases for example) it would seem that there
would be advantages to exposing this capability.

David Lang


mpatocka at redhat

Oct 5, 2008, 8:42 PM

Post #4 of 7
Re: application syncing options (was Re: [PATCH] Memory management livelock)

On Sun, 5 Oct 2008, david [at] lang wrote:

> On Sun, 5 Oct 2008, Mikulas Patocka wrote:
>
> > On Fri, 3 Oct 2008, david [at] lang wrote:
> >
> > > I've also seen discussions of how the
> > > kernel filesystem code can do ordered writes without having to wait for
> > > them
> > > with the use of barriers, is this capability exported to userspace? if so,
> > > could you point me at documentation for it?
> >
> > It isn't. And it is good that it isn't --- the more complicated API, the
> > more maintenance work.
>
> I can understand that most software would not want to deal with complications
> like this, but for things that have requirements similar to journaling
> filesystems (databases for example) it would seem that there would be
> advantages to exposing this capability.
>
> David Lang

If you invent a new interface that allows submitting several ordered I/Os
from userspace, it will require a lot of maintenance work over a long
period of time. So it is only justified if the performance improvement is
equally large.

It should not be like "here you improve 10% performance on some synthetic
benchmark in one application that was rewritten to support the new
interface" and then create a few more security vulnerabilities (because of
the complexity of the interface) and damage overall Linux progress,
because everyone is catching bugs in the new interface and checking it for
correctness.

Mikulas


david at lang

Oct 6, 2008, 8:37 PM

Post #5 of 7
Re: application syncing options (was Re: [PATCH] Memory management livelock)

On Sun, 5 Oct 2008, Mikulas Patocka wrote:

> On Sun, 5 Oct 2008, david [at] lang wrote:
>
>> On Sun, 5 Oct 2008, Mikulas Patocka wrote:
>>
>>> On Fri, 3 Oct 2008, david [at] lang wrote:
>>>
>>>> I've also seen discussions of how the
>>>> kernel filesystem code can do ordered writes without having to wait for
>>>> them
>>>> with the use of barriers, is this capability exported to userspace? if so,
>>>> could you point me at documentation for it?
>>>
>>> It isn't. And it is good that it isn't --- the more complicated API, the
>>> more maintenance work.
>>
>> I can understand that most software would not want to deal with complications
>> like this, but for things that have requirements similar to journaling
>> filesystems (databases for example) it would seem that there would be
>> advantages to exposing this capability.
>>
>> David Lang
>
> If you invent new interface that allows submitting several ordered IOs
> from userspace, it will require excessive maintenance overhead over long
> period of time. So it should be only justified, if the performance
> improvement is excessive as well.
>
> It should not be like "here you improve 10% performance on some synthetic
> benchmark in one application that was rewritten to support the new
> interface" and then create a few more security vulnerabilities (because of
> the complexity of the interface) and damage overall Linux progress,
> because everyone is catching bugs in the new interface and checking it for
> correctness.

the same benchmarks that show that it's far better for the in-kernel
filesystem code to use write barriers should apply for FUSE filesystems.

this isn't a matter of a few % in performance, if an application is
sync-limited in a way that can be converted to write-ordered the potential
is for the application to speed up by many times.

programs that maintain indexes or caches of data that lives in other files
will be able to write data && barrier && write index && fsync and double
their performance vs write data && fsync && write index && fsync
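For reference, the fsync-based sequence that is available today (write data && fsync && write index && fsync) can be sketched roughly as below. This is only an illustration: the function names and file names are made up for the example, it assumes a POSIX system, and it does not attempt the barrier variant, since the thread states that barriers are not exported to userspace. It also does not fsync the containing directory, which a real application creating new files would have to consider.

```python
import os
import tempfile


def write_all(fd, payload):
    """Loop until every byte is written (os.write may write only part)."""
    view = memoryview(payload)
    while view:
        view = view[os.write(fd, view):]


def append_durably(path, payload):
    """Append payload to path and return only after fsync() has completed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        write_all(fd, payload)
        os.fsync(fd)  # blocks until the kernel reports the file data as written
    finally:
        os.close(fd)


def write_data_then_index(data_path, index_path, data, index_entry):
    """write data && fsync && write index && fsync, in that order."""
    append_durably(data_path, data)          # first wait
    append_durably(index_path, index_entry)  # second wait; index is never written first


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_data_then_index(
            os.path.join(tmp, "records.dat"),  # hypothetical file names
            os.path.join(tmp, "records.idx"),
            b"payload\n",
            b"offset=0 length=8\n",
        )
```

The application blocks twice here; the proposal in this message is about removing the first of those two waits while keeping the ordering.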

databases can potentially do even better, today they need to fsync data to
disk before they can update their journal to indicate that the data has
been written, with a barrier they could order the writes so that the write
to the journal doesn't happen until the writes of the data. they would
never need to call an fsync at all (when emptying the journal)

for systems without solid-state drives or battery-backed caches, the
ability to eliminate fsyncs by being able to rely on the order of the
writes is a huge benefit.

David Lang


mpatocka at redhat

Oct 7, 2008, 8:44 AM

Post #6 of 7
Re: application syncing options (was Re: [PATCH] Memory management livelock)

> > If you invent new interface that allows submitting several ordered IOs
> > from userspace, it will require excessive maintenance overhead over long
> > period of time. So it should be only justified, if the performance
> > improvement is excessive as well.
> >
> > It should not be like "here you improve 10% performance on some synthetic
> > benchmark in one application that was rewritten to support the new
> > interface" and then create a few more security vulnerabilities (because of
> > the complexity of the interface) and damage overall Linux progress,
> > because everyone is catching bugs in the new interface and checking it for
> > correctness.
>
> the same benchmarks that show that it's far better for the in-kernel
> filesystem code to use write barriers should apply for FUSE filesystems.

FUSE is slow by design, and it is used in cases where performance isn't
crucial.

> this isn't a matter of a few % in performance, if an application is
> sync-limited in a way that can be converted to write-ordered the potential is
> for the application to speed up by many times.
>
> programs that maintain indexes or caches of data that lives in other files
> will be able to write data && barrier && write index && fsync and double their
> performance vs write data && fsync && write index && fsync

They can do: write data with O_SYNC; write another piece of data with
O_SYNC.

And the only difference from barriers is the waiting time after the first
O_SYNC before the second I/O is submitted (such delay wouldn't happen with
barriers).
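As a rough illustration of the two O_SYNC writes described above, a userspace program could do something like the following. It is a minimal sketch assuming a Linux or other POSIX system where Python's os module exposes O_SYNC; the helper names are invented for the example.

```python
import os
import tempfile


def o_sync_write(path, payload):
    """Write payload through a descriptor opened with O_SYNC.

    With O_SYNC each write() returns only after the data has been
    written out, so no separate fsync() call is made here.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(payload)
        while view:  # os.write may write fewer bytes than requested
            view = view[os.write(fd, view):]
    finally:
        os.close(fd)


def two_ordered_writes(first_path, first, second_path, second):
    """Order two writes by waiting for the first before issuing the second."""
    o_sync_write(first_path, first)    # the process waits here ...
    o_sync_write(second_path, second)  # ... before the second I/O is submitted


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        two_ordered_writes(
            os.path.join(tmp, "data"), b"data block\n",   # hypothetical names
            os.path.join(tmp, "index"), b"index entry\n",
        )
```

The ordering comes purely from the program waiting between the two calls, which is the delay the message refers to.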

And since I/O delay is in milliseconds and process wakeup time is tens of
microseconds, it doesn't look like eliminating the process wakeup time would
gain more than a few percent.
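To put numbers on this estimate, here is a small back-of-the-envelope calculation. The two figures are assumptions picked to match the orders of magnitude given above (milliseconds of I/O delay, tens of microseconds of wakeup); they are not measurements.

```python
def wakeup_share(io_delay_s, wakeup_s):
    """Fraction of one synchronous write's total wait that is process wakeup."""
    return wakeup_s / (io_delay_s + wakeup_s)


IO_DELAY_S = 5e-3   # assumed: 5 ms for the disk to complete one sync write
WAKEUP_S = 50e-6    # assumed: 50 microseconds to wake the waiting process

if __name__ == "__main__":
    share = wakeup_share(IO_DELAY_S, WAKEUP_S)
    print(f"wakeup is {share:.2%} of the total wait")
```

With these assumed figures the wakeup accounts for roughly 1% of the wait. Note that this only models the wakeup cost, not any additional disk rotations or seeks incurred by issuing the writes one at a time.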

> databases can potentially do even better, today they need to fsync data to
> disk before they can update their journal to indicate that the data has been
> written, with a barrier they could order the writes so that the write to the
> journal doesn't happen until the writes of the data. they would never need to
> call an fsync at all (when emptying the journal)

Good databases can pack several user transactions into one fsync() write.
If the database server is properly engineered, it accumulates all user
transactions committed so far into one chunk, writes that chunk with one
fsync() call and then reports successful commit to the clients.
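The batching scheme described above is often called group commit. A minimal single-threaded sketch of the idea follows; the class and method names are invented for the example and are not taken from any particular database. It assumes a POSIX system.

```python
import os
import tempfile


class GroupCommitLog:
    """Accumulate committed records and make them durable with one fsync()."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.pending = []     # records submitted by clients, not yet durable
        self.fsync_calls = 0  # counter kept only to make the batching visible

    def submit(self, record):
        """Queue a record; the client is NOT told 'committed' yet."""
        self.pending.append(record)

    def flush(self):
        """Write all pending records as one chunk, fsync once.

        Returns the number of transactions that can now be reported
        to their clients as successfully committed.
        """
        if not self.pending:
            return 0
        count = len(self.pending)
        view = memoryview(b"".join(self.pending))
        while view:  # os.write may write fewer bytes than requested
            view = view[os.write(self.fd, view):]
        os.fsync(self.fd)
        self.fsync_calls += 1
        self.pending.clear()
        return count

    def close(self):
        os.close(self.fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = GroupCommitLog(os.path.join(tmp, "journal.log"))  # hypothetical name
        for rec in (b"tx1\n", b"tx2\n", b"tx3\n"):
            log.submit(rec)
        print(log.flush(), "transactions committed with", log.fsync_calls, "fsync")
        log.close()
```

Every queued client still waits for the flush to finish, so the number of fsync() calls goes down while the per-transaction latency does not.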

So if you increase fsync() latency, it should have no effect on the
transactional throughput --- only on latency of transactions. Similarly,
if you decrease fsync() latency, it won't increase number of processed
transactions.

Certainly, there are primitive embedded database libraries that fsync()
after each transaction, but they don't have good performance anyway.

> for systems without solid-state drives or battery-backed caches, the ability
> to eliminate fsyncs by being able to rely on the order of the writes is a huge
> benefit.

I may ask --- where are the applications that are limited by slow fsync()
latency? Databases are not; they batch transactions.

If you want to improve things, you can try:
* implement O_DSYNC (like O_SYNC, but doesn't update inode mtime)
* implement range_fsync and range_fdatasync (sync on file range --- the
kernel already has support for that, you can just add a syscall)
* turn on FUA bit for O_DSYNC writes, that eliminates the need to flush
drive cache in O_DSYNC call

--- these are definitely less invasive than new I/O submitting interface.
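From the application side, the data-only variants in the list above would be used roughly as sketched below. This assumes a Linux system where Python's os module provides fdatasync() and O_DSYNC; whether the kernel in use actually treats O_DSYNC differently from O_SYNC is exactly what the first suggestion is about, so the sketch shows the calling pattern only. The range variants are not shown because Python's os module has no call for them, and the function names here are invented for the example.

```python
import os
import tempfile


def write_all(fd, payload):
    """Loop until every byte is written (os.write may write only part)."""
    view = memoryview(payload)
    while view:
        view = view[os.write(fd, view):]


def append_with_fdatasync(path, payload):
    """Append, then sync the file data with an explicit fdatasync() call."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        write_all(fd, payload)
        os.fdatasync(fd)
    finally:
        os.close(fd)


def append_with_o_dsync(path, payload):
    """Append through an O_DSYNC descriptor; each write() is itself synchronous."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_DSYNC
    fd = os.open(path, flags, 0o644)
    try:
        write_all(fd, payload)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "journal")  # hypothetical file name
        append_with_fdatasync(path, b"record 1\n")
        append_with_o_dsync(path, b"record 2\n")
```

Both helpers still block until the write has completed, which is the point raised in the reply that follows.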

Mikulas

> David Lang
>


david at lang

Oct 7, 2008, 10:16 AM

Post #7 of 7
Re: application syncing options (was Re: [PATCH] Memory management livelock)

On Tue, 7 Oct 2008, Mikulas Patocka wrote:

>>> If you invent new interface that allows submitting several ordered IOs
>>> from userspace, it will require excessive maintenance overhead over long
>>> period of time. So it should be only justified, if the performance
>>> improvement is excessive as well.
>>>
>>> It should not be like "here you improve 10% performance on some synthetic
>>> benchmark in one application that was rewritten to support the new
>>> interface" and then create a few more security vulnerabilities (because of
>>> the complexity of the interface) and damage overall Linux progress,
>>> because everyone is catching bugs in the new interface and checking it for
>>> correctness.
>>
>> the same benchmarks that show that it's far better for the in-kernel
>> filesystem code to use write barriers should apply for FUSE filesystems.
>
> FUSE is slow by design, and it is used in cases where performance isn't
> crucial.

FUSE is slow, but I don't believe that it's a design goal for it to be
slow, it's a limitation of the implementation. so things that could speed
it up would be a good thing.

>> this isn't a matter of a few % in performance, if an application is
>> sync-limited in a way that can be converted to write-ordered the potential is
>> for the application to speed up by many times.
>>
>> programs that maintain indexes or caches of data that lives in other files
>> will be able to write data && barrier && write index && fsync and double their
>> performance vs write data && fsync && write index && fsync
>
> They can do: write data with O_SYNC; write another piece of data with
> O_SYNC.
>
> And the only difference from barriers is the waiting time after the first
> O_SYNC before the second I/O is submitted (such delay wouldn't happen with
> barriers).
>
> And now I/O delay is in milliseconds and process wakeup time is tens of
> microseconds, it doesn't look like eliminating process wakeup time would
> do more than few percents.

each sync write needs to wait for a disk rotation (and a seek if you are
writing to different files). if you only do two writes you save one disk
rotation, if you do five writes you save four disk rotations
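The rotation argument can be made concrete with a small calculation. The model is deliberately crude and the 7200 RPM figure is an assumption chosen for illustration: each synchronous write is charged one full rotation, and an ordered batch is charged a single rotation in total. Seeks are ignored.

```python
def rotation_ms(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm


def sync_wait_ms(n_writes, rpm):
    """Model: every synchronous write waits for one full rotation."""
    return n_writes * rotation_ms(rpm)


def rotations_saved(n_writes):
    """Model: an ordered batch costs one rotation instead of n_writes."""
    return max(n_writes - 1, 0)


if __name__ == "__main__":
    rpm = 7200  # assumed drive speed
    for n in (2, 5):
        saved_ms = rotations_saved(n) * rotation_ms(rpm)
        print(f"{n} sync writes: {sync_wait_ms(n, rpm):.1f} ms waiting, "
              f"{rotations_saved(n)} rotation(s) ({saved_ms:.1f} ms) saved if ordered")
```

At the assumed 7200 RPM one rotation takes about 8.3 ms, so five dependent sync writes spend around 42 ms waiting under this model, of which about 33 ms is what ordered writes would avoid.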

>> databases can potentially do even better, today they need to fsync data to
>> disk before they can update their journal to indicate that the data has been
>> written, with a barrier they could order the writes so that the write to the
>> journal doesn't happen until the writes of the data. they would never need to
>> call an fsync at all (when emptying the journal)
>
> Good databases can pack several user transactions into one fsync() write.
> If the database server is properly engineered, it accumulates all user
> transactions committed so far into one chunk, writes that chunk with one
> fsync() call and then reports successful commit to the clients.

if there are multiple users doing transactions at the same time they will
benefit from overlapping the fsyncs. but each user session cannot complete
their transaction until the fsync completes

> So if you increase fsync() latency, it should have no effect on the
> transactional throughput --- only on latency of transactions. Similarly,
> if you decrease fsync() latency, it won't increase number of processed
> transactions.

only if you have all your transactions happening in parallel. in the real
world programs sometimes need to wait for one transaction to complete so
that they can do the next one.
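A simple model of that serial case: if each transaction must be durable before the next one can start, a single client cannot batch, and its throughput is bounded by the fsync latency. The latencies below are assumed values used only to show the relationship.

```python
def serial_tx_per_second(fsync_latency_s):
    """Upper bound on throughput when each transaction waits for the previous fsync."""
    return 1.0 / fsync_latency_s


def time_for_chain(n_transactions, fsync_latency_s):
    """Total time for n dependent transactions, one fsync each."""
    return n_transactions * fsync_latency_s


if __name__ == "__main__":
    for latency in (0.010, 0.005):  # assumed: 10 ms and 5 ms per fsync
        print(f"fsync latency {latency * 1000:.0f} ms: "
              f"at most {serial_tx_per_second(latency):.0f} dependent tx/s")
```

In this model halving the fsync latency doubles the rate of dependent transactions, which is the opposite of the fully parallel case where batching hides the latency.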

> Certainly, there are primitive embedded database libraries that fsync()
> after each transaction, but they don't have good performance anyway.
>
>> for systems without solid-state drives or battery-backed caches, the ability
>> to eliminate fsyncs by being able to rely on the order of the writes is a huge
>> benefit.
>
> I may ask --- where are the applications that require extra slow fsync()
> latency? Databases are not that, they batch transactions.
>
> If you want to improve things, you can try:
> * implement O_DSYNC (like O_SYNC, but doesn't update inode mtime)
> * implement range_fsync and range_fdatasync (sync on file range --- the
> kernel has already support for that, you can just add a syscall)
> * turn on FUA bit for O_DSYNC writes, that eliminates the need to flush
> drive cache in O_DSYNC call
>
> --- these are definitely less invasive than new I/O submitting interface.

but all of these require that the application stop and wait for each
separate write to take place before proceeding to the next step.

if this doesn't matter, then why the big push to have the in-kernel
filesystems start using barriers? I understood that this resulted in large
performance increases in the places that they are used from just being
able to avoid having to drain the entire request queue, and you are saying
that the applications would not only need to wait for the queue to flush,
but for the disk to acknowledge the write.

syncs are slow, in some cases _very_ slow.

David Lang
