
Mailing List Archive: Perl: porters

some good news && bad news on fsync

 

 



tchrist at perl

Mar 22, 2009, 10:19 AM


I have good news and I have bad news, and the *first* good news is that my
news isn't very long. :-) If you'd prefer to cut to the chase, search for
/vipw where the new and interesting parts begin.

SUMMARY: Design advice/suggestions strongly requested. I don't know how
much of this should be in the core, or whether there should be
plain ol' modules for them, or new or improved pragmata, or new
I/O layers, or command-line flags, or enviro settings, or what.
There's a lot of room for design decisions, but I for one am
afraid of I/O layers. I'd rather chase vnode ops!

Please bear with me as I replay a highly excerpted version of what we've
been talking about, to provide a refresh and also context for folks only
now dropping by. I *have* gone to some small trouble to make it all more
legible via quoting and indentation (thank you, Damian, for the marvels of
Text::Autoformat, even if it isn't always perfect).

Tom »»» Consider the MacOS manpage of fsync:

Mark>»» Don't care about MacOS. Sorry. :-)

And there you're making a mistake, because their warning *is* pertinent
to this issue and not in any way O/S dependent. You just think it is.

Sorry.

I'll explain why momentarily.

And Mark: you're violating the spirit of Henry's 10th Commandment,
as Ted pointed out:

You may or may not be old enough to remember Henry Spencer's Ten
Commandments for C Programmers, in particular the 10th commandment:

10. Thou shalt foreswear, renounce, and abjure the vile heresy which
claimeth that "All the world's a VAX", and have no commerce with
the benighted heathens who cling to this barbarous belief, that
the days of thy program may be long even though the days of thy
current machine be short.

Just s/VAX/Linux/ and you'll see your sin exposed. Penitenziagite!

Mark»» I think you are saying that even if you agreed to fsync being
Mark»» recommended for applications that need to ensure user application
Mark»» data consistency that use rename() to accomplish change-in-place,
Mark»» doing fsync() before Perl close() is insufficient?

Mark»» What about enabling auto-flush, print(), fsync(), close()?

Mark»» Even if it isn't 100% - do you agree that doing the fsync()
Mark»» increases the odds that a file system restore after system
Mark»» failure will significantly increase the chances of the file
Mark»» having content before doing the rename()? If it's a probability
Mark»» thing, might it still not be worth it for important user
Mark»» application data?

Mark»» Oh - one more - atomic rename() isn't actually part of this
Mark»» discussion - It's atomic change-in-place. That is, you have a
Mark»» file such as /etc/passwd that you want to change-in-place.

Mark»» Here is the common scenario:

Mark»» User is smart enough to know that open("> /etc/passwd") is a bad
Mark»» idea because if the system fails while writing /etc/passwd, it
Mark»» can be empty or half-written during startup. Users decide to be
Mark»» clever. What if we open("> /etc/passwd.new.$$") and once we are
Mark»» sure the file is good, we rename("/etc/passwd.new.$$",
Mark»» "/etc/passwd")? Seems like a GREAT idea? This will surely protect
Mark»» us in the case of system failure?

Mark»» 1) User is ignorant of the fact that the write() to
Mark»» /etc/passwd.new.$$ touched the file, but the rename()
Mark»» touched the directory, and the file system has license
Mark»» to change the order underneath.

It's worse than that, even. But your mentioning of that in particular
is what indirectly triggered a cascade that led at last to clarity,
as you shall soon see.

Mark»» 2) User is ignorant of rename() possibly being a
Mark»» link()/unlink() underneath.

Chas>» My concern is that sysadmins using one-liners like

Chas>» find ./ -type f | xargs perl -pi -e 's/foo/bar/'

I'll grant you your concern, but my viewpoint is that
cavalierly discarding backup files is foolishly error-prone.

Think but of those times you've done perl -i and forgotten the
-p or -n -- and thus landed with absolutely nothing in your
poor zeroed files.

Chas>» will get bitten when the machine crashes and the only solution is
Chas>» to rewrite it as a full Perl script to get the desired behaviour
Chas>» (which seems like massive overkill).  The worry about forcing
Chas>» fsync seems to be that if you make it mandatory it will degrade
Chas>» performance, but -i seems to be a special case where performance
Chas>» doesn't matter as much as being correct.

Tom >» I'm pretty sure this *must* be the responsibility of the kernel,
Tom >» not that of the application (user program) or the run-time system
Tom >» or infrastructure (be it perl, libc, etc).

Tom >» On the other hand, most of the manpages I cite *do* say:

Tom >» fsync() should be used by programs that require a file
Tom >» to be in a known state, for example, in building a simple
Tom >» transaction facility.

Tom >» which may put the onus back on the user program.

Mark» Yep. That's my starting point. We're coming at an unfortunate
Mark» truth from different starting points.

Tom >» In any event, I'm still far from sure it's a perl problem.

Mark» Given the code that I saw from Chas that does unlink() before
Mark» rename(), it makes it clear that it is not Perl's problem. Since
Mark» Perl didn't guarantee atomic change-in-place before - why should
Mark» it now? :-)

John» This looks like a special case. Could you not force the fsync
John» only if Perl is given the -i flag?

Chas> I would think it would be as simple as adding an fsync before the
Chas> close in Perl_nextargv:

Chas>     however_you_call_fsync_on_gv(gv);
Chas>     do_close(gv,FALSE);
Chas>     (void)PerlLIO_unlink(SvPVX_const(sv));
Chas>     (void)PerlLIO_rename(PL_oldname,SvPVX_const(sv));
Chas>     do_open(gv,(char*)SvPVX_const(sv),SvCUR(sv),PL_inplace!=0,
Chas>             O_RDONLY,0,NULL);

Chas> At least I think so, based on the following code from the
Chas> Theodore Ts'o article[1] quoted earlier.

Chas> 1. http://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/

I've now read Ted's cited posting plus all 188 of the comments posted so far.
That's 5211 lines of text, so well over *200* 24-line screens. If you
think *my* stuff is long...

My GOODNESS but there's a lot of controversy, and sometimes bitterness as
well, all over little old fsync(2)! But if I'd had a system toasted because
of this, you may rest uneasy that I would be more vocal than any of
these patient folks were. But you probably knew that already.

Three things struck me that Ted said:

T1. What I was saying is that number one, open-write-fsync-close-rename
really only provides atomicity and not durability, because the
containing directory isn't fsync'ed. [...] I question how common
those requirements really are, though, and whether we really need to
optimize for such a case.

Both those sentences merit serious reflection.
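
For anyone who wants the sequence Ted names spelled out, here is a
minimal C sketch of open-write-fsync-close-rename. The function name
replace_file and the ".new" suffix are made up for illustration; they are
not taken from perl, vipw, or Ted's article. As Ted says, this aims only
at atomicity: nothing here syncs the containing directory.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes to "<path>.new", fsync the data,
 * close, then rename the temporary file over path.
 * Returns 0 on success, -1 on failure. */
int replace_file(const char *path, const void *data, size_t len)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.new", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;                  /* path too long for the buffer */

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = data;
    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before writing: retry */
            goto fail;
        }
        p += w;
        len -= (size_t)w;
    }

    if (fsync(fd) != 0)             /* flush file data before the rename */
        goto fail;
    if (close(fd) != 0) {
        fd = -1;                    /* descriptor is gone either way */
        goto fail;
    }
    fd = -1;

    if (rename(tmp, path) != 0)     /* the directory itself is NOT synced */
        goto fail;
    return 0;

fail:
    if (fd >= 0)
        close(fd);
    unlink(tmp);                    /* best-effort cleanup of the temp file */
    return -1;
}
```

A caller would use it as replace_file("passwd.test", buf, buflen). Whether
the new name survives a crash still depends on the directory reaching
disk, which this sketch deliberately leaves out.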

T2. Historically, BSD FFS sync'ed out meta-data every 5 seconds, and
data blocks every 30 seconds. I think you may be confusing the fact
that many file systems, including BSD's FFS and Reiserfs (from which
people seem to be fond of quoting its design document) implemented
an atomic rename operation, in terms of what happened to the
directory entry, but BSD FFS at least never implemented anything
like what you were describing as far as I know, and I've had people
comment that they've seen reiserfs generate zero-length files on a
crash. (Although perhaps that was due to some application doing an
open-truncate-write-close operation.)

This's not quite the way it works anymore, but close enough.

T3. The Linux block driver folks discussed it in great detail about two
years ago, and decided that for the sorts of things the Linux kernel
needed, FUA really didn't do enough to be worth battling the many
hard drives that simply implemented FUA wrongly, and SATA
controllers which silently dropped the FUA bit without letting the
device driver know, etc.

Ted also reminds people that just concentrating on Linux was
wrong-headed, even considering alone how many Solaris and BSD
(that includes Macs, remember) systems there are really out there.

His first cited comment sent me to check the whole user-space source tree
for instances of calling fsync(2). They were *REMARKABLY* scarce! If it
were something generally needed, I'd've thought I'd see more of it. But I
didn't, which was curious.

And I was VERY interested that vipw(8) didn't use it, which is where I
started looking (nor do its libraries in libutil) because of Mark's comment.

However, I know why it's not there now, for one of the very few places it
*is* used is where Keith rather amusingly writes:

/usr/src/usr.bin/vi/ex/ex_write.c

    /*
     * XXX
     * I don't trust NFS -- check to make sure that we're talking to
     * a regular file and sync so that NFS is forced to flush.
     */
    if (!fstat(fileno(fp), &sb) &&
        S_ISREG(sb.st_mode) && fsync(fileno(fp)))
            goto err;

Not that I don't share Keith's sentiment, mind you.

Considering Ted's 3rd comment, I wondered about device controllers and
sync'd-to-iron-oxide directives.

So I stuck my head into /sys/dev/ata/wd.c.

And *THAT* was the precise point at which I decided things had gone quite
far enough, thank you very much. I've spent too many days sight-seeing
/sys, but this was just beyond the pale. Clearly I needed professional
help, and fast.

So, of course, I called on Kirk to set me straight.

And THAT'S both the good news *and* the bad, for I can now briefly
explain how matters are *both* worse and better than I'd previously
understood them to be.

It turns out that the warning in the MacOS fsync(2) manpage, and Ted's
comment #3, are INDEED relevant. It's worse than you know: there are ATA
disks that actually accept directives to commit to iron-oxide and LIE to
you, continuing to do as they please. The latest version of the ATA tag-
queueing spec finally gets this right (demands no lies be told, and that it
be done), but honest compliance is a different matter. SCSI disks do not
have this failing, and I've always run SCSI systems even on my Intel boxes.
This may have granted me some degree of protection from disaster.

So here's the bad news.

It also turns out that Mark was right in that even with soft updates on
FFS, it *is* POSSIBLE to get a 0-length file. Kirk advises that for
reliability, one must do as Keith does in vi: an fsync(2) on the FD
right before you close it.

But he goes further than that, and this now is where the good
news comes in.

See, just because you have the file's data where it belongs, if you're
about to do a rename, or rather, have just done one, you have still
another problem. Available choices number three:

    A: sync the whole filesystem (VERY SLOW)
    B: open the parent directory and fsync that (SLOW)
    C: open the target of the rename and fsync that (QUICK,
       but may not work on all systems)

Well, guess what? Turns out that choice C on BSD is a *very* good choice,
because here, syncing a file doesn't sync that file alone, but also each
component name in /the/entire/directory/path all the way up to the root!
That means you don't have to pay for B, let alone A, to get what you want
to happen in a DWIMmer sort of way.

So it tidily and automagically solves the rename problem, since you
don't have to pay for the expensive fsync(dirfd(opendir(PARENT))).
Good dweomer, not foul.

[ BUG BUG BUG: Perl's fileno(DH) doesn't turn into dirfd(3)! ]

Ok, fine: obviously it can't do so for open files wholly unlinked from
the filesystem, but that's perfectly fine, as namei couldn't find
those anyway. :-)

Isn't that SWELL? Pity for Mark though, since he stated that he doesn't
care about systems that'd give a "it just works" solution. Sorry.

Now about perl -i.

My understanding is that the primary concern involves the processing of many
files when that involves an implicit rename WITHOUT a backup, such as
might occur in cavalier commands like:

$ perl -pi -e 's/foo/bar/' *.[ch]

Instead of the infinitely more prudent:

$ perl -p -i.orig -e 's/foo/bar/' *.[ch]
or
$ perl -p -i'SAVDIR/*' -e 's/foo/bar/' *.[ch]

Because the most worrisome scenario is when you have *so* many files
that you really need help from xargs to process them (NCARGS limit):

$ find dir -type f -print0 | xargs -0 perl -pi-PREFKT -e 's/foo/bar/'

I think one should run some latency and concurrency tests. I'm
very worried about the price of implementing "the fix", especially
on non-BSD systems.

Why? Well, imagine processing hundreds, perhaps even thousands of files.
If you have to pay 1-5s on an unloaded system to fsync, would you *still*
want this done? Really? Even if it meant going from updating hundreds of
files per second to seconds per each of those hundred files?

About this scenario, Kirk wrote (quoted with permission):

If you are going to do a bulk rename it is probably more sensible
to move all of them into their new directory, then just do a single
fsync of the destination directory. That will be quicker than doing
the fsync on each file in turn.

POSIX does not specify that an fsync has to sync the name(s) only
that it sync the data. While BSD syncs the name(s) and checks for
changes all the way to the root, most other systems like Linux do
not (as far as I know). For most systems you must sync every
modified directory to ensure that all the name(s) have been written.

(I have enough problems with huge MH directories holding zillions
of files, as last time I checked, namei was still linear on FFS.)

I DON'T think we dare do this fsync for every situation. Seems best
to require a pragma or something, but then the people who most need
it won't ever think/know to use it.

Bit of a conundrum, really.

Anybody got fresh ideas on this?

Me, I've been thinking about the open pragma. I do use it, but I think it
could do more. Right now it's mostly just about encodings, although it
allows for :crlf and :raw layers, too. How 'bout if there were a way to
specify that opens (not sysopens) defaulted to having a couple more
capabilities?

I'm thinking specifically of it being able to turn O_EXLOCK on by default,
which the -i DOES NOT DO: TSK-TSK-TSK!!

Maybe it could be a :lock "layer"? Or for the O_SYNC flag, a :sync.

No, that won't work. That's going to force *all* output to fsync, and
that's not what seems desired here. One would rather have fsync on close
only. Hm.
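
At the system-call level, the two behaviours being contrasted look
something like the sketch below. Both function names are invented for
illustration, and nothing here is how a perl layer or pragma would
actually be wired up.

```c
#include <fcntl.h>
#include <unistd.h>

/* What a ":sync"-style open would amount to: with O_SYNC, EVERY write()
 * on the descriptor waits until the data has been handed to the device.
 * Returns a file descriptor, or -1 on failure. */
int open_sync_every_write(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
}

/* What seems to be wanted instead: ordinary buffered writes, then one
 * fsync() immediately before close().  Returns 0 on success, -1 if
 * either the fsync or the close fails. */
int close_with_fsync(int fd)
{
    int rc = fsync(fd);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

The first pays the synchronous cost on every write; the second pays it
once per file, which is the trade-off at issue for something like -i.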

But *some* sort of pragma, not necessarily C<< use open OUT => ":sync" >>,
could specify this, and moreover, it could also be respected by rename()
and truncate(), implementing the directory stuff per Kirk's directions
that I've relayed above.

I don't know how much of this should be in the core, or whether there
should be plain ol' modules for them, or new or improved pragmata, or new
i/o layers, or command line flags, or enviro settings, or what. There's a
lot of room for design decisions, but I for one am afraid of I/O layers.

Anyway, couldn't we please fix the silenced flush and close bugs first?

--tom

PS: I'm still testing out the no-emoticon=no-laugh-track mode of humor.
Within this posting is probably one of the most ironically hilarious
things I've ever written to this list. If anybody thinks they know
what it is, send me private email and I'll tell you if you "got" it.
This will be a metric of how well the no-smileys thing's working out.

--

"When I read commentary about suggestions for where C should go,
I often think back and give thanks that it wasn't developed
under the advice of a worldwide crowd."
--Dennis Ritchie

------- End of Unsent Draft
