[Python-ideas] Re: fsync-on-close io object

27 Dec 2020

      On Thu, Dec 24, 2020 at 6:18 PM Cameron Simpson  wrote:
...
On 25Dec2020 09:29, Steven D'Aprano  wrote:
...
On Thu, Dec 24, 2020 at 12:15:08PM -0500, Michael A. Smith wrote:
...
With all the buffering that modern disks and filesystems do, a
specific question has come up a few times with respect to whether or
not data was actually written after flush. I think it would be pretty
useful for the standard library to have a variant in the io module
that would explicitly fsync on close.
One argument against this idea is that "disks and file systems buffer
for a reason, you should trust them, explicitly calling sync after every
written file is just going to slow I/O down".
Personally I don't believe this argument: I've been bitten many, many
times until I learned to explicitly sync files, but it's an argument you
should counter.
By contrast, I support this argument. The _vast_ majority of things
don't need to sync their data all the way to the hardware base substrate
(eg magnetic fields on spinning rust).
And on the whole, if I do care, I issue a single sync() call at the end
of a large task (typically interactively, at a prompt!) rather than
forcing a heap of performance impairing stutters all the way through
some process because many per-file syncs force that.
IMO, per-file syncs fall into the "policy" arena: aside from low level
tools (example: fdisk, a disc partition editor), to my mind the purpose
of the kernel is to accept responsibility for my data when I hand it
off.
Perhaps for you that isn't enough; for me it normally is. And when it
isn't, I'll take steps myself, _outside_ the programme, to ensure the
sync or commit or off site backup is complete when it matters. Thus the
policy is in my hands.
A tool which forces a per-file sync on every close, or even after
every write, is a performance killer. The faster our hardware, the less
that may seem to matter (and, conversely, the less the risk as the
ordinary kernel I/O flushing will catch up faster). But when the
hardware slowness _is_ relevant, if I can't turn that off I have a
needlessly unperformant task.
The example which stands out in my own mind is when I was using firefox
on a laptop with a spinning rust hard drive (and, being laptop
hardware, a low power, physically slow piece of spinning rust). There was
once a setting to turn off the synchronous-write sqlite setting (used
for history and bookmarks). That was _visibly obvious_ in the user
experience. And I turned it off. As a matter of policy, those data
didn't need such care.

...
So I'm resistant to this kind of thing because IMO it leads to an
attractive nuisance: over use of sync or fsync for everything. And it
will usually not be exposed as policy the user can adjust/disable.
My rule of thumb:
If it can't be turned off, it's not a feature. - Karl Heuer
Are you arguing that if something is a bad idea to overuse, even if
it's a good idea sometimes, then it shouldn't be allowed into Python,
because someone might write a program that abuses that feature, you
might end up with that program, and it would be irksome to deal with
it?

I'm not trying to present a straw man, but that is my genuine
impression of what you said. If I got it wrong, I apologize and please
help me understand what you meant.
...
...
Another argument is that even syncing your data doesn't mean that the
data is actually written to disk, since the hardware can lie. On the
other hand, I don't know what anyone can do, not even the kernel, in the
face of deceitful hardware.
Aye.
But in principle, after a sync() or fsync() the kernel at least believes
the data has been written. Hardware which lies, or which claims saved
data without having the resources to guarantee it (eg a small battery to
complete the writes if there's a power out) is indeed nasty.
...
...
You might be tempted to argue that this can be done very easily in
Python already, so why include it in the standard io module?
I would indeed. There _should_ be a small bar which at least causes the
programmer to think "do I really need this here"? I suppose a
"fsync=False" default parameter is a visible bar.
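A minimal sketch of what such an opt-in parameter could look like, written as a wrapper rather than a change to open() itself. The name open_synced and its signature are hypothetical and exist only for illustration; nothing like it is in the io module.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def open_synced(path, mode="w", *, fsync=False, **kwargs):
    """Hypothetical wrapper around open(); not part of the io module."""
    f = open(path, mode, **kwargs)
    try:
        yield f
        if fsync:
            f.flush()             # push Python's own buffer to the OS
            os.fsync(f.fileno())  # ask the OS to push this file to the device
    finally:
        f.close()                 # close always happens after any fsync


# Usage: syncing is off by default, so the caller has to ask for it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open_synced(demo_path, "w", fsync=True) as f:
    f.write("stuff")
```

With the default fsync=False this behaves like a plain open(), which keeps the sync a visible, per-call decision.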
...
[...]
...
I mean, the obvious way is:
    try:
        with open(..., 'w') as f:
            f.write("stuff")
    finally:
        os.sync()
An os.fsync(f.fileno()) is lower impact - os.sync() requests a sync of
all filesystems.
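As a rough sketch of the narrower approach: flush Python's buffer, then fsync only the one file descriptor while the file is still open. The helper name write_and_fsync is made up for this example.

```python
import os
import tempfile


def write_and_fsync(path, data):
    """Illustrative helper: write text and sync only this file."""
    with open(path, "w") as f:
        f.write(data)
        f.flush()             # Python-level buffer -> OS
        os.fsync(f.fileno())  # OS cache -> device, for this file only


demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
write_and_fsync(demo_path, "stuff")
```

The flush() matters: os.fsync() only knows about data the OS has already received, not data still sitting in the Python file object's buffer.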
...
so maybe all we really need is a "sync file" context manager.
Aye. Fully agree here, and frankly think this is a "write your own"
situation. Except, of course, that like all "write your own" one/few
liners there will be suboptimal or buggy ones released. Such as the
"overly wide sync" from your os.sync() above.
Personally I'm -1 on this. A context manager which does f.flush() then
os.fsync(f.fileno()) seems plenty, and easy to roll your own.
There are very smart people on this list who have already demonstrated
that there is more than one way to do it, and that it's not obvious.
So, it's not easy to roll your own correctly.

I love context managers when they're alone, but I dislike stacking
them. It is less clear how we can ensure the fsync happens exactly
between flush and close with a context manager than a keyword argument
to open. That is, if open is the only context manager, everything is
great. But if it is up to users to stack context managers including open
and some fsync, I think correct ordering will be a problem.
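To make the ordering concern concrete, here is a small sketch with a hypothetical helper, fsync_on_exit, that has to be stacked with open(). It is an assumption for illustration, not an existing API. Stacked managers exit in reverse order, so the helper must be entered after open() for the fsync to land between flush and close.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def fsync_on_exit(f):
    """Hypothetical helper: flush and fsync an already-open file on exit."""
    try:
        yield f
    finally:
        f.flush()
        os.fsync(f.fileno())


demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Correct stacking: the helper exits first, so fsync runs before close().
with open(demo_path, "w") as f, fsync_on_exit(f):
    f.write("stuff")

# Misordered: the file is already closed when the helper tries to sync it.
try:
    with open(demo_path, "a") as g:
        g.write("more")
    with fsync_on_exit(g):
        pass
except ValueError as exc:
    misorder_error = exc
else:
    misorder_error = None
```

In this particular sketch the mistake fails loudly with a ValueError, because flush() is called on a closed file. Other hand-rolled designs may not be so forgiving, which is the case for a single keyword argument on open.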

Thank you for engaging on this topic.


Michael Smith