On Thu, Dec 24, 2020 at 6:18 PM Cameron Simpson wrote:
On 25Dec2020 09:29, Steven D'Aprano wrote:
On Thu, Dec 24, 2020 at 12:15:08PM -0500, Michael A. Smith wrote:
With all the buffering that modern disks and filesystems do, a specific question has come up a few times with respect to whether or not data was actually written after flush. I think it would be pretty useful for the standard library to have a variant in the io module that would explicitly fsync on close.
One argument against this idea is that "disks and file systems buffer for a reason, you should trust them, explicitly calling sync after every written file is just going to slow I/O down".
Personally I don't believe this argument. I was bitten many, many times until I learned to explicitly sync files, but it's an argument you should counter.
By contrast, I support this argument. The _vast_ majority of things don't need to sync their data all the way to the hardware base substrate (eg magnetic fields on spinning rust).
And on the whole, if I do care, I issue a single sync() call at the end of a large task (typically interactively, at a prompt!) rather than forcing a heap of performance-impairing stutters all the way through some process with many per-file syncs.
IMO, per-file syncs fall into the "policy" arena: aside from low level tools (example: fdisk, a disc partition editor), to my mind the purpose of the kernel is to accept responsibility for my data when I hand it off.
Perhaps for you that isn't enough; for me it normally is. And when it isn't, I'll take steps myself, _outside_ the programme, to ensure the sync or commit or off site backup is complete when it matters. Thus the policy is in my hands.
The tool which causes a per-file sync on every close, or even after every write, is a performance killer. The faster our hardware, the less that may seem to matter (and, conversely, the less the risk, as the ordinary kernel I/O flushing will catch up faster). But when the hardware slowness _is_ relevant, if I can't turn that off I have a needlessly unperformant task.
The example which stands out in my own mind is when I was using Firefox on a laptop with a spinning rust hard drive (and, being laptop hardware, a low-power, physically slow piece of spinning rust). There was once a setting to turn off SQLite's synchronous-write mode (used for history and bookmarks). The cost of those synchronous writes was _visibly obvious_ in the user experience. And I turned it off. As a matter of policy, those data didn't need such care.
So I'm resistant to this kind of thing because IMO it leads to an attractive nuisance: overuse of sync or fsync for everything. And it will usually not be exposed as policy the user can adjust/disable.
My rule of thumb:
If it can't be turned off, it's not a feature. - Karl Heuer
Are you arguing that if something is a bad idea to overuse, even if it's a good idea sometimes, then it shouldn't be allowed into Python, because someone might write a program that abuses that feature, you might end up with that program, and it would be irksome to deal with it? I'm not trying to present a straw man, but that is my genuine impression of what you said. If I got it wrong, I apologize and please help me understand what you meant.
Another argument is that even syncing your data doesn't mean that the data is actually written to disk, since the hardware can lie. On the other hand, I don't know what anyone can do, not even the kernel, in the face of deceitful hardware.
Aye.
But in principle, after a sync() or fsync() the kernel at least believes that. Hardware which lies, or which claims saved data without having the resources to guarantee it (eg a small battery to complete the writes if there's a power out) is indeed nasty.
You might be tempted to argue that this can be done very easily in Python already, so why include it in the standard io module?
I would indeed. There _should_ be a small bar which at least causes the programmer to think "do I really need this here"? I suppose a "fsync=False" default parameter is a visible bar.
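As a rough illustration of that "visible bar", here is a minimal sketch of an opt-in keyword. The name open_fsync and its signature are hypothetical; nothing like it exists in the io module, and this is not the proposed API. It assumes that a block which raises should skip the sync and only close the file.

```python
import os
from contextlib import contextmanager


@contextmanager
def open_fsync(path, mode="w", *, fsync=False, **kwargs):
    """Hypothetical wrapper around open(); not part of the io module."""
    f = open(path, mode, **kwargs)
    try:
        yield f
        if fsync:
            # Push Python's own buffer to the OS, then ask the OS to
            # commit this one file. Skipped if the block raised.
            f.flush()
            os.fsync(f.fileno())
    finally:
        # close() always runs last, after the optional flush + fsync.
        f.close()


# Usage (the path is a placeholder):
# with open_fsync("notes.txt", "w", fsync=True) as f:
#     f.write("stuff")
```

With fsync left at its default of False this behaves like a plain open() used as a context manager, so the cost is only paid where the caller asks for it.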
[...]
I mean, the obvious way is:
    try:
        with open(..., 'w') as f:
            f.write("stuff")
    finally:
        os.sync()
An os.fsync(f.fileno()) is lower impact - os.sync() requests a sync of all filesystems.
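For comparison, a sketch of the snippet above narrowed to a single file. The helper name write_and_fsync and its parameters are made up for illustration. The f.flush() call matters: os.fsync() only acts on data the OS has already received, so Python's own buffer has to be emptied first.

```python
import os


def write_and_fsync(path, text):
    """Write text to path and sync only that file (hypothetical helper)."""
    with open(path, "w") as f:
        f.write(text)
        f.flush()             # Python's buffer -> OS
        os.fsync(f.fileno())  # ask the OS to commit this file only
```

Whether the data then truly reaches the medium still depends on the hardware being honest, as discussed above.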
so maybe all we really need is a "sync file" context manager.
Aye. Fully agree here, and frankly think this is a "write your own" situation. Except, of course, that like all "write your own" one/few liners there will be suboptimal or buggy ones released. Such as the "overly wide sync" from your os.sync() above.
Personally I'm -1 on this. A context manager which goes:
    f.flush()
    os.fsync(f.fileno())
seems plenty, and easy to roll your own.
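One possible shape for such a roll-your-own context manager, as a sketch only. The name fsync_on_exit is invented here. A with statement exits its managers in reverse order of entry, so listing it after open() in the same with statement makes the flush and fsync run before the file is closed.

```python
import os
from contextlib import contextmanager


@contextmanager
def fsync_on_exit(f):
    """Flush and fsync an already-open file object when the block ends."""
    yield f
    # Reached only if the block did not raise.
    f.flush()
    os.fsync(f.fileno())


# Usage (the path is a placeholder). Exit order is fsync_on_exit first,
# then open()'s close:
# with open("notes.txt", "w") as f, fsync_on_exit(f):
#     f.write("stuff")
```

The ordering is correct only when the two managers are written in this order within one with statement (or nested the equivalent way). That is a convention the caller has to follow, not something the helper can enforce.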
There are very smart people on this list who have already demonstrated that there is more than one way to do it, and that it's not obvious. So, it's not easy to roll your own correctly.
I love context managers when they're alone, but I dislike stacking them. It is less clear how we can ensure the fsync happens exactly between flush and close with a context manager than with a keyword argument to open. That is, if open is the only context manager, everything is great. But if it is up to users to stack context managers including open and some fsync, I think correct ordering will be a problem.
Thank you for engaging on this topic.