On 25Dec2020 09:29, Steven D'Aprano firstname.lastname@example.org wrote:
On Thu, Dec 24, 2020 at 12:15:08PM -0500, Michael A. Smith wrote:
With all the buffering that modern disks and filesystems do, a specific question has come up a few times with respect to whether or not data was actually written after flush. I think it would be pretty useful for the standard library to have a variant in the io module that would explicitly fsync on close.
One argument against this idea is that "disks and file systems buffer for a reason, you should trust them, explicitly calling sync after every written file is just going to slow I/O down".
Personally I don't believe this argument; I've been bitten many, many times until I learned to explicitly sync files, but it's an argument you should counter.
By contrast, I support this argument. The _vast_ majority of things don't need to sync their data all the way to the hardware base substrate (eg magnetic fields on spinning rust).
And on the whole, if I do care, I issue a single sync() call at the end of a large task (typically interactively, at a prompt!) rather than forcing a heap of performance impairing stutters all the way through some process because many per-file syncs force that.
IMO, per-file syncs fall into the "policy" arena: aside from low level tools (example: fdisk, a disc partition editor), to my mind the purpose of the kernel is to accept responsibility for my data when I hand it off.
Perhaps for you that isn't enough; for me it normally is. And when it isn't, I'll take steps myself, _outside_ the programme, to ensure the sync or commit or off site backup is complete when it matters. Thus the policy is in my hands.
A tool which forces a per-file sync on every close, or even after every write, is a performance killer. The faster our hardware, the less that may seem to matter (and, conversely, the less the risk, as the ordinary kernel I/O flushing will catch up faster). But when the hardware slowness _is_ relevant, if I can't turn that off I have a needlessly slow task.
The example which stands out in my own mind is when I was using firefox on a laptop with a spinning rust hard drive (and, being laptop hardware, a low power, physically slow piece of spinning rust). There was once a setting to turn off the synchronous-write mode of sqlite (used for history and bookmarks). The cost of those synchronous writes was _visibly obvious_ in the user experience. And I turned it off. As a matter of policy, those data didn't need such care.
So I'm resistant to this kind of thing because IMO it leads to an attractive nuisance: overuse of sync or fsync for everything. And it will usually not be exposed as policy the user can adjust/disable.
My rule of thumb:
If it can't be turned off, it's not a feature. - Karl Heuer
Another argument is that even syncing your data doesn't mean that the data is actually written to disk, since the hardware can lie. On the other hand, I don't know what anyone can do, not even the kernel, in the face of deceitful hardware.
But in principle, after a sync() or fsync() the kernel at least believes that the data have been written. Hardware which lies, or which claims saved data without having the resources to guarantee it (eg a small battery to complete the writes if there's a power outage) is indeed nasty.
You might be tempted to argue that this can be done very easily in Python already, so why include it in the standard io module?
I would indeed. There _should_ be a small bar which at least causes the programmer to think "do I really need this here"? I suppose a "fsync=False" default parameter is a visible bar.
I mean, the obvious way is:
    try:
        with open(..., 'w') as f:
            f.write("stuff")
    finally:
        os.sync()
An os.fsync(f.fileno()) is lower impact - os.sync() requests a sync of all filesystems.
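As a rough sketch of that narrower form (the helper name write_synced and the temporary path are made up for illustration, and this only asks the kernel to commit the one file; it does nothing about hardware which lies):

```python
import os
import tempfile

def write_synced(path, text):
    """Hypothetical helper: write text to path, then sync only that file."""
    with open(path, "w") as f:
        f.write(text)
        f.flush()             # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())  # ask the kernel to commit this file only

# Illustrative usage against a throwaway temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
write_synced(path, "stuff")
```

The f.flush() matters: os.fsync() works on the file descriptor, so anything still sitting in Python's own buffer would otherwise not be covered.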
so maybe all we really need is a "sync file" context manager.
Aye. Fully agree here, and frankly think this is a "write your own" situation. Except, of course, that like all "write your own" one/few liners there will be suboptimal or buggy ones released. Such as the "overly wide sync" from your os.sync() above.
Personally I'm -1 on this. A context manager which does f.flush() followed by os.fsync(f.fileno()) seems plenty, and is easy to roll your own.
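A minimal sketch of such a roll-your-own context manager, assuming the hypothetical name synced_open and the choice to skip the sync when the body raises (neither is in the stdlib, and others may prefer to sync regardless):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def synced_open(path, mode="w", **kwargs):
    """Hypothetical helper: open a file and flush + fsync it before closing."""
    with open(path, mode, **kwargs) as f:
        yield f
        # Reached only if the body of the with statement did not raise;
        # on an exception the file is still closed, but not synced.
        f.flush()
        os.fsync(f.fileno())

# Illustrative usage against a throwaway temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with synced_open(demo_path) as f:
    f.write("stuff")
```

Because the caller has to reach for synced_open instead of open, the "do I really need this here?" bar stays in place, and the sync stays per-file rather than filesystem-wide.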
Cheers, Cameron Simpson email@example.com