On Thu, Dec 24, 2020 at 12:15:08PM -0500, Michael A. Smith wrote:
With all the buffering that modern disks and filesystems do, a specific question has come up a few times with respect to whether or not data was actually written after flush. I think it would be pretty useful for the standard library to have a variant in the io module that would explicitly fsync on close.
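For context, the behaviour being proposed can already be approximated in pure Python today; this is a minimal sketch (the filename and payload are placeholders):

```python
import os

# Write a file and force its contents to stable storage before closing.
with open("data.txt", "w") as f:
    f.write("stuff")
    f.flush()              # push Python's userspace buffer to the OS
    os.fsync(f.fileno())   # ask the OS to flush its cache to the device
```

The point of the proposal is to fold the flush + fsync pair into close itself, so callers can't forget one of the two steps.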
One argument against this idea is that "disks and file systems buffer for a reason, you should trust them, explicitly calling sync after every written file is just going to slow I/O down".
Personally, I don't believe this argument: I've been bitten many, many times until I learned to sync files explicitly. But it's an argument you should counter.
Another argument is that even syncing your data doesn't mean that the data is actually written to disk, since the hardware can lie. On the other hand, I don't know what anyone, not even the kernel, can do in the face of deceitful hardware.
You might be tempted to argue that this can be done very easily in Python already, so why include it in the standard io module?
- It seems to me that it would be better to do this in C, so that
folks who need to choose consistency over performance don't have to sacrifice any additional performance.
The actual I/O is surely going to outweigh the cost of calling sync from Python.
This sounds like a trivial micro-optimization for small files, and an undetectable one for large files on slow media. If you save a dozen microseconds when syncing a two gigabyte file written to a USB-2 stick, the sync might take four or five minutes. Are you even going to notice the difference?
I think you need to show benchmarks before claiming that this needs to be in C.
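A rough micro-benchmark along these lines is easy to sketch; timings will vary wildly by device and filesystem, and the 1 MiB size here is an arbitrary choice:

```python
import os
import tempfile
import time

def time_write(n_bytes, do_fsync):
    """Time writing n_bytes to a temp file, optionally fsync'ing first."""
    data = b"x" * n_bytes
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
        start = time.perf_counter()
        f.write(data)
        f.flush()
        if do_fsync:
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    os.unlink(path)
    return elapsed

plain = time_write(1 << 20, do_fsync=False)
synced = time_write(1 << 20, do_fsync=True)
print(f"plain: {plain:.6f}s  fsync: {synced:.6f}s")
```

On most hardware the fsync'd write dominates by orders of magnitude, which is why the few microseconds of Python-call overhead are unlikely to matter.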
- Having it in the io library will call attention to this issue,
which I think is something a lot of folks don't consider. Assuming that `close` or `flush` are sufficient for consistency has always been wrong (at its limits), but it was less likely to be a stumbling block in the past, when buffering was less aggressive and less layered, and sustained high-volume data streams were a more niche concern.
I don't know, I wonder whether burying it in the io library will make it disappear.
Perhaps a "sync on close" keyword argument to open? At least then it is always available and easily discoverable.
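A "sync on close" keyword to open() would need C-level changes, but the shape of it can be sketched as a wrapper; the name `open_synced` is made up for illustration:

```python
import contextlib
import os

@contextlib.contextmanager
def open_synced(path, mode="w", **kwargs):
    """open() variant that fsyncs the file's data before closing it.

    A sketch of the proposed "sync on close" behaviour, not a real API.
    """
    f = open(path, mode, **kwargs)
    try:
        yield f
        f.flush()              # drain Python's buffer to the OS
        os.fsync(f.fileno())   # then the OS cache to the device
    finally:
        f.close()

with open_synced("out.txt") as f:
    f.write("stuff")
```

Note this only fsyncs on the success path; whether a failed write should still attempt a sync is one of the design questions a real API would have to settle.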
- There are many ways to do this, and I think several of them could
be subtly incorrect.
Can you elaborate?
I mean, the obvious way is:
try:
    with open(..., 'w') as f:
        f.write("stuff")
finally:
    os.sync()
so maybe all we really need is a "sync file" context manager.
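Such a context manager might look like this sketch (the name `synced` is invented); unlike os.sync(), which flushes every filesystem, os.fsync() targets just the one file:

```python
import contextlib
import os

@contextlib.contextmanager
def synced(f):
    """Ensure an open file object's data reaches the device on exit."""
    try:
        yield f
    finally:
        f.flush()              # userspace buffer -> OS
        os.fsync(f.fileno())   # OS cache -> device

# Contexts exit innermost-first, so the fsync runs while f is still open,
# and the outer `with open(...)` then closes the file.
with open("notes.txt", "w") as f, synced(f):
    f.write("stuff")
```

Wrapping an already-open file keeps ownership with the caller, which is a slightly different design than baking the sync into open() or close().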