PEP 428: stat caching undesirable?
Hi all, I write as a python lover for over 13 years who's always wanted something like PEP 428 in Python. I am concerned about the caching of stat() results as currently defined in the PEP. This means that all behaviour built on top of stat(), such as p.is_dir(), p.is_file(), p.st_size and the like can indefinitely hold on to stale data until restat() is called, and I consider this confusing. Perhaps in recognition of this, p.exists() is implemented differently, and it does restat() internally (although the PEP does not document this). If this behaviour is maintained, then at the very least this makes the API more complicated to document: some calls cache as a side effect, others update the cache as a side effect, and others, such as lstat(), don't cache at all. This also introduces a divergence of behaviour between os.path.isfile() and p.is_file(), that is confusing and will also need to be documented. I'm concerned about scenarios like users of the library polling, for example, for some file to appear, and being confused about why the arguably more sloppy poll for p.exists() works while a poll for p.is_file(), which expresses intent better, never terminates. In theory the caching mechanism could be further refined to only hold onto cached results for a limited amount of time, but I would argue this is unnecessary complexity, and caching should just be removed, along with restat(). Isn't the whole notion that stat() need to be cached for performance issues somewhat of a historical relic of older OS's and filesystem performance? AFAIK linux already has stat() caching as a side-effect of the filesystem layer's metadata caching. How does Windows and Mac OS fare here? Are there benchmarks proving that this is serious enough to complicate the API? If the ability to cache stat() calls is deemed important enough, how about a different API where is_file(), is_dir() and the like are added as methods on the result object that stat() returns? Then one can hold onto a stat() result as a temporary object and ask it multiple questions without doing another OS call, and is_file() etc. on the Path object can be documented as being forwarders to the stat() result just as p.st_size is currently - except that I believe they should forward to a fresh, uncached stat() call every time. I write directly to this list instead raising it to Antoine Pitrou in private just because I don't want to make extra work for him to first receive my feedback and the re-raise it on this list. If this is wrong or disrespectful, I apologize. -- Pieter Nagel
On Wed, May 1, 2013 at 5:32 PM, Pieter Nagel
Isn't the whole notion that stat() need to be cached for performance issues somewhat of a historical relic of older OS's and filesystem performance? AFAIK linux already has stat() caching as a side-effect of the filesystem layer's metadata caching. How does Windows and Mac OS fare here? Are there benchmarks proving that this is serious enough to complicate the API?
System calls typically release the GIL in threaded code (due to the possibility the underlying filesystem may be a network mount), which ends up being painfully expensive. The repeated stat calls also appear to be one of the main reasons walkdir is so much slower than performing the same operations in a loop rather than using a generator pipeline as walkdir does (see http://walkdir.readthedocs.org), although I admit it was a year or two ago I made those comparisons, and it wasn't the most scientific of benchmarking efforts. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Hi,
On Wed, 01 May 2013 09:32:28 +0200
Pieter Nagel
Hi all,
I write as a python lover for over 13 years who's always wanted something like PEP 428 in Python.
I am concerned about the caching of stat() results as currently defined in the PEP. This means that all behaviour built on top of stat(), such as p.is_dir(), p.is_file(), p.st_size and the like can indefinitely hold on to stale data until restat() is called, and I consider this confusing.
I understand it might be confusing. On the other hand, if you call is_dir() then is_file() (then perhaps another metadata-reading operation), and there's a race condition where the file is modified in-between, you could have pretty much nonsensical results, if not for the caching.
Isn't the whole notion that stat() need to be cached for performance issues somewhat of a historical relic of older OS's and filesystem performance? AFAIK linux already has stat() caching as a side-effect of the filesystem layer's metadata caching. How does Windows and Mac OS fare here? Are there benchmarks proving that this is serious enough to complicate the API?
Surprisingly enough, some network filesystems have rather bad stat() performance. This has been reported for years as an issue with Python's import machinery, until 3.3 added a caching scheme where stat() calls are no more issued for each and every path directory and each and every imported module. But as written above caching is also a matter of functionality. I'll let other people chime in.
If the ability to cache stat() calls is deemed important enough, how about a different API where is_file(), is_dir() and the like are added as methods on the result object that stat() returns?
That's a good idea too. It isn't straightforward since os.stat() is implemented in C. Regards Antoine.
Antoine and Nick have convinced me that stat() calls can be a performance issue. I am still concerned about the best way to balance that against the need to keep the API simple, though. I'm still worried about the current behaviour that some path can answer True to is_file() in a long-running process just because it had been a file last week. In my experience there are use cases where most stat() calls one makes (including indirectly via is_file() and friends) want up-to-date data. There is also the risk of obtaining a Path object that already had its stat() value cached some time ago without your knowledge (i.e. if the Path was created for you by a walkdir type function that in its turn also called is_file() before returning the result). And needing to precede each is_file() etc. call with a restat() call whose return value is not even used introduces undesirable temporal coupling between the restat() and is_file() call. I see a few alternative solution, not mutually exclusive: 1) Change the signature of stat(), and everything that indirectly uses stat(), to take an optional 'fresh' keyword argument (or some synonym). Then stat(fresh=True) becomes synonymous with the current restat(), and the latter can be removed. Queries like is_file(fresh=True) will be implemented by forwarding fresh to the underlying stat() call they are implemented on. What the default for 'fresh' should be, can be debated, but I'd argue for the sake of naive code that fresh should default to True, and then code that is aware of stat() caching can use fresh=False as required. 2) The root of the issue is keeping the cached stat() value indefinitely. Therefore, limit the duration for which the cached value is valid. The challenge is to find a way to express how long the value should be cached, without needing to call time.monotonic() or the like that presumable are also OS calls that will release the GIL. One way would be to compute the number of virtual machine instructions executed since the stat() call was cached, and set the limit there. Is that still possible, now that sys.setcheckinterval() has been gutted? 3) Leave it up to performance critical code, such as the import machinery, or walkdirs that Nick mentioned, to do their own caching, and simplify the filepath API for the simple case. But one can still make life easier for code like that, by adding is_file() and friends on the stat result object as I suggested. But this almost sounds like a PEP of its own, because although pahtlib will benefit by it, it is actually an orthogonal issue. It raises all kinds of issues: should the signature be statresult.isfile() to match os.path, or statresult.is_file() to match PEP 428? -- Pieter Nagel
3) Leave it up to performance critical code, such as the import machinery, or walkdirs that Nick mentioned, to do their own caching, and simplify the filepath API for the simple case.
But one can still make life easier for code like that, by adding is_file() and friends on the stat result object as I suggested.
+1 from me. PEP 428 goes in the right direction with a distinction between "pure" path and "concrete" path. Pure path support syntactic operations, whereas I would expect concrete paths to actually access the file system. Having a method like restat() is a hint that something's wrong, I'm convinced this will bite some people. I'm also be in favor of having a wrapper class around os.stat() result which would export utility methods such as is_file()/is_directory() and owner/group, etc attributes. That way, the default behavior would be correct, and this helper class would make it easier for users like walkdir() to implement their own caching. As an added benefit, this would make path objects actually immutable, which is always a good thing (simpler, and you get thread-safety for free).
On 1 May 2013 22:58, "Charles-François Natali"
3) Leave it up to performance critical code, such as the import machinery, or walkdirs that Nick mentioned, to do their own caching, and simplify the filepath API for the simple case.
But one can still make life easier for code like that, by adding is_file() and friends on the stat result object as I suggested.
+1 from me. PEP 428 goes in the right direction with a distinction between "pure" path and "concrete" path. Pure path support syntactic operations, whereas I would expect concrete paths to actually access the file system. Having a method like restat() is a hint that something's wrong, I'm convinced this will bite some people.
I'm also be in favor of having a wrapper class around os.stat() result which would export utility methods such as is_file()/is_directory() and owner/group, etc attributes.
That way, the default behavior would be correct, and this helper class would make it easier for users like walkdir() to implement their own caching.
Walkdir is deliberately built as a decoupled pipeline modelled on os.walk - the only way it can really benefit from caching without destroying the API is if the caching is built into the underlying path objects that are then passed through the pipeline stages. However, I like the idea of a rich "stat" object, with "path.stat()" and "path.cached_stat()" accessors on the path objects. Up to date data by default, easy caching for use cases that need it without needing to pass the stat data around separately. Cheers, Nick.
As an added benefit, this would make path objects actually immutable, which is always a good thing (simpler, and you get thread-safety for free). _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe:
http://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
I've not got the full context, but I would like to make it *very* clear in the API (e.g. through naming of the methods) when you are getting a possibly cached result from stat(), and I would be very concerned if existing APIs were going to get caching behavior. For every use cases that benefits from caching there's a complementary use case that caching breaks. Since both use cases are important we must offer both APIs, in a way that makes it clear to even the casual reader of the code what's going on. -- --Guido van Rossum (python.org/~guido)
Am 01.05.2013 16:39, schrieb Guido van Rossum:
I've not got the full context, but I would like to make it *very* clear in the API (e.g. through naming of the methods) when you are getting a possibly cached result from stat(), and I would be very concerned if existing APIs were going to get caching behavior. For every use cases that benefits from caching there's a complementary use case that caching breaks. Since both use cases are important we must offer both APIs, in a way that makes it clear to even the casual reader of the code what's going on.
I deem caching of stat calls as problematic. The correct and contemporary result of a stat() call has security implications, too. For example stat() is used to prevent TOCTOU race conditions such as [1]. Caching is useful but I would prefer explicit caching rather than implicit and automatic caching of stat() results. We can get a greater speed up for walkdir() without resorting to caching, too. Some operating systems and file system report the file type in the dirent struct that is returned by readdir(). This reduces the number of stat calls to zero. Christian [1] https://www.securecoding.cert.org/confluence/display/seccode/POS01-C.+Check+...
We can get a greater speed up for walkdir() without resorting to
caching, too. Some operating systems and file system report the file type in the dirent struct that is returned by readdir(). This reduces the number of stat calls to zero.
Yes, definitely. This is exactly what my os.walk() replacement, "Betterwalk", does: https://github.com/benhoyt/betterwalk#readme On Windows you get *all* stat information from iterating the directory entries (FindFirstFile etc). And on Linux most of the time you get enough for os.walk() not to need an extra stat (though it does depend on the file system). I still hope to clean up Betterwalk and make a C version so we can use it in the standard library. In many cases it speeds up os.walk() by several times, even an order of magnitude in some cases. I intend for it to be a drop-in replacement for os.walk(), just faster. -Ben
On 2 May 2013 02:22, "Christian Heimes"
Am 01.05.2013 16:39, schrieb Guido van Rossum:
I've not got the full context, but I would like to make it *very* clear in the API (e.g. through naming of the methods) when you are getting a possibly cached result from stat(), and I would be very concerned if existing APIs were going to get caching behavior. For every use cases that benefits from caching there's a complementary use case that caching breaks. Since both use cases are important we must offer both APIs, in a way that makes it clear to even the casual reader of the code what's going on.
I deem caching of stat calls as problematic. The correct and contemporary result of a stat() call has security implications, too. For example stat() is used to prevent TOCTOU race conditions such as [1]. Caching is useful but I would prefer explicit caching rather than implicit and automatic caching of stat() results.
We can get a greater speed up for walkdir() without resorting to caching, too. Some operating systems and file system report the file type in the dirent struct that is returned by readdir(). This reduces the number of stat calls to zero.
While I agree exposing dirent in some manner is desirable, note that I'm not talking about os.walk itself, but the generator pipeline library I built around it in an attempt to break up monolithic directory walking loops into reusable components. Once you get out of the innermost generator, the only state passed through each stage is the path information (and the directory descriptor if using os.fwalk). Upgrading walkdir from simple strings to path objects would be relatively straightforward, but you can't change the API too much before it isn't similar to os.walk any more. The security issues only come into play in the outer loop which actually tries to *do* something with the pipeline output. However, even that case should involve at most two stat calls: one inside the pipeline (cached per iteration) and then a more timely one in the outer loop (assuming using os.fwalk as the base loop instead of os.walk doesn't already cover it). Cheers, Nick.
Christian
[1]
https://www.securecoding.cert.org/confluence/display/seccode/POS01-C.+Check+...
Yes, definitely. This is exactly what my os.walk() replacement, "Betterwalk", does: https://github.com/benhoyt/betterwalk#readme
On Windows you get *all* stat information from iterating the directory entries (FindFirstFile etc). And on Linux most of the time you get enough for os.walk() not to need an extra stat (though it does depend on the file system).
I still hope to clean up Betterwalk and make a C version so we can use it in the standard library. In many cases it speeds up os.walk() by several times, even an order of magnitude in some cases. I intend for it to be a drop-in replacement for os.walk(), just faster.
Actually, there's Gregory's scandir() implementation (returning a generator to be able to cope with large directories) on it's way: http://bugs.python.org/issue11406 It's already been suggested to make it return a tuple (with d_type). I'm sure a review of the code (especially the Windows implementation) will be welcome.
On Wed, 2013-05-01 at 23:54 +1000, Nick Coghlan wrote:
However, I like the idea of a rich "stat" object, with "path.stat()" and "path.cached_stat()" accessors on the path objects.
Since it seems there is some support for my proposal, I just posted to python-ideas to get an idea how much support there would be for such a PEP. -- Pieter Nagel
Actually, there's Gregory's scandir() implementation (returning a generator to be able to cope with large directories) on it's way:
http://bugs.python.org/issue11406
It's already been suggested to make it return a tuple (with d_type). I'm sure a review of the code (especially the Windows implementation) will be welcome.
Ah, thanks for the pointer, I hadn't seen that. Definitely looks like I should "merge" Betterwalk with it. I'll see if I can spend some time on it again soon. I'd love to see scandir/iterdir go into Python 3.4, and I'd be very chuffed if iterdir_stat got in, because that's the one that can really start speeding up operations like os.walk(). -Ben
participants (7)
-
Antoine Pitrou
-
Ben Hoyt
-
Charles-François Natali
-
Christian Heimes
-
Guido van Rossum
-
Nick Coghlan
-
Pieter Nagel