On 28 November 2015 at 04:42, Andrew Barnert via Python-ideas wrote:
On Nov 27, 2015, at 10:32, Paul Moore firstname.lastname@example.org wrote:
On 26 November 2015 at 23:22, Erik email@example.com wrote:
I have studied the PEP, followed a lot of the references and looked at the 3.5.0 implementation. I can't see that I've missed such a thing already existing, but it's possible. If so, perhaps this is instead a request to make that thing more obvious somehow!
Does pathlib use scandir? If so, then maybe you get the caching benefits by using pathlib? And if pathlib doesn't use scandir, maybe it should? [I just checked, it looks like pathlib doesn't use scandir :-(]
Does pathlib even have a walk equivalent? (I know it has glob('**'), but that's not the same thing.)
Or are you suggesting that people should use path.iterdir with explicit recursion (or an explicit stack), and therefore just changing iterdir to use scandir (and prefill as many cached attribs as possible in each result) is what we want?
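For reference, the "iterdir with an explicit stack" approach might look roughly like the sketch below. The name iter_tree is hypothetical, not an existing pathlib API, and the choice not to descend into symlinked directories is an assumption made here to avoid loops.

```python
from pathlib import Path


def iter_tree(root):
    """Yield every path below root, using an explicit stack instead of recursion."""
    stack = [Path(root)]
    while stack:
        directory = stack.pop()
        for child in directory.iterdir():
            yield child
            # Each is_dir()/is_symlink() call queries the filesystem again;
            # Path objects do not cache stat results.
            if child.is_dir() and not child.is_symlink():
                stack.append(child)
```

Note that every type check here goes back to the filesystem, which is the per-entry cost that scandir can avoid.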
The main problem with having pathlib do any caching at all is that caching the results of stat calls implicitly in any context is a recipe for significant confusion, since you're at the mercy of race conditions as the filesystem changes out from underneath you. There also isn't an obviously "right" answer in the general case for cache invalidation, as in some cases you're interested in the file as it was when you originally opened it, and don't care if it got swapped out from underneath you, while in others you're interested in the file path, and want the filesystem details for right now, not the details from a few seconds ago.
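To make the staleness concern concrete, here is a small sketch using os.scandir, whose DirEntry objects do cache stat results. The helper name and the file name are made up for illustration; the point is only that a cached answer keeps describing the file as it was, while os.stat() describes it as it is now.

```python
import os
import tempfile


def stale_and_fresh_sizes():
    """Return (cached size, cached size after a write, fresh size) for one file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.txt")
        with open(path, "w") as f:
            f.write("hello")  # 5 bytes on disk

        # list() exhausts the iterator, so the directory handle is released.
        entry = list(os.scandir(tmp))[0]
        cached_before = entry.stat().st_size  # result is cached on the DirEntry

        with open(path, "a") as f:
            f.write(" world")  # file is now 11 bytes

        cached_after = entry.stat().st_size    # still the cached answer
        fresh = os.stat(entry.path).st_size    # asks the filesystem again
        return cached_before, cached_after, fresh
```

Neither answer is wrong in itself; which one you want depends on whether you care about the file as it was when listed or the path as it is right now.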
For os.scandir(), we just delegate the behaviour to the underlying filesystem APIs - how readdir() and FindNextFile react to the directory contents changing during iteration is OS-defined, and Python will inherit that variation (and may miss newly added files as a result).
The current os.walk() implementation constrains the scope of the scandir() filesystem state caching, since it doesn't let the DirEntry objects escape outside the generator - there's no need to ask yourself "What's the risk of stale filesystem data here?", since you're not getting access to the cached info in the first place, and hence always need to go query the filesystem directly.
This is a fairly universal pattern: for a given application you can likely figure out what to cache and when to invalidate it, even though those are unanswerable questions in the general case. Another example of that would be the stat caches in the current implementation of the import system, together with the corresponding need to call importlib.invalidate_caches() if you want to make sure the import system can see a module that was only just written to disk.
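As a rough illustration of the import system case, the sketch below writes a module to disk and then imports it straight away. The helper name import_freshly_written is hypothetical, and the temporary directory on sys.path is just a convenient assumption for the demo; importlib.invalidate_caches() and importlib.import_module() are the real APIs.

```python
import importlib
import os
import sys
import tempfile


def import_freshly_written(module_name, source):
    """Write source to a new module file and import it immediately."""
    with tempfile.TemporaryDirectory() as tmp:
        sys.path.insert(0, tmp)
        try:
            with open(os.path.join(tmp, module_name + ".py"), "w") as f:
                f.write(source)
            # A path finder that has already cached a directory listing
            # may not notice the new file until its cache is invalidated.
            importlib.invalidate_caches()
            return importlib.import_module(module_name)
        finally:
            sys.path.remove(tmp)
```

Here the application knows exactly when the cache goes stale (right after it writes the file), which is what makes the invalidation question answerable.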
That's not to say that a general purpose directory walking utility producing DirEntry objects isn't an interesting prospect. Rather, it's an attempt to highlight that this is an area where there may be a significant gulf between "works for my use case" and "is a suitable addition to the standard library", particularly since this can now be a pure Python recipe atop os.scandir.
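A minimal version of such a recipe, assuming you simply want every entry below a root and are happy to take responsibility for any stale data on the DirEntry objects yourself, could look like this (scantree is a hypothetical name, not a standard library function):

```python
import os


def scantree(root):
    """Yield DirEntry objects for everything below root, recursively."""
    for entry in os.scandir(root):
        yield entry
        # follow_symlinks=False avoids descending into symlinked directories,
        # and can usually be answered from the directory listing itself.
        if entry.is_dir(follow_symlinks=False):
            yield from scantree(entry.path)
```

Unlike os.walk(), this hands the DirEntry objects (and whatever they have cached) to the caller, so deciding how long that information can be trusted becomes the caller's problem.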
-- Nick Coghlan | firstname.lastname@example.org | Brisbane, Australia