Re: [Python-ideas] PEP471 - (os.scandir())

30 Nov 2015


On Nov 29, 2015, at 22:45, Nick Coghlan wrote:
...
On 28 November 2015 at 04:42, Andrew Barnert via Python-ideas wrote:
...
On Nov 27, 2015, at 10:32, Paul Moore wrote:
...
...
On 26 November 2015 at 23:22, Erik wrote:
I have studied the PEP, followed a lot of the references and looked at the
3.5.0 implementation. I can't see that I've missed such a thing already
existing, but it's possible. If so, perhaps this is instead a request to
make that thing more obvious somehow!
Does pathlib use scandir? If so, then maybe you get the caching
benefits by using pathlib? And if pathlib doesn't use scandir, maybe
it should? [I just checked, it looks like pathlib doesn't use scandir
:-(]
Does pathlib even have a walk equivalent? (I know it has glob('**'), but that's not the same thing.)
Or are you suggesting that people should use path.iterdir with explicit recursion (or an explicit stack), and therefore just changing iterdir to use scandir (and prefill as many cached attribs as possible in each result) is what we want?
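For reference, the "iterdir with an explicit stack" approach asked about above might look roughly like the sketch below. The name iter_tree is made up for this illustration and is not part of pathlib; each is_dir() call here typically goes back to the filesystem rather than reusing anything cached.

```python
from pathlib import Path


def iter_tree(root):
    """Yield every path below root, using an explicit stack instead of recursion.

    Hypothetical helper for illustration only; not a pathlib API.
    """
    stack = [Path(root)]
    while stack:
        current = stack.pop()
        for child in current.iterdir():
            yield child
            # Skip symlinked directories so a link cycle cannot loop forever.
            if child.is_dir() and not child.is_symlink():
                stack.append(child)
```

Whether iterdir is backed by scandir or listdir only changes how many stat calls this costs, not what it yields.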
The main problem with having pathlib do any caching at all is that
caching the results of stat calls implicitly in any context is a
recipe for significant confusion, since you're at the mercy of race
conditions as the filesystem changes out from underneath you. There
also isn't an obviously "right" answer in the general case for cache
invalidation, as in some cases you're interested in the file as it was
when you originally opened it, and don't care if it got swapped out
from underneath you, while in others you're interested in the file
path, and want the filesystem details for right now, not the details
from a few seconds ago.
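To make the staleness concern concrete, here is a small sketch using the existing DirEntry.stat() caching. The helper name and file name are invented for the example; the point is only that the cached answer and a fresh os.stat() can disagree once the file changes.

```python
import os
import tempfile


def cached_vs_fresh_size(directory, name):
    """Return (cached_size, fresh_size) for a file modified after scandir() saw it.

    Hypothetical helper for illustration only.
    """
    path = os.path.join(directory, name)
    with open(path, "w"):
        pass  # create an empty file
    with os.scandir(directory) as it:
        entry = next(e for e in it if e.name == name)
    entry.stat()  # populates the DirEntry's stat cache (if not already filled)
    with open(path, "w") as f:
        f.write("hello")  # the file changes underneath the cached entry
    # entry.stat() still reports what was cached; os.stat() asks the OS again.
    return entry.stat().st_size, os.stat(path).st_size


with tempfile.TemporaryDirectory() as d:
    cached, fresh = cached_vs_fresh_size(d, "example.txt")
    print(cached, fresh)
```

Neither answer is wrong as such; which one you want depends on whether you care about the file as it was or the path as it is now.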
For os.scandir(), we just delegate the behaviour to the underlying
filesystem APIs - how readdir() and FindNextFile react to the
directory contents changing during iteration is OS defined, and Python
will inherit that variation (and may miss newly added files as a
result).
The current os.walk() implementation constrains the scope of the
scandir() filesystem state caching, since it doesn't let the DirEntry
objects escape outside the generator
That's a good point. The fts functions that both BSD and GNU use to replace ftw and the various other old *nix filesystem walk functions deal with this carefully; the short version is that you have to explicitly copy any information from the current file that you want to keep around after moving on to the next file, which makes it hard to confuse yourself about how up-to-date it is. (You can also go a directory at a time, more like os.walk, but with the same basic restriction: once you go to the next directory, the previous list of file entries is invalid.)
...
- there's no need to ask yourself
"What's the risk of stale filesystem data here?", since you're not
getting access to the cached info in the first place, and hence always
need to go query the filesystem directly.
This is a fairly universal pattern: for a given *application* you can
likely figure out what to cache and when to invalidate it, even though
those are unanswerable questions in the general case. Another example
of that would be the stat caches in the current implementation of the
import system, together with the corresponding need to call
importlib.invalidate_caches() if you want to make sure the import
system can see a module that was only just written to disk.
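A minimal sketch of the import-system case mentioned above: a module is written to disk and imported immediately, with importlib.invalidate_caches() called in between. The helper name and the module name are made up for the example.

```python
import importlib
import os
import sys
import tempfile


def import_fresh_module(directory, name, source):
    """Write name.py into directory and import it straight away.

    Hypothetical helper; assumes directory is already on sys.path.
    """
    with open(os.path.join(directory, name + ".py"), "w") as f:
        f.write(source)
    # The path finders may have cached this directory's contents before the
    # file existed; dropping the caches makes the new module visible.
    importlib.invalidate_caches()
    return importlib.import_module(name)


with tempfile.TemporaryDirectory() as d:
    sys.path.insert(0, d)
    try:
        mod = import_fresh_module(d, "just_written_example", "VALUE = 42\n")
        print(mod.VALUE)
    finally:
        sys.path.remove(d)
        sys.modules.pop("just_written_example", None)
```

Without the invalidate_caches() call the import may still succeed, depending on timing and directory mtime granularity, which is exactly why the explicit call is the documented way to be sure.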
That's not to say that a general purpose directory walking utility
producing DirEntry objects isn't an interesting prospect. Rather, it's
an attempt to highlight that this is an area where there may be a
significant gulf between "works for my use case" and "is a suitable
addition to the standard library", particularly since this can now be
a pure Python recipe atop os.scandir.
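As a rough idea of what such a pure Python recipe could look like, here is a sketch that yields DirEntry objects recursively. The name scan_tree and its follow_symlinks parameter are assumptions for the example, not an existing os function, and the caller inherits all of the cached-data caveats discussed above.

```python
import os


def scan_tree(top, follow_symlinks=False):
    """Recursively yield os.DirEntry objects for everything below top.

    Hypothetical recipe for illustration; error handling is omitted.
    """
    # Read the whole directory first so the scandir handle is closed
    # before descending into subdirectories.
    with os.scandir(top) as it:
        entries = list(it)
    for entry in entries:
        yield entry
        if entry.is_dir(follow_symlinks=follow_symlinks):
            yield from scan_tree(entry.path, follow_symlinks)
```

Usage would be something like `for entry in scan_tree("."): print(entry.path)`; any is_dir(), is_file() or stat() answer taken from a yielded entry reflects the time of the scan, not the time of the call.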
I still think providing an fts-like API instead of os.walk would be the clearest way to provide cached data (especially since people could look up nice generic documentation on fts). I don't think it would be too hard to emulate it, or at least a large enough subset, on top of scandir (or directly on Windows FindFirst/etc.). It probably can't be exactly the same, because you can ask fts to return anything from just names to full stat structs, with well-defined performance characteristics for each combination of flags. But then I said that when the PEP was being discussed and never had time to actually try it... The bigger problem is that "you have to copy it, which is painful enough that you can't confuse yourself" is a much lower barrier to confusion in Python than in C, since copying is so much easier in Python, so it might not be as effective.