[Python-ideas] PEP471 - (os.scandir())

Mon Nov 30 04:12:23 EST 2015

On Nov 29, 2015, at 22:45, Nick Coghlan <ncoghlan at gmail.com> wrote:
> 
> On 28 November 2015 at 04:42, Andrew Barnert via Python-ideas
> <python-ideas at python.org> wrote:
>> On Nov 27, 2015, at 10:32, Paul Moore <p.f.moore at gmail.com> wrote:
>>> 
>>>> On 26 November 2015 at 23:22, Erik <python at lucidity.plus.com> wrote:
>>>> I have studied the PEP, followed a lot of the references and looked at the
>>>> 3.5.0 implementation. I can't see that I've missed such a thing already
>>>> existing, but it's possible. If so, perhaps this is instead a request to
>>>> make that thing more obvious somehow!
>>> 
>>> Does pathlib use scandir? If so, then maybe you get the caching
>>> benefits by using pathlib? And if pathlib doesn't use scandir, maybe
>>> it should? [I just checked, it looks like pathlib doesn't use scandir
>>> :-(]
>> 
>> Does pathlib even have a walk equivalent? (I know it has glob('**'), but that's not the same thing.)
>> 
>> Or are you suggesting that people should use path.iterdir with explicit recursion (or an explicit stack), and therefore just changing iterdir to use scandir (and prefill as many cached attribs as possible in each result) is what we want?
> 
> The main problem with having pathlib do any caching at all is that
> caching the results of stat calls implicitly in any context is a
> recipe for significant confusion, since you're at the mercy of race
> conditions as the filesystem changes out from underneath you. There
> also isn't an obviously "right" answer in the general case for cache
> invalidation, as in some cases you're interested in the file as it was
> when you originally opened it, and don't care if it got swapped out
> from underneath you, while in others you're interested in the file
> path, and want the filesystem details for right now, not the details
> from a few seconds ago.
> 
> For os.scandir(), we just delegate the behaviour to the underlying
> filesystem APIs - how readdir() and FindNextFile react to the
> directory contents changing during iteration is OS defined, and Python
> will inherit that variation (and may miss newly added files as a
> result).
> 
> The current os.walk() implementation constrains the scope of the
> scandir() filesystem state caching, since it doesn't let the DirEntry
> objects escape outside the generator

That's a good point. The fts functions that both BSD and GNU use to replace ftw and the various other old *nix filesystem walk functions deal with this carefully. The short version is that if you want to keep any information from the current file after moving on to the next one, you have to copy it explicitly, which makes it hard to confuse yourself about how up-to-date it is. (You can also go a directory at a time, more like os.walk, but with the same basic restriction: once you go to the next directory, the previous list of file entries is invalid.)
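
To make the staleness concern concrete, here is a rough sketch of what happens when a DirEntry outlives the loop that produced it. The helper name cached_vs_fresh_size and the file name are made up for illustration; the only behaviour relied on is the documented one that DirEntry.stat() caches its result on the entry object.

```python
import os
import tempfile


def cached_vs_fresh_size(name="demo.txt"):
    """Return (first, cached, fresh) sizes for a file that grows after scanning.

    Hypothetical helper: it only illustrates that DirEntry.stat() keeps
    returning the value it cached, while os.stat() asks the filesystem again.
    """
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, name)
        with open(path, "wb") as f:
            f.write(b"x")  # 1 byte on disk

        # Exhausting the iterator closes the underlying directory handle.
        entries = {e.name: e for e in os.scandir(directory)}
        entry = entries[name]

        first = entry.stat().st_size  # result is now cached on the entry

        with open(path, "ab") as f:
            f.write(b"yyyy")  # the file grows to 5 bytes

        cached = entry.stat().st_size  # still the cached value
        fresh = os.stat(entry.path).st_size  # queries the filesystem again
        return first, cached, fresh
```

The fix in application code is the same as with fts: either copy what you need while the entry is current, or call os.stat() on entry.path whenever you need an up-to-date answer.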

> - there's no need to ask yourself
> "What's the risk of stale filesystem data here?", since you're not
> getting access to the cached info in the first place, and hence always
> need to go query the filesystem directly.
> 
> This is a fairly universal pattern: for a given *application* you can
> likely figure out what to cache and when to invalidate it, even though
> those are unanswerable questions in the general case. Another example
> of that would be the stat caches in the current implementation of the
> import system, together with the corresponding need to call
> importlib.invalidate_caches() if you want to make sure the import
> system can see a module that was only just written to disk.
> 
> That's not to say that a general purpose directory walking utility
> producing DirEntry objects isn't an interesting prospect. Rather, it's
> an attempt to highlight that this is an area where there may be a
> significant gulf between "works for my use case" and "is a suitable
> addition to the standard library", particularly since this can now be
> a pure Python recipe atop os.scandir.

I still think providing an fts-like API instead of os.walk would be the clearest way to provide cached data (especially since people could look up nice generic documentation on fts). I don't think it would be too hard to emulate it on top of scandir (or directly on Windows FindFirst/etc.), or at least a large enough subset: you can ask fts to return anything from just names to full stat structs, with well-defined performance characteristics for each combination of flags, and that part probably can't be exactly the same. But then I said that when the PEP was being discussed and never had time to actually try it... The bigger problem is that "you have to copy it, which is painful enough that you can't confuse yourself" is a much lower barrier to confusion in Python than in C, since copying is so much easier in Python, so it might not be as effective.
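
For reference, the "pure Python recipe atop os.scandir" version is short. What follows is only a minimal sketch, not fts and not anything in the stdlib: walk_entries is a made-up name, it yields DirEntry objects one directory at a time using an explicit stack, and it makes no attempt to reproduce fts's flags or to invalidate old entries.

```python
import os


def walk_entries(top, follow_symlinks=False):
    """Yield os.DirEntry objects for everything below top.

    Hypothetical recipe: each directory is scanned once with os.scandir(),
    so callers get whatever type/stat information the OS handed back
    without an extra system call per file.
    """
    stack = [top]
    while stack:
        current = stack.pop()
        try:
            # list() exhausts the iterator, which closes the directory handle.
            entries = list(os.scandir(current))
        except OSError:
            continue  # unreadable directory: skip it, as os.walk does by default

        for entry in entries:
            yield entry
            try:
                is_dir = entry.is_dir(follow_symlinks=follow_symlinks)
            except OSError:
                is_dir = False  # treat entries we cannot inspect as non-directories
            if is_dir:
                stack.append(entry.path)
```

Nothing here stops a caller from hanging on to an entry after the generator has moved on, which is exactly the discipline fts imposes in C and Python doesn't.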