PEP471 - (os.scandir())

PEP 471 introduces a faster way of doing low-level directory traversal, which is then used to implement and speed up the higher-level API os.walk() - which, for me at least, is the "go-to API" for most directory scanning code I write.

However, when using os.walk(), the first thing one tends to do with the results is analyse them in some way (look at file sizes, datestamps and other things that stat() returns). That is exactly the information that os.scandir() is caching and speeding up, but it is then thrown away in order to emulate os.walk()'s original name-based API (well, name and type, as the directory/file distinction is also there).

So, I'd like to suggest an os.walk()-like API that returns the os.scandir() DirEntry structures rather than names (*). I have my own local version that's just a copy of os.walk() that appends "entry" rather than "entry.name" to the returned lists, but that's a nasty way of achieving this.

How to do it - os.walk() "direntries=True" keyword? os.walkentries() function? Something else better than those?

Regards, E.

(*) I have studied the PEP, followed a lot of the references and looked at the 3.5.0 implementation. I can't see that I've missed such a thing already existing, but it's possible. If so, perhaps this is instead a request to make that thing more obvious somehow!
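For illustration, the kind of helper being asked for might look roughly like the sketch below. It is not the poster's local version and not anything in the standard library: the names walk_entries and total_size are hypothetical, it assumes Python 3.6+ (where os.scandir() can be used as a context manager), and it simplifies os.walk() by never following symlinks and by silently skipping unreadable directories.

```python
import os


def walk_entries(top):
    """Yield (dirpath, dir_entries, file_entries) top-down, like os.walk(),
    but with os.DirEntry objects in the two lists instead of names.

    Hypothetical helper: symlinks are never followed, so a symlink to a
    directory is reported in file_entries.
    """
    dirs, files = [], []
    try:
        with os.scandir(top) as it:
            for entry in it:
                try:
                    is_dir = entry.is_dir(follow_symlinks=False)
                except OSError:
                    is_dir = False  # treat unreadable entries as files
                (dirs if is_dir else files).append(entry)
    except OSError:
        return  # unreadable directory: skip it, as os.walk() does by default
    yield top, dirs, files
    # Callers may prune `dirs` in place before the generator resumes.
    for d in dirs:
        yield from walk_entries(d.path)


def total_size(top):
    """Sum file sizes under `top` using the stat data cached on each entry."""
    return sum(
        entry.stat(follow_symlinks=False).st_size
        for _dirpath, _dirs, files in walk_entries(top)
        for entry in files
    )
```

Because the DirEntry objects escape the generator here, any stat data cached on them can be stale by the time the caller reads it; that trade-off is what the rest of the thread discusses.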

On 26 November 2015 at 23:22, Erik <python@lucidity.plus.com> wrote:
Does pathlib use scandir? If so, then maybe you get the caching benefits by using pathlib? And if pathlib doesn't use scandir, maybe it should? [I just checked, it looks like pathlib doesn't use scandir :-(] Paul

On 27 November 2015 at 18:32, Paul Moore <p.f.moore@gmail.com> wrote:
Never mind - see https://www.python.org/dev/peps/pep-0471/#return-values-being-pathlib-path-o... Pathlib objects must not cache the results of stat calls, so they cannot use scandir. Paul

On Nov 27, 2015, at 10:32, Paul Moore <p.f.moore@gmail.com> wrote:
Does pathlib even have a walk equivalent? (I know it has glob('**'), but that's not the same thing.) Or are you suggesting that people should use path.iterdir with explicit recursion (or an explicit stack), and therefore just changing iterdir to use scandir (and prefill as many cached attribs as possible in each result) is what we want?
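As a rough sketch of the "path.iterdir with an explicit stack" option mentioned above: the function name iter_tree is hypothetical, and the choice to skip symlinked directories and unreadable directories is an assumption made to keep the example short. (Later Python versions, 3.12 onwards, did add a Path.walk() method, which postdates this thread.)

```python
from pathlib import Path


def iter_tree(top):
    """Yield every path below `top`, depth-first, using an explicit stack
    instead of recursion. Hypothetical helper for illustration only."""
    stack = [Path(top)]
    while stack:
        current = stack.pop()
        try:
            children = sorted(current.iterdir())
        except OSError:
            continue  # unreadable directory: skip it
        for child in children:
            yield child
            # Each is_dir()/is_symlink() call here queries the filesystem,
            # which is the per-entry cost that scandir() tries to avoid.
            if child.is_dir() and not child.is_symlink():
                stack.append(child)
```

Each yielded Path carries only the path itself, so any stat information the caller wants has to be fetched again.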

On 28 November 2015 at 04:42, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
The main problem with having pathlib do any caching at all is that caching the results of stat calls implicitly in any context is a recipe for significant confusion, since you're at the mercy of race conditions as the filesystem changes out from underneath you. There also isn't an obviously "right" answer in the general case for cache invalidation, as in some cases you're interested in the file as it was when you originally opened it, and don't care if it got swapped out from underneath you, while in others you're interested in the file path, and want the filesystem details for right now, not the details from a few seconds ago.

For os.scandir(), we just delegate the behaviour to the underlying filesystem APIs - how readdir() and FindNextFile react to the directory contents changing during iteration is OS defined, and Python will inherit that variation (and may miss newly added files as a result).

The current os.walk() implementation constrains the scope of the scandir() filesystem state caching, since it doesn't let the DirEntry objects escape outside the generator - there's no need to ask yourself "What's the risk of stale filesystem data here?", since you're not getting access to the cached info in the first place, and hence always need to go query the filesystem directly.

This is a fairly universal pattern: for a given *application* you can likely figure out what to cache and when to invalidate it, even though those are unanswerable questions in the general case. Another example of that would be the stat caches in the current implementation of the import system, together with the corresponding need to call importlib.invalidate_caches() if you want to make sure the import system can see a module that was only just written to disk.

That's not to say that a general purpose directory walking utility producing DirEntry objects isn't an interesting prospect. Rather, it's an attempt to highlight that this is an area where there may be a significant gulf between "works for my use case" and "is a suitable addition to the standard library", particularly since this can now be a pure Python recipe atop os.scandir.

Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
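The stale-data risk described above can be seen in a small experiment. This is only a sketch: the helper name cached_vs_fresh_size is hypothetical, and it assumes Python 3.6+ for the scandir() context manager. It relies on the documented behaviour that DirEntry.stat() caches its result on the entry, while os.stat() always queries the filesystem.

```python
import os


def cached_vs_fresh_size(directory, name, extra=b"more data"):
    """Return (cached_size, fresh_size) for a file that grows after it
    has been scanned. Hypothetical helper for illustration only."""
    with os.scandir(directory) as it:
        entry = next(e for e in it if e.name == name)

    # Make sure the stat result is cached on the entry (on POSIX the
    # first stat() call does the system call; on Windows the data comes
    # from the directory scan itself).
    entry.stat()

    # The file changes after the entry was obtained...
    with open(entry.path, "ab") as f:
        f.write(extra)

    # ...but the DirEntry still reports what it cached earlier.
    return entry.stat().st_size, os.stat(entry.path).st_size
```

Whether the cached size or the fresh size is the "right" one depends on the application, which is the point made above about cache invalidation having no general answer.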

On Nov 29, 2015, at 22:45, Nick Coghlan <ncoghlan@gmail.com> wrote:
That's a good point. The fts functions that both BSD and GNU use to replace the ftw and the various other old *nix filesystem walk functions deal with this carefully; the short version is that any information you want to keep around from the current file after looking at the next file, you have to explicitly copy it, which makes it hard to confuse yourself about how up-to-date it is. (You can also go a directory at a time, more like os.walk, but with the same basic restriction: once you go to the next directory, the previous list of file entries is invalid.)
I still think providing an fts-like API instead of os.walk would be the clearest way to provide cached data (especially since people could look up nice generic documentation on fts). I don't think it would be too hard to emulate it (or a large enough subset--you can ask fts to return anything from just names to full stat structs, with well-defined performance characteristics for each combination of flags, and that probably can't be exactly the same) on top of scandir (or directly on Windows FindFirst/etc.), but then I said that when the PEP was being discussed and then never had time to actually try it... The bigger problem is that "you have to copy it, which is painful enough that you can't confuse yourself" is a much lower barrier to confusion in Python than in C, so it might not be as effective.
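To make the "explicitly copy what you want to keep" idea concrete in Python terms, one possible shape is sketched below. This is not the fts API and not a proposal from the thread: EntrySnapshot and snapshot are hypothetical names, and the set of copied fields is an arbitrary choice for illustration.

```python
import os
import time
from typing import NamedTuple


class EntrySnapshot(NamedTuple):
    """An explicit, immutable copy of selected DirEntry data (hypothetical)."""
    path: str
    is_dir: bool
    size: int
    mtime: float
    taken_at: float  # when the copy was made, so its age is visible


def snapshot(entry):
    """Copy the data to keep from an os.DirEntry, without following symlinks."""
    st = entry.stat(follow_symlinks=False)
    return EntrySnapshot(
        path=entry.path,
        is_dir=entry.is_dir(follow_symlinks=False),
        size=st.st_size,
        mtime=st.st_mtime,
        taken_at=time.time(),
    )
```

Unlike in C, nothing here invalidates the original DirEntry once iteration moves on, so the copy is a convention rather than a necessity, which matches the concern that the barrier to confusion is lower in Python.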

On 30 November 2015 at 09:12, Andrew Barnert <abarnert@yahoo.com> wrote:
I still think providing an fts-like API instead of os.walk would be the clearest way to provide cached data (especially since people could look up nice generic documentation on fts).
As a Windows user I'm not familiar with fts (and Google didn't come up with anything obvious). So I'm not sure how true "people could look up generic documentation" would be in practice. But from your description it may be useful - I presume it's something that could be built as a 3rd party library based on os.scandir, at least as an initial proof of concept? Paul

On Nov 30, 2015, at 01:40, Paul Moore <p.f.moore@gmail.com> wrote:
The perils of acronym-based naming; it's very easy to go from one of two meaningful search results to way down the list just because UrbanDictionary popularized some txt-speak slang and Wikipedia started covering every government agency in the world with its name translated to English... So you're right, that benefit no longer applies. You can still find "man fts" very easily, but that isn't what people would be looking for, and doesn't find any of the user-friendly tutorials, just the manpage.
A complete implementation that supported all of the flags and maintained the appropriate performance guarantees might be hard. But a partial implementation that supports just the most common flags and falls back to "stat everything, sometimes twice" as a proof of concept should be doable, once I get some free time.

participants (6)
- Andrew Barnert
- Eric Fahlgren
- Erik
- MRAB
- Nick Coghlan
- Paul Moore