[Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

Sun Jun 29 06:59:19 CEST 2014

On 29 June 2014 05:48, Ben Hoyt <benhoyt at gmail.com> wrote:
>>> But the underlying system calls -- ``FindFirstFile`` /
>>> ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
>>
>> What about FreeBSD, OpenBSD, NetBSD, Solaris, etc. They don't provide readdir?
>
> I guess it'd be better to say "Windows" and "Unix-based OSs"
> throughout the PEP? Because all of these (including Mac OS X) are
> Unix-based.

*nix and POSIX-based are the two conventions I use.

>> Crazy idea: would it be possible to "convert" a DirEntry object to a
>> pathlib.Path object without losing the cache? I guess that
>> pathlib.Path expects a full  stat_result object.
>
> The main problem is that pathlib.Path objects explicitly don't cache
> stat info (and Guido doesn't want them to, for good reason I think).
> There's a thread on python-dev about this earlier. I'll add it to a
> "Rejected ideas" section.

The key problem with caches on pathlib.Path objects is that you could
end up with two separate path objects that referred to the same
filesystem location but returned different answers about the
filesystem state because their caches might be stale. DirEntry is
different, as the content is generally *assumed* to be stale
(referring to when the directory was scanned, rather than the current
filesystem state). DirEntry.lstat() on POSIX systems will be an
exception to that general rule (referring to the time of first lookup,
rather than when the directory was scanned, so the answer rom lstat()
may be inconsistent with other data stored directly on the DirEntry
object), but one we can probably live with.

More generally, as part of the pathlib PEP review, we figured out that
a *per-object* cache of filesystem state would be an inherently bad
idea, but a string based *process global* cache might make sense for
modules like walkdir (not part of the stdlib - it's an iterator
pipeline based approach to file tree scanning I wrote a while back,
that currently suffers badly from the performance impact of repeated
stat calls at different stages of the pipeline). We realised this was
getting into a space where application and library specific concerns
are likely to start affecting the caching design, though, so the
current status of standard library level stat caching is "it's not
clear if there's an available approach that would be sufficiently
general purpose to be appropriate for inclusion in the standard
library".

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia