[Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

Nick Coghlan ncoghlan at gmail.com
Sun Jun 29 14:28:14 CEST 2014


On 29 June 2014 21:45, Paul Moore <p.f.moore at gmail.com> wrote:
> On 29 June 2014 12:08, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> This is what makes me wary of including lstat, even though Windows
>> offers it without the extra stat call. Caching behaviour is *really*
>> hard to make intuitive, especially when it *sometimes* returns data
>> that looks fresh (as it on first call on POSIX systems).
>
> If it matters that much we *could* simply call it cached_lstat(). It's
> ugly, but I really don't like the idea of throwing the information
> away - after all, the fact that we currently throw data away is why
> there's even a need for scandir. Let's not make the same mistake
> again...

Future-proofing is the reason DirEntry is a full fledged class in the
first place, though.

Effectively communicating the behavioural difference between DirEntry
and pathlib.Path is the main thing that makes me nervous about
adhering too closely to the Path API.

To restate the problem and the alternative proposal, these are the
DirEntry methods under discussion:

    is_dir(): like os.path.isdir(), but requires no system calls on at
least POSIX and Windows
    is_file(): like os.path.isfile(), but requires no system calls on
at least POSIX and Windows
    is_symlink(): like os.path.islink(), but requires no system calls
on at least POSIX and Windows
    lstat(): like os.lstat(), but requires no system calls on Windows

For the almost-certain-to-be-cached items, the suggestion is to make
them properties (or just ordinary attributes):

    is_dir
    is_file
    is_symlink

What do with lstat() is currently less clear, since POSIX directory
scanning doesn't provide that level of detail by default.

The PEP also doesn't currently state whether the is_dir(), is_file()
and is_symlink() results would be updated if a call to lstat()
produced different answers than the original directory scanning
process, which further suggests to me that allowing the stat call to
be delayed on POSIX systems is a potentially problematic and
inherently confusing design. We would have two options:

- update them, meaning calling lstat() may change those results from
being a snapshot of the setting at the time the directory was scanned
- leave them alone, meaning the DirEntry object and the
DirEntry.lstat() result may give different answers

Those both sound ugly to me.

So, here's my alternative proposal: add an "ensure_lstat" flag to
scandir() itself, and don't have *any* methods on DirEntry, only
attributes.

That would make the DirEntry attributes:

    is_dir: boolean, always populated
    is_file: boolean, always populated
    is_symlink boolean, always populated
    lstat_result: stat result, may be None on POSIX systems if
ensure_lstat is False

(I'm not particularly sold on "lstat_result" as the name, but "lstat"
reads as a verb to me, so doesn't sound right as an attribute name)

What this would allow:

- by default, scanning is efficient everywhere, but lstat_result may
be None on POSIX systems
- if you always need the lstat result, setting "ensure_lstat" will
trigger the extra system call implicitly
- if you only sometimes need the stat result, you can call os.lstat()
explicitly when the DirEntry lstat attribute is None

Most importantly, *regardless of platform*, the cached stat result (if
not None) would reflect the state of the entry at the time the
directory was scanned, rather than at some arbitrary later point in
time when lstat() was first called on the DirEntry object.

There'd still be a slight window of discrepancy (since the filesystem
state may change between reading the directory entry and making the
lstat() call), but this could be effectively eliminated from the
perspective of the Python code by making the result of the lstat()
call authoritative for the whole DirEntry object.

Regards,
Nick.

P.S. We'd be generating quite a few of these, so we can use __slots__
to keep the memory overhead to a minimum (that's just a general
comment - it's really irrelevant to the methods-or-attributes
question).


-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia


More information about the Python-Dev mailing list