[Python-Dev] My summary of the scandir (PEP 471)

Tue Jul 1 15:00:32 CEST 2014

Thanks for spinning this off to (hopefully) finish the discussion. I
agree it's nearly time to update the PEP.

> @Ben: it's time to update your PEP to complete it with this
> discussion! IMO DirEntry must be as simple as possible and portable:
>
> - os.scandir(str)
> - DirEntry.lstat_result object only available on Windows, same result
> as os.lstat()
> - DirEntry.fullname(): os.path.join(directory, DirEntry.name), where
> directory would be a hidden attribute of DirEntry

I'm quite strongly against this, and I think it's actually the worst
of both worlds. It is not as good an API because:

(a) it doesn't call stat for you (on POSIX), so you have to check an
attribute and call stat manually if you need it, turning what
should be one line of code into four. Your proposal above was kind of
how I had it originally, where you had to do extra tests and call
stat manually if you needed it (see
https://mail.python.org/pipermail/python-dev/2013-May/126119.html)
(b) the .lstat_result attribute is available on Windows but not on
POSIX, meaning it's very easy for Windows developers to write code
that will run and work fine on Windows, but then break horribly on
POSIX; I think it'd be better if it broke hard on Windows to make
writing cross-platform code easy
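
To make (a) concrete, here is a rough sketch of what calling code
would look like under that proposal. Entry is a hypothetical stand-in
for DirEntry (it is not a real stdlib class), with lstat_result set to
None to mimic what POSIX would give you:

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical stand-in for the proposed DirEntry: lstat_result is None
# where the OS did not supply stat data for free (i.e. on POSIX).
Entry = namedtuple("Entry", "directory name lstat_result")

def entry_size(entry):
    """Return the size of an entry, calling os.lstat() only when needed."""
    st = entry.lstat_result
    if st is None:
        # The extra check-and-call that every caller would have to write.
        st = os.lstat(os.path.join(entry.directory, entry.name))
    return st.st_size

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "data.bin")
    with open(path, "wb") as f:
        f.write(b"x" * 10)
    posix_like = Entry(d, "data.bin", None)
    windows_like = Entry(d, "data.bin", os.lstat(path))
    sizes = (entry_size(posix_like), entry_size(windows_like))
```

Code that skips the None check works with windows_like but raises
AttributeError with posix_like, which is the portability trap in (b).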

The two alternatives are:

1) the original proposal in the current version of PEP 471, where
DirEntry has an .lstat() method which calls os.lstat() on POSIX but is
free on Windows
2) Nick Coghlan's proposal on the previous thread
(https://mail.python.org/pipermail/python-dev/2014-June/135261.html)
suggesting an ensure_lstat keyword param to scandir if you need the
lstat_result value

I would make one small tweak to Nick Coghlan's proposal to make
writing cross-platform code easier. Instead of .lstat_result being
None sometimes (on POSIX), have it None always unless you specify
ensure_lstat=True. (Actually, call it get_lstat=True to kind of make
this more obvious.) Per (b) above, this means Windows developers
wouldn't accidentally write code which failed on POSIX systems -- it'd
fail fast on Windows too if you accessed .lstat_result without
specifying get_lstat=True.
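
A minimal sketch of that tweak, emulated on top of os.listdir() and
os.lstat(). The DirEntry class and the scandir() function here are
hypothetical illustrations, not the final API, and a real
implementation would get the stat data from the directory read itself
on Windows:

```python
import os
import tempfile

class DirEntry:
    """Hypothetical entry object; not the final PEP 471 API."""
    def __init__(self, name, lstat_result=None):
        self.name = name
        self.lstat_result = lstat_result

def scandir(path, get_lstat=False):
    """Yield DirEntry objects; lstat_result is None unless get_lstat=True."""
    for name in os.listdir(path):
        st = os.lstat(os.path.join(path, name)) if get_lstat else None
        yield DirEntry(name, st)

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "a.txt"), "w") as f:
        f.write("hello")
    without = [e.lstat_result for e in scandir(d)]
    with_stat = [e.lstat_result.st_size for e in scandir(d, get_lstat=True)]
```

Because lstat_result is None on every platform unless you ask for it,
code that forgets get_lstat=True fails the same way on Windows as on
POSIX.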

I'm still unsure which of these I like better. I think #1's API is
slightly nicer without the ensure_lstat parameter, and error handling
of the stat() is more explicit. But #2 always fetches the stat info at
the same time as the dir entry info, so it eliminates the problem of
having the file info change between scandir iteration and the .lstat()
call.

I'm leaning towards preferring #2 (Nick's proposal) because it solves
or gets around the caching issue. My one concern is error handling. Is
it an issue if scandir's __next__ can raise an OSError either from the
readdir() call or the call to stat()? My thinking is probably not. In
practice, would it ever really happen that readdir() would succeed but
an os.stat() immediately after would fail? I guess it could if the
file is deleted, but then if it were deleted a microsecond earlier the
readdir() would fail anyway, or not? Or does readdir give you a
consistent, "snap-shotted" view on things?
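
For what it's worth, the stat side of that race is easy to reproduce
by hand. This sketch uses os.listdir() in place of readdir() and
deletes the file itself to simulate another process doing so; it says
nothing about whether readdir() gives a snapshot:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    victim = os.path.join(d, "victim.txt")
    open(victim, "w").close()

    names = os.listdir(d)   # the directory read succeeds
    os.remove(victim)       # simulate another process deleting the file

    try:
        os.lstat(os.path.join(d, names[0]))
        error = None
    except OSError as exc:  # FileNotFoundError is a subclass of OSError
        error = exc
```

So a caller iterating with stat fetching enabled would need to be
prepared for an OSError from the iteration itself.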

The one other thing I'm not quite sure about with Nick's proposal is
the name .lstat_result, as it's long. I can see why he suggested that,
as .lstat sounds like a verb, but maybe that's okay? If we can have
.is_dir and .is_file as attributes, my thinking is an .lstat attribute
is fine too. I don't feel too strongly though.

> - I don't think that we should support scandir(bytes). If you really
> want to support os.scandir(bytes), it must raise an error on Windows
> since bytes filenames are already deprecated. It wouldn't make sense to
> add a new function with a deprecated feature. Since we have PEP 383
> (surrogateescape), it's better to advise using Unicode on all
> platforms. Almost all Python functions are able to encode Unicode
> filenames back automatically. Use os.fsencode() to encode manually if needed.

Really, are bytes filenames deprecated? I think maybe they should be,
as they don't work on Windows :-), but the latest Python "os" docs
(https://docs.python.org/3.5/library/os.html) still say that all
functions that accept path names accept either str or bytes, and
return a value of the same type where necessary. So I think scandir()
should do the same thing.
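
For reference, this is the existing os.listdir() behaviour I'd expect
scandir() to mirror (a small sketch using a throwaway temp directory):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "file.txt"), "w").close()

    str_names = os.listdir(d)                 # str in, str out
    bytes_names = os.listdir(os.fsencode(d))  # bytes in, bytes out
```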

> - We may not define a DirEntry.fullname() method: the directory name
> is usually well known. Ok, but every time that I use os.listdir(), I
> write os.path.join(directory, name) because in some cases I want the
> full path.

Agreed. I use this a lot too. However, I'd prefer a .fullname
attribute rather than a method, as it's free/cheap to compute and
doesn't require OS calls.
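
Roughly what I have in mind, as a sketch only. The class and the
_directory attribute name are hypothetical:

```python
import os

class DirEntry:
    """Hypothetical sketch; attribute names are illustrative only."""
    def __init__(self, directory, name):
        self._directory = directory  # the "hidden" directory attribute
        self.name = name

    @property
    def fullname(self):
        # Pure string manipulation: no system call is made here.
        return os.path.join(self._directory, self.name)

entry = DirEntry("some_dir", "file.txt")
full = entry.fullname
```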

Out of interest, why do we have .is_dir and .lstat_result but .fullname
rather than .full_name? .fullname seems reasonable to me, but maybe
consistency is a good thing here?

> - It must not be possible to "refresh" a DirEntry object. Call
> os.stat(entry.fullname()) or pathlib.Path(entry.fullname()) to get
> fresh data. DirEntry is only computed once, that's all. It's well
> defined.

I agree refresh() is not needed -- just use os.stat() or pathlib.

> - No Windows wildcard, you wrote that the feature has many corner
> cases, and it's only available on Windows. It's easy to combine
> scandir with fnmatch.

Agreed.
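
Something like the following would do, assuming scandir() ends up as
os.scandir() and yields entries with a .name attribute. The
scandir_glob() helper is a made-up name for illustration:

```python
import fnmatch
import os
import tempfile

def scandir_glob(path, pattern):
    """Yield entries in path whose names match a shell-style pattern."""
    for entry in os.scandir(path):
        if fnmatch.fnmatch(entry.name, pattern):
            yield entry

with tempfile.TemporaryDirectory() as d:
    for name in ("a.py", "b.py", "c.txt"):
        open(os.path.join(d, name), "w").close()
    matches = sorted(e.name for e in scandir_glob(d, "*.py"))
```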

-Ben