[Python-Dev] Updates to PEP 471, the os.scandir() proposal

Wed Jul 9 14:48:04 CEST 2014

> In this case because the names are exactly the same as the os versions which
> /do/ make a system call.

Fair enough.

> So if I'm finally understanding the root problem here:
>
>   - listdir returns a list of strings, one for each filename and one for
>     each directory, and keeps no other O/S supplied info.
>
>   - os.walk, which uses listdir, then needs to go back to the O/S and
>     refetch the thrown-away information
>
>   - so it's slow.
> ...
> and the new problem:
>
>   - not all O/Ses provide the same (or any) extra info about the
>     directory entries
>
> Have I got that right?

Yes, that's exactly right.
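
To make the cost concrete, here is a rough sketch of the listdir-based
pattern. It is a simplified illustration, not the real os.walk code, and
split_dirs_files is a made-up name. The point is that every name coming
back from os.listdir() needs its own lstat call before we know whether
it is a directory:

```python
import os
import stat

def split_dirs_files(path):
    """Split the entries of `path` into (dirs, files), os.walk-style.

    os.listdir() returns bare names only, so each entry costs one
    extra os.lstat() call just to learn whether it is a directory.
    """
    dirs, files = [], []
    for name in os.listdir(path):
        # The extra system call that scandir is meant to avoid.
        st = os.lstat(os.path.join(path, name))
        if stat.S_ISDIR(st.st_mode):
            dirs.append(name)
        else:
            files.append(name)
    return dirs, files
```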

> If so, I still like the attribute idea better (surprise!), we just need to
> revisit the 'ensure_lstat' (or whatever it's called) parameter:  instead of
> a true/false value, it could have a scale:
>
>   - 0 = whatever the O/S gives us
>
>   - 1 = at least the is_dir/is_file (whatever the other normal one is),
>         and if the O/S doesn't give it to us for free then call lstat
>
>   - 2 = we want it all -- call lstat if necessary on this platform
>
> After all, the programmer should know up front how much of the extra info
> will be needed for the work that is trying to be done.

Yeah, I think this is a good way to make option #2 a bit nicer. I
don't like the magic constants, and using constants like
os.SCANDIR_LSTAT is annoying, so how about using strings? I also
suggest calling the parameter "info" (because it determines what info
is returned), so you'd do scandir(path, info='type') if you need just
the is_X type information.

I also think it's nice to have a way for power users to "just return
what the OS gives us". However, I think making this the default is a
bad idea, as it's just asking for cross-platform bugs (and it's easy
to prevent).

Paul Moore basically agrees with this in his reply yesterday, though I
disagree with him that it would be unfriendly to fail hard unless you
asked for the info -- quite the opposite, Linux users would think it
very unfriendly when your code broke on their systems because you
didn't ask for the info. :-)

So how about tweaking option #2 a tiny bit more to this:

def scandir(path='.', info=None, onerror=None): ...

* if info is None (the default), only the .name and .full_name
attributes are present
* if info is 'type', scandir ensures the is_dir/is_file/is_symlink
attributes are present and either True or False
* if info is 'lstat', scandir additionally ensures a .lstat is present
and is a full stat_result object
* if info is 'os', scandir returns the attributes the OS provides
(everything on Windows, only is_X -- most of the time -- on POSIX)

* if onerror is not None and errors occur during any internal lstat()
call, onerror(exc) is called with the OSError exception object
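
For illustration only, here is a pure-Python sketch of how those rules
could behave. It is an emulation built on os.listdir() and os.lstat(),
so it shows the semantics but none of the speedup. The names
scandir_sketch and Entry are made up. It also assumes two things the
list above doesn't specify: an entry whose lstat fails is skipped after
onerror is called, and the error is re-raised if onerror is None. The
platform-dependent info='os' mode is left out.

```python
import os
import stat

class Entry:
    """Hypothetical stand-in for the proposed directory entry object."""
    def __init__(self, path, name):
        self.name = name
        self.full_name = os.path.join(path, name)

def scandir_sketch(path='.', info=None, onerror=None):
    # info='os' depends on the platform, so this emulation omits it.
    if info not in (None, 'type', 'lstat'):
        raise ValueError('unsupported info value: %r' % (info,))
    for name in os.listdir(path):
        entry = Entry(path, name)
        if info is not None:
            try:
                st = os.lstat(entry.full_name)
            except OSError as exc:
                if onerror is None:
                    raise          # assumption: no handler means propagate
                onerror(exc)
                continue           # assumption: skip the failed entry
            # With info='type' or 'lstat' these are always real booleans.
            entry.is_dir = stat.S_ISDIR(st.st_mode)
            entry.is_file = stat.S_ISREG(st.st_mode)
            entry.is_symlink = stat.S_ISLNK(st.st_mode)
            if info == 'lstat':
                entry.lstat = st
        # Anything not requested is simply never set on the entry.
        yield entry
```

So "for e in scandir_sketch(path, info='type')" can rely on e.is_dir on
any platform, while with the default info=None only e.name and
e.full_name exist.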

Further point -- because the is_dir/is_file/is_symlink attributes are
booleans, it would be very bad for them to be present but None if you
didn't ask for (or the OS didn't return) the type information. Because
then entry.is_dir would be None, "if entry.is_dir:" would take the
false branch, and your code would think it wasn't a directory, when
actually you don't know. For this reason, all attributes should fail
with AttributeError if not fetched.
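
A tiny demonstration of the difference, using two made-up entry classes
(NoneEntry and StrictEntry are hypothetical, just for this example):

```python
class NoneEntry:
    """Hypothetical entry where unfetched type info is left as None."""
    is_dir = None

class StrictEntry:
    """Hypothetical entry where unfetched attributes are simply absent."""
    def __init__(self, name):
        self.name = name

def describe(entry):
    # None is falsy, so "unknown" is silently treated as "not a directory".
    return 'directory' if entry.is_dir else 'not a directory'
```

describe(NoneEntry()) quietly returns 'not a directory' even though the
type was never fetched, whereas describe(StrictEntry('x')) raises
AttributeError, which points straight at the missing info.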

> Thank you for writing scandir, and this PEP.  Excellent work.

Thanks!

> Oh, and +1 for option 2, slightly modified.  :)

With the above tweaks, I'm getting closer to being 50/50. It's
probably 60% #1 and 40% #2 for me now. :-)

Okay folks -- please respond: option #1 as per the current PEP 471, or
option #2 with Ethan's multi-level idea, tweaked as per the above?

-Ben