[Python-Dev] Updates to PEP 471, the os.scandir() proposal

Ethan Furman ethan at stoneleaf.us
Wed Jul 9 03:31:55 CEST 2014

On 07/08/2014 06:08 PM, Ben Hoyt wrote:
>> Just like an attribute does not imply a system call, having a
>> method named 'is_dir' /does/ imply a system call, and not
>> having one can be just as misleading.
> Why does a method imply a system call? os.path.join() and str.lower()
> don't make system calls. Isn't it just a matter of clear
> documentation? Anyway -- less philosophical discussion below.

In this case, because the names are exactly the same as the os versions, which /do/ make a system call.

> I presume you're suggesting that is_dir/is_file/is_symlink should be
> regular attributes, and accessing them should never do a system call.
> But what if the system doesn't support d_type (eg: Solaris) or the
> d_type value is DT_UNKNOWN (can happen on Linux, OS X, BSD)? The
> options are:

So if I'm finally understanding the root problem here:

   - listdir returns a list of strings, one for each filename and one for
     each directory, and keeps no other O/S supplied info.

   - os.walk, which uses listdir, then needs to go back to the O/S and
     refetch the thrown-away information

   - so it's slow.
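
To make the slow pattern above concrete, here is a minimal sketch of a tree-size function built on listdir alone. The name tree_size_listdir is made up for illustration; the point is only that every entry costs an extra lstat call, because listdir kept nothing but the name.

```python
import os
import stat


def tree_size_listdir(path):
    """Sum file sizes under path using listdir plus one lstat per entry."""
    total = 0
    for name in os.listdir(path):      # names only; any type info is discarded
        full = os.path.join(path, name)
        st = os.lstat(full)            # extra system call for every entry
        if stat.S_ISDIR(st.st_mode):
            total += tree_size_listdir(full)
        else:
            total += st.st_size
    return total
```

On platforms where the directory read already reports the entry type, the lstat here is only needed for the size of regular files, not for deciding whether to recurse.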

The solution:

   - have scandir /not/ throw away the O/S supplied info

and the new problem:

   - not all O/Ses provide the same (or any) extra info about the
     directory entries

Have I got that right?

If so, I still like the attribute idea better (surprise!); we just need
to revisit the 'ensure_lstat' (or whatever it's called) parameter:
instead of a true/false value, it could have a scale:

   - 0 = whatever the O/S gives us

   - 1 = at least the is_dir/is_file (whatever the other normal one is),
         and if the O/S doesn't give it to us for free then call lstat

   - 2 = we want it all -- call lstat if necessary on this platform

After all, the programmer should know up front how much of the extra
info will be needed for the work to be done.
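
As a rough illustration of the three levels, the sketch below emulates them on top of os.scandir() as it exists in Python 3.6 and later, where is_dir() and stat() are methods that fall back to a system call only when the O/S did not supply the information. The function scan_with_level and its level parameter are hypothetical and are not part of any real API.

```python
import os


def scan_with_level(path, level=0):
    """Yield (name, is_dir, lstat_result) tuples for the entries in path.

    level 0: name only; is_dir and lstat_result are None
    level 1: is_dir is filled in (a stat call happens only if needed)
    level 2: is_dir and lstat_result are both filled in
    """
    with os.scandir(path) as entries:
        for entry in entries:
            is_dir = entry.is_dir(follow_symlinks=False) if level >= 1 else None
            st = entry.stat(follow_symlinks=False) if level >= 2 else None
            yield entry.name, is_dir, st
```

A caller that only needs to tell files from directories would ask for level 1, while something like get_tree_size would ask for level 2.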

> We have a choice before us, a fork in the road. :-) We can choose one
> of these options for the scandir API:
> 1) The current PEP 471 approach. This solves the issue with d_type
> being missing or DT_UNKNOWN, it doesn't require onerror, and it's a
> really tidy API that doesn't explode with AttributeErrors if you write
> code on Windows (without thinking too hard) and then move to Linux. I
> think all of these points are important -- the cross-platform one not
> the least, because we want to make it easy, even *trivial*, for people
> to write cross-platform code.

Yes, but we don't want a function that sucks equally on all platforms.  ;)

> 2) Nick Coghlan's model of only fetching the lstat value if
> ensure_lstat=True, and including an onerror callback for error
> handling when scandir calls lstat internally. However, as described,
> we'd also need an ensure_type=True option, so that scandir() isn't way
> slower than listdir() if you actually don't want the is_X values and
> d_type is missing/unknown.

With the multi-level version of 'ensure_lstat' we do not need an extra 'ensure_type'.

For reference, here's what get_tree_size() looks like with this approach, not including error handling with onerror:

    def get_tree_size(path):
        total = 0
        for entry in os.scandir(path, ensure_lstat=1):
            if entry.is_dir:
                total += get_tree_size(entry.full_name)
            else:
                total += entry.lstat_result.st_size
        return total

And if we added the onerror here it would be a line fragment, as opposed
to the extra four lines (at least) for the try/except in the first
example (which I cut).


Thank you for writing scandir, and this PEP.  Excellent work.

Oh, and +1 for option 2, slightly modified.  :)

