[Python-Dev] My summary of the scandir (PEP 471)

Wed Jul 2 14:41:28 CEST 2014

Thanks for the effort in your response, Paul.

I'm all for KISS, but let's just slow down a bit here.

> I think that thin wrapper is needed - even
> if the various bells and whistles are useful, they can be built on top
> of a low-level version (whereas the converse is not the case).

Yes, but API design is important. For example, urllib2 takes a kind of
"thin wrapper" approach, but millions of people use the 3rd-party
"requests" library because it's just so much nicer to use.

There are low-level functions in the "os" module, but there are also a
lot of higher-level functions (os.walk) and functions that smooth over
cross-platform issues (os.stat).

Detailed comments below.

> The return value is an object whose attributes correspond to the data
> the OS returns about a directory entry:
>
>   * name - the object's name
>   * full_name - the object's full name (including path)
>   * is_dir - whether the object is a directory
>   * is_file - whether the object is a plain file
>   * is_symlink - whether the object is a symbolic link
>
> On Windows, the following attributes are also available
>
>   * st_size - the size, in bytes, of the object (only meaningful for files)
>   * st_atime - time of last access
>   * st_mtime - time of last write
>   * st_ctime - time of creation
>   * st_file_attributes - Windows file attribute bits (see the
> FILE_ATTRIBUTE_* constants in the stat module)

Again, this seems like a nice simple idea, but I think it's actually a
worst-of-both-worlds solution -- it has a few problems:

1) It's a nasty API to actually write code with. If you try to use it,
it gives off a "made only for low-level library authors" rather than
"designed for developers" smell. For example, here's a get_tree_size()
function I use written in both versions (original is the PEP 471
version with the addition of .full_name):

import os
import stat

def get_tree_size_original(path):
    """Return total size of all files in directory tree at path."""
    total = 0
    for entry in os.scandir(path):
        if entry.is_dir():
            total += get_tree_size_original(entry.full_name)
        else:
            total += entry.lstat().st_size
    return total

def get_tree_size_new(path):
    """Return total size of all files in directory tree at path."""
    total = 0
    for entry in os.scandir(path):
        if hasattr(entry, 'is_dir') and hasattr(entry, 'st_size'):
            is_dir = entry.is_dir
            size = entry.st_size
        else:
            st = os.lstat(entry.full_name)
            is_dir = stat.S_ISDIR(st.st_mode)
            size = st.st_size
        if is_dir:
            total += get_tree_size_new(entry.full_name)
        else:
            total += size
    return total

I know which version I'd rather write and maintain! It seems to me
that folks new to Python could easily write the top version, but the
bottom one is longer, more complicated, and harder to get right. It
would also be very easy to write code in a way that works on Windows
but bombs hard on POSIX.

2) It seems like your assumption is that is_dir/is_file/is_symlink are
always available on POSIX via readdir. This isn't actually the case
(this was discussed in the original threads) -- if readdir() returns
dirent.d_type as DT_UNKNOWN, then you actually have to call os.stat()
anyway to get it. So, as the above definition of get_tree_size_new()
shows, you have to use getattr/hasattr on everything:
is_dir/is_file/is_symlink as well as the st_* attributes.

3) It's not much different in concept to the PEP 471 version, except
that PEP 471 has a built-in .lstat() method, making the user's life
much easier. This is the sense in which it's the worst of both worlds
-- it's a far less nice API to use, but it still has the same issues
with race conditions the original does.
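
To make point 2 concrete, here's a rough sketch of how a method-based
entry can hide the DT_UNKNOWN case from the caller. The Entry class and
its known_is_dir argument are made up for illustration (this is not the
PEP 471 implementation); None stands in for "the OS didn't tell us the
type", and the fallback is a cached os.lstat() call:

```python
import os
import stat


class Entry:
    """Hypothetical directory entry, for illustration only.

    known_is_dir is True/False when the OS reported the entry type,
    or None when it did not (the DT_UNKNOWN case).
    """

    def __init__(self, full_name, known_is_dir=None):
        self.full_name = full_name
        self._is_dir = known_is_dir
        self._lstat = None

    def lstat(self):
        # Call os.lstat() at most once and cache the result.
        if self._lstat is None:
            self._lstat = os.lstat(self.full_name)
        return self._lstat

    def is_dir(self):
        # Fall back to lstat() only when the type is unknown.
        if self._is_dir is None:
            self._is_dir = stat.S_ISDIR(self.lstat().st_mode)
        return self._is_dir
```

The caller just writes entry.is_dir() either way, so no hasattr()
checks are needed, and any OSError from the fallback surfaces at the
method call where it can be caught.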

So thinking about this again:

First, based on the +1's to Paul's new solution, I don't think people
are too concerned about the race condition issue (attributes being
different between the original readdir and the os.stat calls). I think
this is probably fair -- if folks care, they can handle it in an
application-specific way. So that means Paul's new solution and the
original PEP 471 approach are both okay on that score.

Second, comparing PEP 471 to Nick's solution: error handling is much
more straightforward and simple to document with the original PEP 471
approach (just try/except around the function calls) than with Nick's
get_lstat=True approach of doing the stat() if needed inside the
iterator. To catch errors with that approach, you'd either have to do
a "while True" loop and try/except around next(it) manually (which is
very yucky code), or we'd have to add an onerror callback, which is
somewhat less nice to use and harder to document (signature of the
callback, exception object, etc).
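
For reference, here's roughly what that manual loop looks like. This
is only a sketch: LstatIterator is a made-up stand-in for an iterator
that stats inside next() (it is not Nick's proposed code), and
sizes_skipping_errors is a hypothetical caller that skips entries it
can't stat:

```python
import os


class LstatIterator:
    """Hypothetical stand-in for an iterator that stats inside next()."""

    def __init__(self, path):
        self._path = path
        self._names = iter(os.listdir(path))

    def __iter__(self):
        return self

    def __next__(self):
        # StopIteration from the inner iterator ends the iteration.
        name = next(self._names)
        full_name = os.path.join(self._path, name)
        # os.lstat() may raise OSError in the middle of the iteration.
        return name, os.lstat(full_name)


def sizes_skipping_errors(it):
    """Map each name to its size, skipping entries that fail to stat."""
    sizes = {}
    while True:
        try:
            name, st = next(it)
        except StopIteration:
            break
        except OSError:
            continue  # a plain for loop could not resume after this
        sizes[name] = st.st_size
    return sizes
```

With the PEP 471 approach, the same skip-on-error behaviour is an
ordinary for loop with a try/except around entry.lstat().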

So given all of the above, I'm fairly strongly in favour of the
approach in the original PEP 471 due to its easy-to-use API and
straightforward try/except approach to error handling. (My second
option would be Nick's get_lstat=True with the onerror callback. My
third option would be Paul's attribute-only solution, as it's just
very hard to use.)

Thoughts?

-Ben