[Python-Dev] Updates to PEP 471, the os.scandir() proposal

Wed Jul 9 03:08:03 CEST 2014

> I did better than that -- I read the whole thing!  ;)

Thanks. :-)

> -1 on the PEP's implementation.
>
> Just like an attribute does not imply a system call, having a
> method named 'is_dir' /does/ imply a system call, and not
> having one can be just as misleading.

Why does a method imply a system call? os.path.join() and str.lower()
don't make system calls. Isn't it just a matter of clear
documentation? Anyway -- less philosophical discussion below.

> If we have this:
>
>     size = 0
>     for entry in scandir('/some/path'):
>         size += entry.st_size
>
>   - on Windows, this should Just Work (if I have the names correct ;)
>   - on Posix, etc., this should fail noisily with either an AttributeError
>     ('entry' has no 'st_size') or a TypeError (cannot add None)
>
> and the solution is equally simple:
>
>     for entry in scandir('/some/path', stat=True):
>
>   - if not Windows, perform a stat call at the same time

I'm not totally opposed to this, which is basically a combination of
Nick Coghlan's and Paul Moore's recent proposals mentioned in the PEP.
However, as discussed on python-dev, there are some edge cases it
doesn't handle very well, and it's messier to handle errors (requires
onerror as you mention below).

I presume you're suggesting that is_dir/is_file/is_symlink should be
regular attributes, and accessing them should never do a system call.
But what if the system doesn't support d_type (eg: Solaris) or the
d_type value is DT_UNKNOWN (can happen on Linux, OS X, BSD)? The
options are:

1) scandir() would always call lstat() in the case of missing/unknown
d_type. If so, scandir() is actually more expensive than listdir(),
and as a result it's no longer safe to implement listdir in terms of
scandir:

def listdir(path='.'):
    return [e.name for e in scandir(path)]

2) Or would it be better to have another flag like scandir(path,
type=True) to ensure the is_X type info is fetched? This is explicit,
but also getting kind of unwieldly.

3) A third option is for the is_X attributes to be absent in this case
(hasattr tests required, and the user would do the lstat manually).
But as I noted on python-dev recently, you basically always want is_X,
so this leads to unwieldly and code that's twice as long as it needs
to be. See here:
https://mail.python.org/pipermail/python-dev/2014-July/135312.html

4) I gather in your proposal above, scandir will call lstat() if
stat=True? Except where does it put the values? Surely it should
return an existing stat_result object, rather than stuffing everything
onto the DirEntry, or throwing away some values on Linux? In this
case, I'd prefer Nick Coghlan's approach of ensure_lstat and a
.stat_result attribute. However, this still has the "what if d_type is
missing or DT_UNKNOWN" issue.

It seems to me that making is_X() methods handles this exact scenario
-- methods are so you don't have to do the dirty work.

So yes, the real world is messy due to missing is_X values, but I
think it's worth getting this right, and is_X() methods can do this
while keeping the API simple and cross-platform.

> Now, of course, we might get errors.  I am not a big fan of wrapping everything in try/except, particularly when we already have a model to follow -- os.walk:

I don't mind the onerror too much if we went with this kind of
approach. It's not quite as nice as a standard try/except around the
method call, but it's definitely workable and has a precedent with
os.walk().

It seems a bit like we're going around in circles here, and I think we
have all the information and options available to us, so I'm going to
SUMMARIZE.

We have a choice before us, a fork in the road. :-) We can choose one
of these options for the scandir API:

1) The current PEP 471 approach. This solves the issue with d_type
being missing or DT_UNKNOWN, it doesn't require onerror, and it's a
really tidy API that doesn't explode with AttributeErrors if you write
code on Windows (without thinking too hard) and then move to Linux. I
think all of these points are important -- the cross-platform one not
the least, because we want to make it easy, even *trivial*, for people
to write cross-platform code.

For reference, here's what get_tree_size() looks like with this
approach, not including error handling with try/except:

def get_tree_size(path):
    total = 0
    for entry in os.scandir(path):
        if entry.is_dir():
            total += get_tree_size(entry.full_name)
        else:
            total += entry.lstat().st_size
    return total

2) Nick Coghlan's model of only fetching the lstat value if
ensure_lstat=True, and including an onerror callback for error
handling when scandir calls lstat internally. However, as described,
we'd also need an ensure_type=True option, so that scandir() isn't way
slower than listdir() if you actually don't want the is_X values and
d_type is missing/unknown.

For reference, here's what get_tree_size() looks like with this
approach, not including error handling with onerror:

def get_tree_size(path):
    total = 0
    for entry in os.scandir(path, ensure_type=True, ensure_lstat=True):
        if entry.is_dir:
            total += get_tree_size(entry.full_name)
        else:
            total += entry.lstat_result.st_size
    return total

I'm fairly strongly in favour of approach #1, but I wouldn't die if
everyone else thinks the benefits of #2 outweigh the somewhat less
nice API.

Comments and votes, please!

-Ben