[Python-Dev] My summary of the scandir (PEP 471)

Wed Jul 2 15:48:12 CEST 2014

tl;dr - I agree with your points and think that the original PEP 471
proposal is fine. The details here are just clarification of why my
proposal wasn't just "use PEP 471 as written" in the first place...

On 2 July 2014 13:41, Ben Hoyt <benhoyt at gmail.com> wrote:
> 1) It's a nasty API to actually write code with. If you try to use it,
> it gives off a "made only for low-level library authors" rather than
> "designed for developers" smell. For example, here's a get_tree_size()
> function I use written in both versions (original is the PEP 471
> version with the addition of .full_name):
>
> def get_tree_size_original(path):
>     """Return total size of all files in directory tree at path."""
>     total = 0
>     for entry in os.scandir(path):
>         if entry.is_dir():
>             total += get_tree_size_original(entry.full_name)
>         else:
>             total += entry.lstat().st_size
>     return total
>
> def get_tree_size_new(path):
>     """Return total size of all files in directory tree at path."""
>     total = 0
>     for entry in os.scandir(path):
>         if hasattr(entry, 'is_dir') and hasattr(entry, 'st_size'):
>             is_dir = entry.is_dir
>             size = entry.st_size
>         else:
>             st = os.lstat(entry.full_name)
>             is_dir = stat.S_ISDIR(st.st_mode)
>             size = st.st_size
>         if is_dir:
>             total += get_tree_size_new(entry.full_name)
>         else:
>             total += size
>     return total
>
> I know which version I'd rather write and maintain!

Fair point. But *only* because is_dir isn't guaranteed to be
available. I could debate other aspects of your translation to use my
API, but it's not relevant as my proposal was flawed in terms of
is_XXX.

> It seems to me new
> users and folks new to Python could easily write the top version, but
> the bottom is longer, more complicated, and harder to get right.

Given the is_dir point, agreed.

> It
> would also be very easy to write code in a way that works on Windows
> but bombs hard on POSIX.

You may have a point here - my Windows bias may be showing. It's
already awfully easy to write code that works on POSIX but bombs hard
on Windows (deleting open files, for example) so I find it tempting to
think "give them a taste of their own medicine" :-)

More seriously, it seems to me that the scandir API is specifically
designed to write efficient code on platforms where the OS gives
information that allows you to do so. Warping the API too much to
cater for platforms where that isn't possible seems to have the
priorities backwards. Making the API not be an accident waiting to
happen is fine, though.

And let's be careful, too. My position is that it's not too hard to
write code that works on Windows, Linux and OS X but you're right you
could miss the problem with platforms that don't even support a free
is_dir(). It's *easier* to write Windows-only code by mistake, but the
fix to cover the "big three" is pretty simple (if not hasattr, lstat).

> 2) It seems like your assumption is that is_dir/is_file/is_symlink are
> always available on POSIX via readdir. This isn't actually the case
> (this was discussed in the original threads) -- if readdir() returns
> dirent.d_type as DT_UNKNOWN, then you actually have to call os.stat()
> anyway to get it. So, as the above definition of get_tree_size_new()
> shows, you have to use getattr/hasattr on everything:
> is_dir/is_file/is_symlink as well as the st_* attributes.

Ah, the wording in the PEP says "Linux, Windows, OS X". Superficially,
that said "everywhere" to me. It might be worth calling out
specifically some examples where it's not available without an extra
system call, just to make the point explicit.

You're right, though, that blows away the simplicity of my proposal.
The original PEP 471 seems precisely right to me, in that case.

I was only really arguing for attributes because they seem more
obviously static than a method call. And personally I don't care about
that aspect.

> 3) It's not much different in concept to the PEP 471 version, except
> that PEP 471 has a built-in .lstat() method, making the user's life
> much easier. This is the sense in which it's the worst of both worlds
> -- it's a far less nice API to use, but it still has the same issues
> with race conditions the original does.

Agreed. My intent was never to remove the race conditions, I see them
as the responsibility of the application to consider (many
applications simply won't care, and those that do will likely want a
specific solution, not a library-level compromise).

> So thinking about this again:
>
> First, based on the +1's to Paul's new solution, I don't think people
> are too concerned about the race condition issue (attributes being
> different between the original readdir and the os.stat calls). I think
> this is probably fair -- if folks care, they can handle it in an
> application-specific way. So that means Paul's new solution and the
> original PEP 471 approach are both okay on that score.

+1. That was my main point, in actual fact

> Second, comparing PEP 471 to Nick's solution: error handling is much
> more straight-forward and simple to document with the original PEP 471
> approach (just try/catch around the function calls) than with Nick's
> get_lstat=True approach of doing the stat() if needed inside the
> iterator. To catch errors with that approach, you'd either have to do
> a "while True" loop and try/catch around next(it) manually (which is
> very yucky code), or we'd have to add an onerror callback, which is
> somewhat less nice to use and harder to document (signature of the
> callback, exception object, etc).

Agreed. If my solution had worked, it would have been by isolating a
few extra cases where you could guarantee errors won't happen. But
actually, errors *can* happen in those cases, on certain systems. So
PEP 471 wins on all counts here too.

> So given all of the above, I'm fairly strongly in favour of the
> approach in the original PEP 471 due to it's easy-to-use API and
> straight-forward try/catch approach to error handling. (My second
> option would be Nick's get_lstat=True with the onerror callback. My
> third option would be Paul's attribute-only solution, as it's just
> very hard to use.)

Agreed. The solution I proposed isn't just "very hard to use", it's
actually wrong. If is_XXX are optional attributes, that's not my
solution, and I agree it's *awful*.

Paul.

PS I'd suggest adding a "Rejected proposals" section to the PEP which
mentions the race condition issue and points to this discussion as an
indication that people didn't seem to see it as a problem.

On 2 July 2014 13:41, Ben Hoyt <benhoyt at gmail.com> wrote:
> Thanks for the effort in your response, Paul.
>
> I'm all for KISS, but let's just slow down a bit here.
>
>> I think that thin wrapper is needed - even
>> if the various bells and whistles are useful, they can be built on top
>> of a low-level version (whereas the converse is not the case).
>
> Yes, but API design is important. For example, urllib2 has a kind of
> the "thin wrapper approach", but millions of people use the 3rd-party
> "requests" library because it's just so much nicer to use.
>
> There are low-level functions in the "os" module, but there are also a
> lot of higher-level functions (os.walk) and functions that smooth over
> cross-platform issues (os.stat).
>
> Detailed comments below.
>
>> The return value is an object whose attributes correspond to the data
>> the OS returns about a directory entry:
>>
>>   * name - the object's name
>>   * full_name - the object's full name (including path)
>>   * is_dir - whether the object is a directory
>>   * is file - whether the object is a plain file
>>   * is_symlink - whether the object is a symbolic link
>>
>> On Windows, the following attributes are also available
>>
>>   * st_size - the size, in bytes, of the object (only meaningful for files)
>>   * st_atime - time of last access
>>   * st_mtime - time of last write
>>   * st_ctime - time of creation
>>   * st_file_attributes - Windows file attribute bits (see the
>> FILE_ATTRIBUTE_* constants in the stat module)
>
> Again, this seems like a nice simple idea, but I think it's actually a
> worst-of-both-worlds solution -- it has a few problems:
>
> 1) It's a nasty API to actually write code with. If you try to use it,
> it gives off a "made only for low-level library authors" rather than
> "designed for developers" smell. For example, here's a get_tree_size()
> function I use written in both versions (original is the PEP 471
> version with the addition of .full_name):
>
> def get_tree_size_original(path):
>     """Return total size of all files in directory tree at path."""
>     total = 0
>     for entry in os.scandir(path):
>         if entry.is_dir():
>             total += get_tree_size_original(entry.full_name)
>         else:
>             total += entry.lstat().st_size
>     return total
>
> def get_tree_size_new(path):
>     """Return total size of all files in directory tree at path."""
>     total = 0
>     for entry in os.scandir(path):
>         if hasattr(entry, 'is_dir') and hasattr(entry, 'st_size'):
>             is_dir = entry.is_dir
>             size = entry.st_size
>         else:
>             st = os.lstat(entry.full_name)
>             is_dir = stat.S_ISDIR(st.st_mode)
>             size = st.st_size
>         if is_dir:
>             total += get_tree_size_new(entry.full_name)
>         else:
>             total += size
>     return total
>
> I know which version I'd rather write and maintain! It seems to me new
> users and folks new to Python could easily write the top version, but
> the bottom is longer, more complicated, and harder to get right. It
> would also be very easy to write code in a way that works on Windows
> but bombs hard on POSIX.
>
> 2) It seems like your assumption is that is_dir/is_file/is_symlink are
> always available on POSIX via readdir. This isn't actually the case
> (this was discussed in the original threads) -- if readdir() returns
> dirent.d_type as DT_UNKNOWN, then you actually have to call os.stat()
> anyway to get it. So, as the above definition of get_tree_size_new()
> shows, you have to use getattr/hasattr on everything:
> is_dir/is_file/is_symlink as well as the st_* attributes.
>
> 3) It's not much different in concept to the PEP 471 version, except
> that PEP 471 has a built-in .lstat() method, making the user's life
> much easier. This is the sense in which it's the worst of both worlds
> -- it's a far less nice API to use, but it still has the same issues
> with race conditions the original does.
>
> So thinking about this again:
>
> First, based on the +1's to Paul's new solution, I don't think people
> are too concerned about the race condition issue (attributes being
> different between the original readdir and the os.stat calls). I think
> this is probably fair -- if folks care, they can handle it in an
> application-specific way. So that means Paul's new solution and the
> original PEP 471 approach are both okay on that score.
>
> Second, comparing PEP 471 to Nick's solution: error handling is much
> more straight-forward and simple to document with the original PEP 471
> approach (just try/catch around the function calls) than with Nick's
> get_lstat=True approach of doing the stat() if needed inside the
> iterator. To catch errors with that approach, you'd either have to do
> a "while True" loop and try/catch around next(it) manually (which is
> very yucky code), or we'd have to add an onerror callback, which is
> somewhat less nice to use and harder to document (signature of the
> callback, exception object, etc).
>
> So given all of the above, I'm fairly strongly in favour of the
> approach in the original PEP 471 due to it's easy-to-use API and
> straight-forward try/catch approach to error handling. (My second
> option would be Nick's get_lstat=True with the onerror callback. My
> third option would be Paul's attribute-only solution, as it's just
> very hard to use.)
>
> Thoughts?
>
> -Ben