[Python-Dev] pathlib and issue 11406 (a directory iterator returning stat-like info)

Sun Nov 24 23:20:08 CET 2013

Hi folks,

I decided to start another thread for my thoughts on the interaction
between pathlib (Antoine's new PEP 428), issue 11406 (proposal for a
directory iterator returning stat-like info), and my own scandir
library, which implements something along the lines of issue 11406.

My scandir library (https://github.com/benhoyt/scandir) is something
I've been working on for a while -- it provides a scandir() function
which uses the OS's directory iterator functions (readdir,
FindFirstFile, etc.) to expose as much stat-like information as
possible. This way functions like os.walk() can use the info
(particularly "is_dir()") and avoid tons of extra calls to os.stat().

This provides a huge speed boost for os.walk() in many cases: I've
seen 3-4x on Linux, and up to 20x on Windows. (It depends on various
things, not least of which is Windows' weird stat caching -- if I run
my scandir benchmark "fresh", I get os.walk() running 8-9 times as
fast as the built-in one. But if I run it after an un-hibernate,
suddenly it runs 18-20 times as fast as the built-in one. Either way,
huge gains, especially on Windows.)
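
To make the os.walk() point concrete, here's a rough sketch of a
simplified top-down walk built on a scandir()-style iterator. As an
assumption for illustration, it uses the os.scandir() function that
later Python versions (3.6+ for the with-statement form) ship in the
standard library as a stand-in for the scandir library, and unlike the
real os.walk() it treats symlinks to directories as plain files and
does no error handling.

```python
import os


def walk(top):
    """Simplified top-down os.walk() built on a scandir()-style iterator."""
    dirs, files = [], []
    with os.scandir(top) as entries:
        for entry in entries:
            # is_dir() can usually be answered from the directory listing
            # itself, so no separate os.stat() call is needed per entry.
            if entry.is_dir(follow_symlinks=False):
                dirs.append(entry.name)
            else:
                files.append(entry.name)
    yield top, dirs, files
    # Recurse only after the directory handle above has been closed.
    for name in dirs:
        yield from walk(os.path.join(top, name))
```

The only per-entry work is the is_dir() check, which is where the
saving over one os.stat() per file would come from.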

scandir.scandir() yields DirEntry objects, which have .isdir(),
.isfile(), .islink(), and .lstat() methods. Look familiar? When I
was reading PEP 428 and saw .is_file(), .is_dir(), and .stat(), I
thought -- surely I can merge this with pathlib and Path objects.

The first thing I can do to scandir is rename my isdir()-style
methods to match PEP 428's, so that DirEntry quacks like a Path
object where it can.
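
As a small sketch of what that duck typing buys, the helper below
only relies on .name and .is_dir(), so it accepts either kind of
object. The subdir_names() helper is hypothetical, and the example
assumes entries that already use the PEP 428 spelling, as the
os.scandir() entries in later Python versions do.

```python
import os
from pathlib import Path


def subdir_names(entries):
    """Return the sorted names of the directories among `entries`.

    Works with anything exposing .name and .is_dir(), whether that is
    a pathlib.Path or a scandir-style directory entry.
    """
    return sorted(entry.name for entry in entries if entry.is_dir())
```

Both subdir_names(Path(".").iterdir()) and
subdir_names(os.scandir(".")) should give the same list for the same
directory.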

However, I'm wondering if I can change scandir to return actual Path
objects. Or better, because Path already helpfully provides iterdir()
which yields Path objects, and Path objects have .is_dir() etc, can
scandir()-like behaviour simply work out-of-the-box?

This mainly depends on how Path is going to cache stat information. If
it caches it, then this will just work. Sounds like Guido's opinion
was that both cached and uncached use cases are important, but that it
should be very clear which one you're getting. I personally like the
.stat() and .restat() idea.
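
For what it's worth, here is a minimal sketch of how I read the
.stat() / .restat() split: stat() returns a cached result, and
restat() explicitly goes back to the filesystem. The CachingPath class
is purely hypothetical and is not how pathlib is implemented.

```python
import os


class CachingPath:
    """Hypothetical path wrapper: stat() is cached, restat() refreshes."""

    def __init__(self, path):
        self.path = os.fspath(path)
        self._stat = None  # filled in lazily, or by a directory iterator

    def stat(self):
        # Hit the filesystem only on first use; later calls reuse the result.
        if self._stat is None:
            self._stat = os.stat(self.path)
        return self._stat

    def restat(self):
        # Always hit the filesystem and replace the cached result.
        self._stat = os.stat(self.path)
        return self._stat
```

With something like this, a directory iterator could pre-fill the
cache from the OS's listing, and callers who need fresh data would say
so by calling restat().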

The other related thing is that DirEntry only provides .lstat(),
because it's providing stat-like info without following links.

Note in this context that it's not just "network filesystems" on which
stat() is slow (https://mail.python.org/pipermail/python-dev/2013-May/125805.html).
It's quite slow on Windows under various conditions too.

See also Nick Coghlan's post about a DirEntry-style object on the
issue 11406 thread:
https://mail.python.org/pipermail/python-dev/2013-May/126148.html

Thoughts and suggestions for how to merge scandir with pathlib's
approach? It's important to me that pathlib's API doesn't cut itself
off from a more efficient implementation of the ideas from issue 11406
and scandir...

Thanks,
Ben.