[Python-Dev] Issue 11406: adding os.scandir(), a directory iterator returning stat-like info

Fri May 10 12:55:56 CEST 2013

A few of us were having a discussion at
http://bugs.python.org/issue11406 about adding os.scandir(): a
generator version of os.listdir() to make iterating over very large
directories more memory efficient. This also reflects how the OS gives
things to you -- it doesn't give you a big list, but you call a
function to iterate and fetch the next entry.

While I think that's a good idea, I'm not sure that alone is enough
of an improvement to make adding the generator version worth it.

But what would make this a killer feature is making os.scandir()
generate tuples of (name, stat_like_info). The Windows directory
iteration functions (FindFirstFile/FindNextFile) give you the full
stat information for free, and the Linux and OS X functions
(opendir/readdir) give you partial file information (d_type in the
dirent struct, which is basically the file-type part of st_mode:
whether it's a file, directory, link, etc).
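
To make the d_type/st_mode relationship concrete, here is a minimal
sketch. The DT_* values are the ones defined in <dirent.h> on Linux
(glibc) and the BSDs/OS X; the helper name d_type_to_mode is made up
for illustration and is not part of any proposal.

```python
import stat

# d_type values from <dirent.h> (Linux glibc, BSDs, OS X)
DT_UNKNOWN = 0
DT_FIFO = 1
DT_CHR = 2
DT_DIR = 4
DT_BLK = 6
DT_REG = 8
DT_LNK = 10
DT_SOCK = 12


def d_type_to_mode(d_type):
    """Return the file-type bits of st_mode for a d_type (hypothetical helper).

    Returns None for DT_UNKNOWN, meaning the caller has to stat the file.
    """
    if d_type == DT_UNKNOWN:
        return None
    # Same shift the C macro DTTOIF() uses: the type lives in the top bits
    return d_type << 12


mode = d_type_to_mode(DT_DIR)
print(stat.S_ISDIR(mode))
```

Only the file type is recoverable this way; the permission bits that
a real st_mode carries are not part of d_type.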

Having this available at the Python level would mean we can vastly
speed up functions like os.walk() that otherwise need to make an
os.stat() call for every file returned. In my benchmarks of such a
generator on Windows, it speeds up os.walk() by 9-10x. On Linux/OS X,
it's more like 1.5-3x. In my opinion, that kind of gain is huge,
especially on Windows, but also on Linux/OS X.

So the idea is to add this relatively low-level function that exposes
the extra information the OS gives us for free, but which os.listdir()
currently throws away. Then higher-level, platform-independent
functions like os.walk() could use os.scandir() to get much better
performance. People over at Issue 11406 think this is a good idea.

HOWEVER, there's debate over what kind of object the second element in
the tuple, "stat_like_info", should be. My strong vote is for it to be
a stat_result-like object, but where the fields are None if they're
unknown. There would be basically three scenarios:

1) stat_result with all fields set: this would happen on Windows,
where you get as much info from FindFirst/FindNext as from an
os.stat()
2) stat_result with just st_mode set, and all other fields None: this
would be the usual case on Linux/OS X
3) stat_result with all fields None: this would happen on systems
whose readdir()/dirent doesn't have d_type, or on Linux/OS X when
d_type was DT_UNKNOWN

Higher-level functions like os.walk() would then check that the fields
they need are not None, and call os.stat() only if necessary, for example:

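import os
import stat

# Build lists of files and directories in path
# (os.scandir() here is the proposed function, yielding (name, st) tuples)
files = []
dirs = []
for name, st in os.scandir(path):
    if st.st_mode is None:
        # The OS gave us no type info for this entry; fall back to os.stat()
        st = os.stat(os.path.join(path, name))
    if stat.S_ISDIR(st.st_mode):
        dirs.append(name)
    else:
        files.append(name)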
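
To show how a walk could be layered on top of this, here is a
self-contained sketch. scandir_standin is a hypothetical pure-Python
stand-in that mimics only the proposed (name, st) interface; it calls
os.lstat() for every entry, so it has none of the speed benefit.
walk_sketch is likewise an illustrative name, not the real os.walk().

```python
import os
import stat


def scandir_standin(path):
    """Hypothetical stand-in for the proposed os.scandir().

    Yields (name, st) pairs lazily. A real version would get st from the
    directory-iteration call itself instead of calling os.lstat().
    """
    for name in os.listdir(path):
        yield name, os.lstat(os.path.join(path, name))


def walk_sketch(top, scandir=scandir_standin):
    """Top-down walk yielding (dirpath, dirnames, filenames) tuples."""
    dirs, files = [], []
    for name, st in scandir(top):
        mode = st.st_mode
        if mode is None:
            # No type info from the directory scan: pay for one stat call.
            # lstat is used so symlinks are reported as links, which is
            # also what d_type would report for them.
            mode = os.lstat(os.path.join(top, name)).st_mode
        if stat.S_ISDIR(mode):
            dirs.append(name)
        else:
            files.append(name)
    yield top, dirs, files
    for name in dirs:
        # Recurse into real directories only (symlinks landed in files)
        yield from walk_sketch(os.path.join(top, name), scandir)


for dirpath, dirnames, filenames in walk_sketch(os.curdir):
    print(dirpath, len(dirnames), len(filenames))
    break  # just show the top level
```

One design difference from os.walk() to be aware of: because this
sketch classifies entries with lstat, a symlink pointing at a
directory ends up in filenames rather than dirnames.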

Not bad for a 1.5-10x performance boost, right? What do folks think?

Cheers,
Ben.

P.S. A few non-essential further notes:

1) As a Windows guy, I think a nice-to-have addition to os.scandir()
would be a keyword arg like win_wildcard that defaults to '*.*', but
which power users could pass in to make use of the wildcard feature of
FindFirst/FindNext on Windows. We have plenty of other low-level
functions in the os module that expose OS-specific features, so this
would be no different. But then again, it's not nearly as important as
exposing the stat info.

2) I've been dabbling with this concept for a while in my BetterWalk
library: https://github.com/benhoyt/betterwalk

Note that the benchmarks there are old, and I've made further
improvements in my local copy. The ctypes version gives speed gains
for os.walk() of 2-3x on Windows, but I've also got a C version, which
is giving 9-10x speed gains. I haven't yet got a Linux/OS X version
written in C.

3) See also the previous python-ideas thread on BetterWalk:
http://mail.python.org/pipermail/python-ideas/2012-November/017944.html