[Python-ideas] Speed up os.walk() 5x to 9x by using file attributes from FindFirst/NextFile() and readdir()

Fri Nov 9 11:29:33 CET 2012

I've noticed that os.walk() is a lot slower than it needs to be because it
does an os.stat() call for every file/directory. It does this because it uses
listdir(), which doesn't return any file attributes.

So instead of using the file information provided by FindFirstFile() /
FindNextFile() and readdir() -- which listdir() calls -- os.walk() does a
stat() on every file to see whether it's a directory or not. Both
FindFirst/FindNext and readdir() give this information already. Using it would
basically bring the number of system calls down from O(N) to O(log N).

I've written a proof-of-concept (see [1] below) using ctypes and
FindFirst/FindNext on Windows, showing that for sizeable directory trees it
gives a 4x to 6x speedup -- so this is not a micro-optimization!

I started trying the same thing with opendir/readdir on Linux, but don't have
as much experience there, and wanted to get some feedback on the concept
first. I assume it'd be a similar speedup by using d_type & DT_DIR from
readdir().

The problem is even worse when you're calling os.walk() and then doing your
own stat() on each file, for example, to get the total size of all files in a
tree -- see [2]. It means it's calling stat() twice on every file, and I see
about a 9x speedup in this scenario using the info FindFirst/Next provide.

So there are a couple of things here:

1) The simplest thing to do would be to keep the APIs exactly the same, and
get the ~5x speedup on os.walk() -- on Windows, unsure of the exact speedup on
Linux. And on OS's where readdir() doesn't return "is directory" information,
obviously it'd fall back to using the stat on each file.

2) We could significantly improve the API by adding a listdir_stat() or
similar function, which would return a list of (filename, stat_result) tuples
instead of just the names. That might be useful in its own right, but of
course os.walk() could use it to speed itself up. Then of course it might be
good to have walk_stat() which you could use to speed up the "summing sizes"
cases.

Other related improvements to the listdir/walk APIs that could be considered
are:

* Using the wildcard/glob that FindFirst/FindNext take to do filtering -- this
would avoid fnmatch-ing and might speed up large operations, though I don't
have numbers. Obviously we'd have to simulate this with fnmatch on non-Windows
OSs, but this kind of filtering is something I've done with os.walk() many
times, so just having the API option would be useful ("glob" keyword arg?).

* Changing listdir() to yield instead of return a list (or adding yieldir?).
This fits both the FindNext/readdir APIs, and would address issues like [3].

Anyway, cutting a long story short -- do folks think 1) is a good idea? What
about some of the thoughts in 2)? In either case, what would be the best way
to go further on this?

Thanks,
Ben.

[1]: https://gist.github.com/4044946
[2]: http://stackoverflow.com/questions/2485719/very-quickly-getting-total-size-of-folder/2485843#2485843
[3]: http://stackoverflow.com/questions/4403598/list-files-in-a-folder-as-a-stream-to-begin-process-immediately