[Python-ideas] Speed up os.walk() 5x to 9x by using file attributes from FindFirst/NextFile() and readdir()
benhoyt at gmail.com
Sun Nov 18 21:52:39 CET 2012
> Passing FTS_NOSTAT to fts is about 3x faster, but only 8% faster than os.walk
> with the stat calls hacked out, and 40% slower than find.
Relatedly, I've just finished a proof-of-concept version of
iterdir_stat() for Linux using readdir_r and ctypes, and it was only
about 10% faster than the existing os.walk on large directories. I was
surprised by this, given that I saw a 400% speedup removing the
stat()s on Windows, but I guess it means that stat() and/or system
calls in general are *much* faster or better cached on Linux.
Still, it's definitely worth it for the huge speedup on Windows, and I
think using the dirent d_type info on Linux is the right thing to do
even though the speed gain is small -- it's still faster, and it still
saves all those os.stat()s. Also, I'm doing this via ctypes in pure
Python, so doing it in C may give another small boost, especially for
the Linux version.
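To make the idea concrete, here is a minimal sketch of a walk that takes
the directory/file distinction from the directory listing itself rather
than from a separate os.stat() per entry. This is not the ctypes code
from the proof of concept: it assumes os.scandir(), which exists in
Python 3.5 and later, and the name walk_nostat is hypothetical.

```python
import os


def walk_nostat(top):
    """Yield (dirpath, dirnames, filenames) top-down, like os.walk().

    Hypothetical helper for illustration. Entry types come from
    os.scandir(), which can answer is_dir() from dirent d_type on Linux
    or from the FindFirstFile/FindNextFile data on Windows, so most
    entries need no extra os.stat() call.
    """
    dirnames, filenames = [], []
    with os.scandir(top) as entries:
        for entry in entries:
            # follow_symlinks=False: a symlink is never treated as a
            # directory here, so it ends up in filenames and is not
            # descended into.
            if entry.is_dir(follow_symlinks=False):
                dirnames.append(entry.name)
            else:
                filenames.append(entry.name)
    yield top, dirnames, filenames
    # Recurse only after the current directory has been fully listed.
    for name in dirnames:
        yield from walk_nostat(os.path.join(top, name))
```

Usage mirrors os.walk(): "for dirpath, dirnames, filenames in
walk_nostat(some_dir): ...". If the file system reports an unknown
d_type, is_dir() may still fall back to a stat call for that entry.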
If anyone wants to test what speeds they're getting on Linux or
Windows, or critique my proof of concept, please try it at
https://github.com/benhoyt/betterwalk -- just run "python benchmark.py
[directory]" on a large directory. Note this is only a proof of
concept at this stage, not hardened code!
> So, a "nostat" option is a potential performance improvement, but switching to
> ftw/nftw/fts, with or without the nostat flag, doesn't seem to be worth it.
Agreed. Also, this is beyond the scope of my initial suggestion.