[Python-ideas] Speed up os.walk() 5x to 9x by using file attributes from FindFirst/NextFile() and readdir()
MRAB
python at mrabarnett.plus.com
Mon Nov 12 18:43:15 CET 2012
On 2012-11-12 09:17, Ben Hoyt wrote:
> It seems many folks think that an os.iterdir() is a good idea, and
> some that agree that something like os.iterdir_stat() for efficient
> directory traversal + stat combination is a good idea. And if we get a
> faster os.walk() for free, that's great too. :-)
>
> Nick Coghlan mentioned his walkdir and Antoine's pathlib. While I
> think these are good third-party libraries, I admit I'm not the
> biggest fan of either of their APIs. HOWEVER, mainly I think that the
> stdlib's os.listdir() and os.walk() aren't going away anytime soon, so
> we might as well make incremental (though significant) improvements to
> them in the meantime.
>
> So I'm going to propose a couple of minimally-invasive changes (API-
> wise), in what I think is order of importance, highest to lowest:
>
> 1) Speeding up os.walk(). I've shown we can easily get a ~5x speedup
> on Windows by not calling stat() on each file. And on Linux/BSD this
> same data is available from readdir()'s dirent, so I presume there'd
> be a similar speedup, though it may not be quite 5x.
>
> 2) I also propose adding os.iterdir(path='.') to do exactly the same
> thing as os.listdir(), but yield the results as it gets them instead
> of returning the whole list at once.
>
> 3) Partly for implementing the more efficient walk(), but also for
> general use, I propose adding os.iterdir_stat() which would be like
> iterdir but yield (filename, stat) tuples. If stat-while-iterating
> isn't available on the system, the stat item would be None. If it is
> available, the stat_result fields that the OS presents would be
> available -- the other fields would be None. In practice,
> iterdir_stat() would call FindFirst/Next on Windows and readdir_r on
> Linux/BSD/Mac OS X, and be implemented in posixmodule.c.
>
> This means that on Linux/BSD/Mac OS X it'd return a stat_result with
> st_mode set but the other fields None, on Windows it'd basically
> return the full stat_result, and on other systems it'd return
> (filename, None).
>
> The usage pattern (and exactly how os.walk would use it) would be as
> follows:
>
>     for filename, st in os.iterdir_stat(path):
>         if st is None or st.st_mode is None:
>             st = os.stat(os.path.join(path, filename))
>         if stat.S_ISDIR(st.st_mode):
>             # handle directory
>         else:
>             # handle file
>
I'm not sure that I like "st is None or st.st_mode is None".
You say that if a stat field is not available, it's None.
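That being the case, if no stat fields are available at all, couldn't
iterdir_stat() still return a stat result whose fields are all None,
instead of returning None for the stat item itself?
That would reduce the check to just "st.st_mode is None".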
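Something like the rough sketch below is what I have in mind. Note that
PartialStat, NO_STAT and split_entries are names made up for illustration,
and iterdir_stat here is only a pure-Python stand-in for the proposed
function (built on os.listdir), not an existing os API:

```python
import os
import stat
from collections import namedtuple

# Hypothetical stand-in for the proposed stat result: any field may be None.
PartialStat = namedtuple("PartialStat", ["st_mode", "st_size", "st_mtime"])

# Returned when the platform supplies no stat info while iterating.
NO_STAT = PartialStat(None, None, None)


def iterdir_stat(path="."):
    """Hypothetical emulation of the proposal: yield (filename, stat) pairs.

    This fallback gets nothing from the OS, so it yields the all-None
    record rather than None itself.
    """
    for filename in os.listdir(path):
        yield filename, NO_STAT


def split_entries(path):
    """Return (dirs, files), needing only the single 'st_mode is None' test."""
    dirs, files = [], []
    for filename, st in iterdir_stat(path):
        if st.st_mode is None:
            # No mode came back from the directory scan; do a real stat().
            st = os.stat(os.path.join(path, filename))
        if stat.S_ISDIR(st.st_mode):
            dirs.append(filename)
        else:
            files.append(filename)
    return sorted(dirs), sorted(files)
```

The caller never has to ask whether st itself is None, because the
"nothing available" case is just a record with every field set to None.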
> I'm very keen on 1). And I think adding 2) and 3) make sense, because
> they're (a) asked for by various folks, (b) fairly simple and self-
> explanatory APIs, and (c) they'll be needed to implement the faster
> os.walk() anyway.
>
> Thoughts? What's the next step? If I come up with a patch against
> posixmodule.c, tests, etc, is this likely to be accepted? I could
> also flesh out my pure-Python proof of concept [1] to do what I'm
> suggesting above and go from there...
>
[snip]