[Python-ideas] Speed up os.walk() 5x to 9x by using file attributes from FindFirst/NextFile() and readdir()

Mon Nov 12 10:17:41 CET 2012

It seems many folks think that an os.iterdir() is a good idea, and
some that agree that something like os.iterdir_stat() for efficient
directory traversal + stat combination is a good idea. And if we get a
faster os.walk() for free, that's great too. :-)

Nick Coughlan mentioned his walkdir and Antoine's pathlib. While I
think these are good third-party libraries, I admit I'm not the
biggest fan of either of their APIs. HOWEVER, mainly I think that the
stdlib's os.listdir() and os.walk() aren't going away anytime soon, so
we might as well make incremental (though significant) improvements to
them in the meantime.

So I'm going to propose a couple of minimally-invasive changes (API-
wise), in what I think is order of importance, highest to lowest:

1) Speeding up os.walk(). I've shown we can easily get a ~5x speedup
on Windows by not calling stat() on each file. And on Linux/BSD this
same data is available from readdir()'s dirent, so I presume there's
be a similar speedup, though it may not be quite 5x.

2) I also propose adding os.iterdir(path='.') to do exactly the same
thing as os.listdir(), but yield the results as it gets them instead
of returning the whole list at once.

3) Partly for implementing the more efficient walk(), but also for
general use, I propose adding os.iterdir_stat() which would be like
iterdir but yield (filename, stat) tuples. If stat-while-iterating
isn't available on the system, the stat item would be None. If it is
available, the stat_result fields that the OS presents would be
available -- the other fields would be None. In practice,
iterdir_stat() would call FindFirst/Next on Windows and readdir_r on
Linux/BSD/Mac OS X, and be implemented in posixmodule.c.

This means that on Linux/BSD/Mac OS X it'd return a stat_result with
st_mode set but the other fields None, on Windows it'd basically
return the full stat_result, and on other systems it'd return
(filename, None).

The usage pattern (and exactly how os.walk would use it) would be as
follows:

    for filename, st in os.iterdir_stat(path):
        if st is None or st.st_mode is None:
            st = os.stat(os.path.join(path, filename))
        if stat.S_ISDIR(st.st_mode):
            # handle directory
        else:
            # handle file

I'm very keen on 1). And I think adding 2) and 3) make sense, because
they're (a) asked for by various folks, (b) fairly simple and self-
explanatory APIs, and (c) they'll be needed to implement the faster
os.walk() anyway.

Thoughts? What's the next step? If I come up with a patch against
posixmodule.c, tests, etc, is this likely to be accepted? I could
also flesh out my pure-Python proof of concept [1] to do what I'm
suggesting above and go from there...

Thanks,
Ben

[1] https://gist.github.com/4044946

On Sat, Nov 10, 2012 at 1:23 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> On Fri, Nov 9, 2012 at 8:29 PM, Ben Hoyt <benhoyt at gmail.com> wrote:
>>
>> Anyway, cutting a long story short -- do folks think 1) is a good idea?
>> What
>> about some of the thoughts in 2)? In either case, what would be the best
>> way
>> to go further on this?
>
>
> It's even worse when you add NFS (and other network filesystems) into the
> mix, so yes, +1 on devising a more efficient API design for bulk stat
> retrieval than the current listdir+explicit-stat approach that can lead to
> an excessive number of network round trips.
>
> It's a complex enough idea that it definitely needs some iteration outside
> the stdlib before it could be added, though.
>
> You could either start exploring this as a new project, or else if you
> wanted to fork my walkdir project on BitBucket I'd be interested in
> reviewing any pull requests you made along those lines - redundant stat
> calls are currently one of the issues with using walkdir for more complex
> tasks. (However you decide to proceed, you'll need to set things up to build
> an extension module, though - walkdir is pure Python at this point).
>
> Another alternative you may want to explore is whether or not Antoine Pitrou
> would be interested in adding such a capability to his pathlib module.
> pathlib already includes stat result caching in Path objects, and thus may
> be able to support a clean API for returning path details with the stat
> results precached.
>
> Cheers,
> Nick.
>
> --
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia