[Python-ideas] Speed up os.walk() 5x to 9x by using file attributes from FindFirst/NextFile() and readdir()

Jim Jewett jimjjewett at gmail.com
Thu Nov 15 00:51:42 CET 2012


On 11/12/12, Ben Hoyt <benhoyt at gmail.com> wrote:

> 1) Speeding up os.walk(). I've shown we can easily get a ~5x speedup
> on Windows by not calling stat() on each file. And on Linux/BSD this
> same data is available from readdir()'s dirent, so I presume there'd
> be a similar speedup, though it may not be quite 5x.

> 2) I also propose adding os.iterdir(path='.') to do exactly the same
> thing as os.listdir(), but yield the results as it gets them instead
> of returning the whole list at once.

I know that two functions may be better than a keyword, but a
combinatorial explosion of functions ... isn't.  Even given that
listdir can't change for backwards compatibility, and given that
iteration might be better for large directories, I'm still not sure an
exact analogue is worth it.
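
For reference, the proposed iterating analogue could look roughly like
the sketch below. It leans on os.scandir(), which was only added in
later Python versions (3.5+), as the lazy source of entries; the
function name iterdir is taken from the proposal and is not an
existing os function.

```python
import os

def iterdir(path="."):
    # Hypothetical helper mirroring the proposed os.iterdir():
    # yield names one at a time instead of building the whole list.
    with os.scandir(path) as entries:
        for entry in entries:
            yield entry.name
```

Unlike os.listdir(), nothing is read until the caller starts
iterating, so a caller that stops early never pays for the rest of a
large directory.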

Could this be somehow combined with your 3rd proposal?  For example,
instead of returning a list of str (or bytes) names, could you return
a generator that would yield some sort of File objects?  (Well,
obviously you *could*, the question is whether that goes too far down
the rabbit hole of what a Path object should have.)  My strawman is an
object such that

(a)  No extra system calls will be made just to fill in data not
available from the dir entry itself.  I wouldn't even promise a name,
though I can't think of a sensible directory listing that doesn't
provide the name.

(b)  Any metadata you do have -- name, fd, results of stat, results of
getxattr ... will be available as an attribute.  That way users of
filesystems that do send back the size or type won't have to call
stat.

(c)  Attributes will default to None, supporting the "if x is None:
x=stat()" pattern for the users who do care about attributes that were
not available quickly.  (If there is an attribute for which "None" is
actually meaningful, the user can use hasattr -- but that is a corner
case, not worth polluting the API for.)
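
A minimal sketch of what (a)-(c) might look like, purely as an
illustration.  The names Entry, iter_entries and size_of are made up
for this example, and the attribute set is an assumption.  os.listdir
stands in for the real directory read, so here only the name gets
filled in and everything else stays None.

```python
import os

class Entry:
    """Hypothetical directory-entry record; every field defaults to None."""
    __slots__ = ("name", "fd", "size", "is_dir")

    def __init__(self, name=None, fd=None, size=None, is_dir=None):
        self.name = name
        self.fd = fd
        self.size = size
        self.is_dir = is_dir

def iter_entries(path="."):
    # Stand-in for a real implementation: os.listdir only supplies names,
    # so no other field is filled in and no per-file system call is made.
    for name in os.listdir(path):
        yield Entry(name=name)

def size_of(dirpath, entry):
    # The "if x is None: x = stat()" pattern from (c): only pay for
    # stat() when the directory listing did not already provide the size.
    size = entry.size
    if size is None:
        size = os.stat(os.path.join(dirpath, entry.name)).st_size
    return size
```

On a filesystem that does hand back the size, a real implementation
would pass size=... when building the Entry, and size_of would then
return without touching stat() at all.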

*Maybe* teach open about these objects, so that it can look for the
name or fd attributes.
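
Roughly what I have in mind, written as a wrapper rather than a change
to open itself.  open_entry is a hypothetical name, and the entry only
needs fd and name attributes; the built-in open already accepts either
a path or an integer file descriptor.

```python
def open_entry(entry, mode="r"):
    # Prefer an already-open file descriptor if the entry carries one;
    # otherwise fall back to the name.  Missing attributes count as None.
    fd = getattr(entry, "fd", None)
    if fd is not None:
        return open(fd, mode)
    return open(entry.name, mode)
```

In this sketch the name has to be a usable path (absolute, or relative
to the current directory), since the entry does not remember which
directory it came from.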

Alternatively, it could return a str (or bytes) subclass that has the
other attributes when they are available.  That seems a bit contrived,
but might be better for backwards compatibility.  (os.walk could
return such objects too, instead of extracting the name.)
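
A sketch of the subclass variant, again only as an illustration.
NameWithAttrs and its two attributes are invented for the example; a
bytes subclass would look the same.

```python
class NameWithAttrs(str):
    """Hypothetical str subclass: behaves as the file name, plus metadata."""

    def __new__(cls, name, size=None, is_dir=None):
        self = super().__new__(cls, name)
        # Attributes default to None when the directory listing
        # did not supply them.
        self.size = size
        self.is_dir = is_dir
        return self
```

Existing code that treats the result as a plain name (comparisons,
os.path.join, dict keys) keeps working, while newer code can check
name.size first.  One caveat: string methods such as upper() or
slicing return a plain str, so the extra attributes do not survive
them.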

-jJ


