[Python-ideas] BetterWalk, a better and faster os.walk() for Python

Ben Hoyt benhoyt at gmail.com
Mon Nov 26 09:14:49 CET 2012


> I'm really happy that someone's looking into this.  I've done some
> similar work for my day job and have some thoughts about the APIs and
> approach.

Thanks!

> extra information is returned/yielded from the call. What I do
> differently is that I (a) always return the d_type value instead of
> falling back to stat'ing the item (b) do not provide the pattern
> argument.

Yeah, returning more stat fields was someone's suggestion on
python-ideas, and (b) was my idea, to let me tap into Windows
wildcard matching. I think both are simple and good.

> I like returning the d_type directly because in the unix style APIs the
> dirent structure doesn't provide the same stuff as the stat result and I
> don't want to trick myself into thinking I have all the information
> available from the readdir call. I also like to have my Python functions
> map pretty closely to the C calls.

Here I disagree. Though I would, being a heavy Windows user. :-) As
somebody else mentioned, on Windows, the API here is nothing like the
FindFirst/Next C calls. In general, I think the stdlib should tend
towards getting more cross-platform, not more Linux-ish. In the case
of my stat fields, it's not any more cross-platform, but at least the
st_mode field is something the stdlib can already handle.
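
To illustrate why st_mode is something the stdlib can already handle,
here is a minimal sketch of turning a raw d_type into the file-type bits
of st_mode, so that the existing stat.S_IS*() helpers work on it. The
DT_* values are the ones from Linux's <dirent.h> and the shift mirrors
glibc's DTTOIF macro; the function name d_type_to_mode is made up for
this example and is not part of BetterWalk.

```python
import stat

# d_type values as defined in Linux's <dirent.h> (an assumption of this
# sketch; other platforms may differ or may not provide d_type at all).
DT_UNKNOWN = 0
DT_DIR = 4
DT_REG = 8
DT_LNK = 10


def d_type_to_mode(d_type):
    """Return the file-type bits of st_mode for a d_type, or None if unknown."""
    if d_type == DT_UNKNOWN:
        return None  # the filesystem did not say; a real stat() is needed
    return d_type << 12  # same shift as glibc's DTTOIF macro


mode = d_type_to_mode(DT_DIR)
print(stat.S_ISDIR(mode))  # True
```

Note that only the type bits are filled in this way; permissions, size
and timestamps still need a real stat call on Unix-like systems.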

> For example there is a potential race condition between calling the
> readdir and the stat, like if the object is removed between calls. I can
> be very granular (for lack of a better word) about my error handling in
> these cases.

That's a good point. I'm not sure it'd be a big deal in practice. But
it's worth thinking about. Perhaps the os.stat() call should catch
OSError and return None for all fields if it failed. But maybe that's
suppressing too much. Or maybe it could be an option
(stat_errors=True).
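
A rough sketch of what such an option could look like, using only
os.listdir() and os.lstat(). The helper name listdir_stat and the exact
meaning of the flag are assumptions for illustration: here
stat_errors=True lets the OSError propagate, and stat_errors=False
swallows it and yields None in place of the stat result.

```python
import os


def listdir_stat(path, stat_errors=True):
    """Yield (name, stat_result_or_None) for each entry in path.

    Hypothetical helper: with stat_errors=True an OSError raised by
    os.lstat() propagates to the caller; with stat_errors=False it is
    suppressed and None is yielded instead of a stat result.
    """
    for name in os.listdir(path):
        try:
            st = os.lstat(os.path.join(path, name))
        except OSError:
            if stat_errors:
                raise
            st = None  # entry vanished or became unreadable after listing
        yield name, st


for name, st in listdir_stat(os.curdir, stat_errors=False):
    print(name, "gone" if st is None else st.st_mode)
```

Yielding None keeps the entry visible to the caller, who can then decide
whether a vanished file matters, rather than aborting the whole listing.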

> Because I come at this from a Linux platform I am also not so keen on
> the built-in pattern matching that comes for "free" from the
> Windows FindFirst/FindNext API. It just feels like this should
> be provided at a higher layer. But I can't say for sure because I don't
> know how much of a performance boost this is on Windows.

I don't know about the performance boost here either. I suspect it's
significant only in certain cases (when you're matching a small
fraction of the files in a large directory), but I should run some
performance tests.
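
For comparison, the "higher layer" alternative can be sketched with the
stdlib's fnmatch module: list everything, then filter in Python. The
function names below are made up for this example. This is the baseline
any OS-level wildcard matching would have to beat, since every entry is
still read from the directory before the pattern is applied.

```python
import fnmatch
import os


def match_names(names, pattern="*"):
    """Return the names that match a shell-style wildcard pattern."""
    return fnmatch.filter(names, pattern)


def listdir_matching(path, pattern="*"):
    """List path, then filter the names in Python (hypothetical helper)."""
    # os.listdir() reads every entry; the filtering only happens afterwards.
    return match_names(os.listdir(path), pattern)


print(match_names(["setup.py", "README.txt", "walk.py"], "*.py"))
# ['setup.py', 'walk.py']
```

Timing listdir_matching() against a pattern-aware listing on a large
directory with few matches would be one way to measure the difference.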

-Ben


