[Python-ideas] BetterWalk, a better and faster os.walk() for Python

Fri Nov 23 17:08:03 CET 2012

Hi, I've been idly watching python-ideas and this thread piqued my 
interest, so I'm unlurking for the first time.

I'm really happy that someone's looking into this.  I've done some 
similar work for my day job and have some thoughts about the APIs and 
approach. 

I come at this from a C & Linux POV and wrote some wrappers similar to
your iterdir_stat. What I do similarly is to provide a "flags" field
like your "fields" argument (in my case a bitmask) that controls what
extra information is returned/yielded from the call. What I do
differently is that I (a) always return the d_type value instead of
falling back to stat'ing the item, and (b) do not provide the pattern
argument.
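
To make the flags idea concrete, here is a minimal sketch. It is not
the fsnix API: the names LD_NAME, LD_TYPE and listdir_flags are made up
for illustration, and os.scandir (Python 3.5+) stands in for a direct
readdir wrapper. Unlike the approach described above, os.scandir may
fall back to a stat call when the dirent type is unknown.

```python
import os

# Hypothetical flag bits; the names are invented for this sketch.
LD_NAME = 0x1   # yield the entry name
LD_TYPE = 0x2   # also yield a coarse type label for the entry


def listdir_flags(path, flags=LD_NAME):
    """Yield tuples whose fields are selected by the `flags` bitmask."""
    with os.scandir(path) as entries:
        for entry in entries:
            row = []
            if flags & LD_NAME:
                row.append(entry.name)
            if flags & LD_TYPE:
                # These checks normally use the type cached from the
                # directory read; a stat happens only if it is unknown.
                if entry.is_symlink():
                    row.append("link")
                elif entry.is_dir(follow_symlinks=False):
                    row.append("dir")
                elif entry.is_file(follow_symlinks=False):
                    row.append("file")
                else:
                    row.append("other")
            yield tuple(row)
```

Calling listdir_flags(path, LD_NAME | LD_TYPE) yields (name, type)
pairs, while the default yields one-element tuples holding only the
name.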

I like returning the d_type directly because in the Unix-style APIs the
dirent structure doesn't provide the same information as the stat
result, and I don't want to trick myself into thinking I have all the
information available from the readdir call. I also like to have my
Python functions map pretty closely to the C calls. That way I know
that my Python is only issuing the same syscalls that the equivalent C
code would. In addition, I like the control over error handling that
calling stat as a separate call gives me. For example, there is a
potential race condition between the readdir and the stat calls, such
as when the object is removed in between. I can be very granular (for
lack of a better word) about my error handling in these cases.
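
A minimal sketch of that kind of granular handling, using only the
standard library rather than fsnix. The function name list_with_lstat
is invented for illustration. It treats an entry that vanished between
the directory read and the lstat call as non-fatal and lets every
other error propagate.

```python
import errno
import os


def list_with_lstat(path):
    """Return (results, vanished).

    `results` maps names to lstat results; `vanished` lists names that
    disappeared between the directory read and the lstat call.
    """
    results = {}
    vanished = []
    for name in os.listdir(path):
        try:
            results[name] = os.lstat(os.path.join(path, name))
        except OSError as exc:
            if exc.errno == errno.ENOENT:
                # Removed after the directory read returned it.
                vanished.append(name)
            else:
                # Permission problems, I/O errors, etc. still propagate.
                raise
    return results, vanished
```

Because the stat is a separate step, the caller decides per error code
what counts as fatal, which a combined call cannot offer as easily.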

Because I come at this from a Linux platform I am also not so keen on
the built-in pattern matching that comes for "free" from the Windows
FindFirst/FindNext API. It just feels like this should be provided at a
higher layer. But I can't say for sure because I don't know how much of
a performance boost this is on Windows.
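
As a sketch of what "a higher layer" could look like, the filter below
wraps any iterable of names with the standard fnmatch module. The
helper name iter_matching is hypothetical, and this approach does not
get whatever speedup the native Windows matching may offer.

```python
import fnmatch


def iter_matching(names, pattern):
    """Lazily filter an iterable of names with a shell-style pattern."""
    return (name for name in names if fnmatch.fnmatch(name, pattern))
```

For example, iter_matching(os.listdir("."), "*.py") filters a plain
listing, and the same helper works over any other source of names.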

I have a confession to make: I don't often use an os.walk equivalent 
when I use my library. I often call the listdir equivalents directly. So 
I've never benchmarked any os.walk equivalent even though I wrote one 
for fun!

In addition I have a fditerdir call that supports a directory file 
descriptor as the first argument. This is handy because I also have a 
wrapper for fstatat (this was all created for Python 2 and before 3.3 
was released).
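
For comparison, Python 3.3+ exposes the same pattern in the standard
library on platforms that support it: os.listdir accepts an open
directory descriptor, and os.stat takes a dir_fd argument. The sketch
below uses those calls, not fsnix, and the function name
list_dir_by_fd is invented for illustration.

```python
import os


def list_dir_by_fd(path):
    """Return {name: stat_result} using one open directory descriptor."""
    dirfd = os.open(path, os.O_RDONLY)
    try:
        result = {}
        for name in os.listdir(dirfd):
            # With dir_fd set, `name` is resolved relative to the open
            # directory rather than the current working directory.
            result[name] = os.stat(name, dir_fd=dirfd,
                                   follow_symlinks=False)
        return result
    finally:
        os.close(dirfd)
```

Availability can be checked at runtime with
`os.listdir in os.supports_fd` and `os.stat in os.supports_dir_fd`.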

I really like that your library can return more fields from the
direntry; I only support the d_type field at this time and have been
meaning to extend the API. I can only yield tuples at the moment, but a
namedtuple style would be much nicer. IMO, the ideal value would be
some sort of abstract direntry structure that could be filled in with
the values that readdir or FindFirst provide, with a higher-level
function on top that combines iterdir + stat if you get DT_UNKNOWN.  In
other words, provide an easy call like iterdir_stat that builds on an
iterdir that gets the detailed dentry data.
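
A rough sketch of that layering, under stated assumptions: the DirEntry
record and its field names are hypothetical, the DT_UNKNOWN value is
the one used by Linux and the BSDs, and the low-level iterdir is
represented by any iterable of DirEntry records.

```python
import collections
import os
import stat

DT_UNKNOWN = 0  # value from <dirent.h> on Linux and the BSDs

# Hypothetical record; fields a platform cannot supply stay None.
DirEntry = collections.namedtuple("DirEntry", "name d_type st")


def iterdir_stat(path, entries):
    """Fill in d_type via lstat only for DT_UNKNOWN entries.

    `entries` is any iterable of DirEntry, e.g. from a readdir wrapper.
    """
    for entry in entries:
        if entry.d_type == DT_UNKNOWN:
            st = os.lstat(os.path.join(path, entry.name))
            # Same shift as the C IFTODT macro: file-type bits to d_type.
            d_type = stat.S_IFMT(st.st_mode) >> 12
            entry = entry._replace(d_type=d_type, st=st)
        yield entry
```

Entries that already carry a type pass through without any extra
system call, so the stat cost is paid only where the directory read
could not answer.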

PS. If anyone is curious, my library is available here:
https://bitbucket.org/nasuni/fsnix 

Thanks!
-- John M.

On Friday, November 23, 2012 12:39:42 AM Ben Hoyt wrote:
> In the recent thread I started called "Speed up os.walk()..." [1] I
> was encouraged to create a module to flesh out the idea, so I present
> you with BetterWalk:
> 
> https://github.com/benhoyt/betterwalk#readme
> 
> It's basically all there, and works on Windows, Linux, and Mac OS X.
> It probably works on FreeBSD too, but I haven't tested that. I also
> haven't written thorough unit tests yet, but intend to after some
> further feedback.
> 
> In terms of the API for iterdir_stat(), I settled on the more explicit
> "pass in what stat fields you want" (the 'fields' parameter). I also
> added a 'pattern' parameter to allow you to make use of the wildcard
> matching that FindFirst/FindNext provide (it's useful for globbing on
> POSIX too, but not a performance improvement).
> 
> As for benchmarks, it's about what I saw earlier on Windows (2-6x on
> recent versions, depending). My initial tests on Mac OS X show it's
> 5-10x as fast on that platform! I haven't double-checked those results
> yet though.
> 
> The results on Linux were somewhat disappointing -- only a 10% speed
> improvement on large directories, and it's actually slower on small
> directories. It's still doing half the number of system calls ... so I
> believe this is because cached os.stat() is super fast on Linux, and
> so the slowdown from using ctypes / pure Python is outweighing the
> gain from not doing the system call. That said, I've also only tested
> Linux in a VirtualBox setup, so maybe that's affecting it too.
> 
> Still, if it's a significant win for Windows and OS X users, it's a
> good thing.
> 
> In any case, I'd love it if folks could run the benchmark on their
> system (with and without -s) and comment further on the idea and API.
> 
> Thanks,
> Ben.
> 
> [1] http://mail.python.org/pipermail/python-ideas/2012-November/017770.html