[Python-ideas] BetterWalk, a better and faster os.walk() for Python
John Mulligan
phlogistonjohn at asynchrono.us
Fri Nov 23 17:08:03 CET 2012
Hi, I've been idly watching python-ideas and this thread piqued my
interest, so I'm unlurking for the first time.
I'm really happy that someone's looking into this. I've done some
similar work for my day job and have some thoughts about the APIs and
approach.
I come at this from a C & Linux POV and wrote some similar wrappers to
your iterdir_stat. What I do similarly is to provide a "flags" field
like your "fields" argument (in my case a bitmask) that controls what
extra information is returned/yielded from the call. What I do
differently is that I (a) always return the d_type value instead of
falling back to stat'ing the item, and (b) do not provide the pattern
argument.
I like returning the d_type directly because in the unix style APIs the
dirent structure doesn't provide the same stuff as the stat result and I
don't want to trick myself into thinking I have all the information
available from the readdir call. I also like to have my Python functions
map pretty closely to the C calls. I know that my Python is only issuing
the same syscalls that the equivalent C code would. In addition, I like
control over error handling that calling stat as a separate call gives
me. For example there is a potential race condition between calling the
readdir and the stat, like if the object is removed between calls. I can
be very granular (for lack of a better word) about my error handling in
these cases.
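To illustrate the race described above, here is a minimal sketch in modern Python 3. It is not code from my library; the function name list_with_types is made up for this example, and os.listdir stands in for the readdir step.

```python
import os
import stat

def list_with_types(dirpath):
    """Yield (name, kind) pairs for the entries of dirpath.

    kind is 'dir', 'file' or 'other', or None when the entry vanished
    between the directory listing and the separate stat call.
    """
    for name in os.listdir(dirpath):  # the "readdir" step
        try:
            # The separate stat step; the entry may be gone by now.
            st = os.lstat(os.path.join(dirpath, name))
        except FileNotFoundError:
            yield name, None
            continue
        if stat.S_ISDIR(st.st_mode):
            kind = "dir"
        elif stat.S_ISREG(st.st_mode):
            kind = "file"
        else:
            kind = "other"
        yield name, kind
```

Because the stat is an explicit call, the caller decides what a vanished entry means (skip it, report it, retry) instead of having that choice hidden inside the listing function.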
Because I come at this from a Linux platform I am also not so keen on
the built-in pattern matching that comes for "free" from the Windows
FindFirst/FindNext API. It just feels like this should
be provided at a higher layer. But I can't say for sure because I don't
know how much of a performance boost this is on Windows.
I have a confession to make: I don't often use an os.walk equivalent
when I use my library. I often call the listdir equivalents directly. So
I've never benchmarked any os.walk equivalent even though I wrote one
for fun!
In addition I have a fditerdir call that supports a directory file
descriptor as the first argument. This is handy because I also have a
wrapper for fstatat (this was all created for Python 2 and before 3.3
was released).
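For comparison, the same idea can be sketched with what Python 3.3+ ships in the os module. This is not my fditerdir; fd_list_dirs is a hypothetical name, and the sketch assumes a platform where os.listdir accepts a file descriptor and os.stat accepts dir_fd (check os.supports_fd and os.supports_dir_fd).

```python
import os
import stat

def fd_list_dirs(dirfd):
    """Return the sorted names of the subdirectories of the directory
    that is open on the file descriptor dirfd."""
    names = []
    for name in os.listdir(dirfd):  # list entries via the descriptor
        try:
            # fstatat-style call: name is resolved relative to dirfd.
            st = os.stat(name, dir_fd=dirfd, follow_symlinks=False)
        except FileNotFoundError:
            continue  # entry was removed after it was listed
        if stat.S_ISDIR(st.st_mode):
            names.append(name)
    return sorted(names)
```

A caller would open the directory with os.open(path, os.O_RDONLY), pass the descriptor in, and close it afterwards with os.close.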
I really like how your library is better in that you can get more fields
from the direntry; I only support the d_type field at this time and have
been meaning to extend the API. I can only yield tuples at the moment
but a namedtuple style would be much nicer. IMO the ideal return value
would be some sort of abstract direntry structure that could be filled
in with the values that readdir or FindFirst provide, possibly with
a higher level function that combines iterdir + stat if you get
DT_UNKNOWN. In other words, provide an easy call like iterdir_stat that
builds on an iterdir that gets the detailed dentry data.
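A rough sketch of that layering, under loud assumptions: the DirEntry namedtuple and both function bodies are illustrative only, the DT_* numbers are the Linux <dirent.h> values, and the low-level iterdir here is a stand-in that always reports DT_UNKNOWN (as some filesystems do) rather than a real readdir wrapper.

```python
import os
import stat
from collections import namedtuple

# d_type values as defined in Linux <dirent.h> (assumed for this sketch).
DT_UNKNOWN, DT_DIR, DT_REG, DT_LNK = 0, 4, 8, 10

# Hypothetical abstract entry; a real one could carry more fields.
DirEntry = namedtuple("DirEntry", ["name", "d_type"])

def iterdir(dirpath):
    """Stand-in for a readdir wrapper: it only knows names, so every
    entry is reported with d_type == DT_UNKNOWN."""
    for name in os.listdir(dirpath):
        yield DirEntry(name, DT_UNKNOWN)

def iterdir_stat(dirpath, entries=None):
    """Yield entries, filling in d_type with lstat only when needed."""
    for entry in (iterdir(dirpath) if entries is None else entries):
        if entry.d_type == DT_UNKNOWN:
            try:
                mode = os.lstat(os.path.join(dirpath, entry.name)).st_mode
            except FileNotFoundError:
                yield entry  # vanished; leave it as DT_UNKNOWN
                continue
            # Same arithmetic as glibc's IFTODT macro: file type bits >> 12.
            entry = entry._replace(d_type=stat.S_IFMT(mode) >> 12)
        yield entry
```

Entries that already arrive with a known d_type pass through untouched, so the extra stat cost is paid only where the directory read could not answer.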
PS.
If anyone is curious my library is available here:
https://bitbucket.org/nasuni/fsnix
Thanks!
-- John M.
On Friday, November 23, 2012 12:39:42 AM Ben Hoyt wrote:
> In the recent thread I started called "Speed up os.walk()..." [1] I
> was encouraged to create a module to flesh out the idea, so I present
> you with BetterWalk:
>
> https://github.com/benhoyt/betterwalk#readme
>
> It's basically all there, and works on Windows, Linux, and Mac OS X.
> It probably works on FreeBSD too, but I haven't tested that. I also
> haven't written thorough unit tests yet, but intend to after some
> further feedback.
>
> In terms of the API for iterdir_stat(), I settled on the more explicit
> "pass in what stat fields you want" (the 'fields' parameter). I also
> added a 'pattern' parameter to allow you to make use of the wildcard
> matching that FindFirst/FindNext provide (it's useful for globbing on
> POSIX too, but not a performance improvement).
>
> As for benchmarks, it's about what I saw earlier on Windows (2-6x on
> recent versions, depending). My initial tests on Mac OS X show it's
> 5-10x as fast on that platform! I haven't double-checked those results
> yet though.
>
> The results on Linux were somewhat disappointing -- only a 10% speed
> improvement on large directories, and it's actually slower on small
> directories. It's still doing half the number of system calls ... so I
> believe this is because cached os.stat() is super fast on Linux, and
> so the slowdown from using ctypes / pure Python is outweighing the
> gain from not doing the system call. That said, I've also only tested
> Linux in a VirtualBox setup, so maybe that's affecting it too.
>
> Still, if it's a significant win for Windows and OS X users, it's a
> good thing.
>
> In any case, I'd love it if folks could run the benchmark on their
> system (with and without -s) and comment further on the idea and API.
>
> Thanks,
> Ben.
>
> [1]
> http://mail.python.org/pipermail/python-ideas/2012-November/017770.html