[Patches] [ python-Patches-994057 ] faster os.walk

SourceForge.net noreply at sourceforge.net
Thu Jul 22 00:35:18 CEST 2004


Patches item #994057, was opened at 2004-07-19 20:04
Message generated for change (Comment added) made by felixlee
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=994057&group_id=5470

Category: Library (Lib)
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Felix Lee (felixlee)
Assigned to: Nobody/Anonymous (nobody)
Summary: faster os.walk

Initial Comment:
On Unix-like filesystems, you can get a substantial
speedup for directory traversal by looking at a
directory's st_nlink count to find out how many
subdirectories it contains: a directory has one link for
".", one from its entry in the parent, and one for each
subdirectory's "..", so st_nlink - 2 is the number of
subdirectories.  Once you've seen that many subdirs, you
know the rest of the entries have to be non-directory
files, so you don't have to stat() them.  This is often a
big win because most people put most of their files in
leaf directories.
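
Roughly, the idea looks like this (a simplified sketch
rather than the attached file; the bare nlink >= 2 check
stands in for the heuristic discussed in problem 1 below):

    import os
    import stat

    def walk_nlink(top):
        try:
            names = os.listdir(top)
        except OSError:
            return
        # one link for ".", one from the parent's entry, and
        # one per subdirectory's "..", so st_nlink - 2 is the
        # number of subdirectories
        nlink = os.lstat(top).st_nlink
        if nlink >= 2:
            subdirs_left = nlink - 2
        else:
            subdirs_left = len(names)  # odd fs: stat everything
        dirs, nondirs = [], []
        for name in names:
            if subdirs_left > 0:
                # lstat, not stat, so a symlink to a directory
                # counts as a file (see problem 2 below)
                st = os.lstat(os.path.join(top, name))
                if stat.S_ISDIR(st.st_mode):
                    dirs.append(name)
                    subdirs_left -= 1
                    continue
            # once subdirs_left hits 0, the rest can't be
            # directories, so they need no stat call at all
            nondirs.append(name)
        yield top, dirs, nondirs
        for name in dirs:
            for result in walk_nlink(os.path.join(top, name)):
                yield result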

For my home directory, using the link count cuts the
traversal from 100,000 stat calls to 20,000, which reduces
execution time by about 70% when all the filesystem data
is already in RAM.  I expect the speedup is larger when
the data has to be read from disk, but I don't have an
easy way of measuring that.

The GNU "find" program does this optimization, and so does
Perl's File::Find.

The attached file is a proof-of-concept implementation.
If this seems like a reasonable thing to do, I can rewrite
it as a patch with appropriate documentation.

There are a few problems with it.

1. It needs to be tested on OSes I don't have.  This uses
a simple heuristic to decide whether the link count can be
used or not, so it shouldn't be necessary to add
OS-specific tests, but I don't know how perverse other
OSes are.

2. To do the nlink optimization, links to directories have
to be considered files rather than directories (the first
snippet below, after problem 3, makes the difference
concrete).  This is different from the existing behavior,
where links to directories are returned as directories but
never entered.  I don't think the existing behavior is
useful, but it seems like a bad idea to change it, since
there may be things that depend on it.

The simple way of handling this is to make use_nlink=0 the
default, so the speedup and the changed behavior happen
only when someone specifically requests them.  I don't
really like this, because if the heuristic is reliable,
it's rare that use_nlink=0 is desirable.

Another way of handling this is to make a new function
instead of modifying os.walk.  A new function could also
have the option of following links without getting stuck
in loops.

3. The faster walk is still noticeably slower than
list(popen("find")); the second snippet below shows that
baseline.  (Perl has this problem too.)  There isn't
really much to be done about that; it's mostly interpreter
overhead.  This speed difference probably isn't
significant in most cases, and cross-platform portability
is a good reason for using walk instead of popen('find').
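
To make the classification difference from problem 2
concrete (a stand-alone illustration; the directory names
are arbitrary):

    import os
    import stat
    import tempfile

    d = tempfile.mkdtemp()
    real = os.path.join(d, "real")
    alias = os.path.join(d, "alias")
    os.mkdir(real)
    os.symlink(real, alias)

    # the existing os.walk classifies entries with
    # os.path.isdir(), which follows symlinks, so "alias"
    # lands in dirnames (though walk never enters it)
    print(os.path.isdir(alias))                   # True
    # the nlink version has to classify with lstat(), which
    # doesn't follow links, so "alias" would land in
    # filenames instead
    print(stat.S_ISDIR(os.lstat(alias).st_mode))  # False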
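
And the popen baseline from problem 3 is something like
this (the starting path is arbitrary):

    import os

    # let find(1) do the traversal in C and just read its
    # output; fast, but Unix-only, which is the portability
    # point above
    paths = [line.rstrip("\n") for line in os.popen("find .")]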

----------------------------------------------------------------------

>Comment By: Felix Lee (felixlee)
Date: 2004-07-21 22:35

Message:
Logged In: YES 
user_id=77867

Thinking about it some more, a separate function is
probably a good idea.  At the moment I'm calling it
'lwalk' (by analogy with stat/lstat).  It should be usable
on systems without links too.  I'll attach it in a day or
two.

----------------------------------------------------------------------

Comment By: Felix Lee (felixlee)
Date: 2004-07-21 21:37

Message:
Logged In: YES 
user_id=77867

The chdir() optimization seems to be mostly insignificant
on modern systems (I'm doing timings on Linux 2.6).  Perl
does chdir too, and it's only slightly faster than Python
when nlinks gets used; the amount of system time used is
pretty much indistinguishable among all the versions.

I suspect RAM is cheap enough that large dir->inode caches
are easy and fast.  The remaining difference in speed
between GNU find and a Perl or Python version is in user
time, and I think most of that can be attributed to
interpreter overhead.  Crossing a function-call boundary
100,000 times is hard to make fast in any interpreted
language.

The nlink optimization is significant, and I think it's
worth doing, but I don't think any other optimization is.

----------------------------------------------------------------------

Comment By: Jeff Epler (jepler)
Date: 2004-07-21 15:52

Message:
Logged In: YES 
user_id=2772

GNU find also uses chdir() to descend into the directory
hierarchy, which means that all its stat calls use
filenames relative to the current directory.  Clearly,
Python's os.walk can't do this, because the working
directory is shared among threads; disaster would result.
But anyway, this is another reason that os.walk is slower
than find.
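
For concreteness, the pattern find uses looks roughly like
this (a single-threaded sketch only, not something os.walk
could adopt):

    import os
    import stat

    def walk_chdir(top):
        # find(1)'s trick: chdir() into each directory so
        # every lstat() gets a short relative name and the
        # kernel skips most of the path resolution; the cwd
        # is process-global, hence off-limits to a threaded
        # os.walk
        os.chdir(top)
        here = os.getcwd()
        dirs, nondirs = [], []
        for name in os.listdir("."):
            if stat.S_ISDIR(os.lstat(name).st_mode):
                dirs.append(name)
            else:
                nondirs.append(name)
        yield here, dirs, nondirs
        for name in dirs:
            for result in walk_chdir(name):
                yield result
            os.chdir(here)  # recursion moved the cwd; go back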

----------------------------------------------------------------------

Comment By: Michael Chermside (mcherm)
Date: 2004-07-21 12:05

Message:
Logged In: YES 
user_id=99874

Okay, I like the idea (haven't read the patch yet).  Your
problem 1 will just take time, and problem 3 isn't really
a problem.  But problem 2 is a real issue.  I think having
a new function is not worth it (or rather, I'd say it's
worth having, just not in the standard library).  I think
a hardly-ever-used "use_nlink" option is a non-starter.
So despite liking the idea, I'm stimied^H^H stymied^H^H
stimyed^H^H confounded.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=994057&group_id=5470


More information about the Patches mailing list