[Patches] [ python-Patches-994057 ] faster os.walk

Wed Jul 21 14:05:06 CEST 2004

Patches item #994057, was opened at 2004-07-19 16:04
Message generated for change (Comment added) made by mcherm
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=994057&group_id=5470

Category: Library (Lib)
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Felix Lee (felixlee)
Assigned to: Nobody/Anonymous (nobody)
Summary: faster os.walk

Initial Comment:
On Unix-like filesystems, you can get a substantial
speedup for directory traversal if you look at a
directory's st_nlink count to find out how many
subdirectories are in a directory.  Once you've seen
that many subdirs, you know the rest of the entries
have to be nondirectory files, so you don't have to
stat() them.  this is often a big win because most
people put most of their files in leaf directories.

For my home directory, using the link count reduces it
from 100,000 stat calls to 20,000 stat calls, which
cuts execution time about 70% when all the filesystem
data is already in RAM.  I think the speedup is more
when the data has to be read from disk, but I don't
have an easy way of measuring that.

the GNU "find" program does this optimization.  so does
Perl's File::Find.

the attached file is a proof of concept implementation.
 if this seems like a reasonable thing to do, I can
rewrite it as a patch with appropriate documentation.

there are a few problems with it.

1. it needs to be tested on OSes I don't have.  this
uses a simple heuristic to decide whether the link
count can be used or not, so it shouldn't be necessary
to add OS-specific tests, but I don't know how perverse
other OSes are.

2. to do the nlink optimization, links to directories
have to be considered files rather than directories. 
this is different from existing behavior, where links
to directories are returned as directories but never
entered.  I don't think the existing behavior is
useful, but it seems like a bad idea to change it,
since there may be things that depend on it.

the simple way of handling this is to make use_nlink=0
the default, so the speedup and the changed behavior
happen only when someone specifically requests it.  I
don't really like this, because I think it's rare that
use_nlink=0 is desirable if the heuristic is reliable.

another way of handling this is to make a new function
instead of modifying os.walk.  a new function could
also have the option of following links without getting
stuck in loops.

3. the faster walk is still noticeably slower than
list(popen("find")).  (Perl has this problem too.) 
there isn't really much to be done about that, it's
mostly just interpreter overhead.  this speed
difference probably isn't significant in most cases,
and cross-platform portability is a good reason for
using walk instead of popen('find').

----------------------------------------------------------------------

>Comment By: Michael Chermside (mcherm)
Date: 2004-07-21 08:05

Message:
Logged In: YES 
user_id=99874

Okay, I like the idea (haven't read the patch yet). Your
problem 1 will just take time, and problem 3 isn't really a
problem. But problem 2 is a real issue. I think having a new
function is not worth it (or rather, I'd say, it's worth
having, just not in the standard library). I think having a
hardly-ever-used option "use_nlinik" is a non-starter. So
despite liking the idea, I'm stimied^H^H stymied^H^H
stimyed^H^H confounded.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=994057&group_id=5470