[Python-Dev] os.path.walk() lacks 'depth first' option
Mon, 21 Apr 2003 15:15:54 -0400
This is a multi-part message in MIME format.
Content-type: text/plain; charset=iso-8859-1
>>> But if I had to do it over again, I wouldn't have added walk() in the
>>> current form.
>> I think it's the perfect place for a generator.
> Absolutely! So let's try to write something new based on generators,
> make it flexible enough so that it can handle pre-order or post-order
> visits, and then phase out os.walk().
I posted one last night, with a bug (it failed to pass the topdown flag
through to recursive calls).
Here's that again, with the bug repaired, sped up some, and with a
docstring. Double duty: the example in the docstring shows why we don't
want to make a special case out of sum(): empty lists can arise
What else would people like in this? I really like separating the directory
names from the plain-file names, so don't bother griping about that <wink>.
It's at least as fast as the current os.path.walk() (it's generally faster
for me, but times for this are extremely variable on Win98). Removing the
internal recursion doesn't appear to make a measureable difference when
walking my Python tree, although because recursive generators require time
proportional to the current stack depth to deliver a result to the caller,
and to resume again, removing recursion could be much more efficient on an
extremely deep tree. The biggest speedup I could find on Windows was via
using os.chdir() liberally, so that os.path.join() calls weren't needed, and
os.path.isdir() calls worked directly on one-component names. I suspect
this has to do with that Win98 doesn't have an effective way to cache
directory lookups under the covers. Even so, it only amounted to a 10%
speedup: directory walking is plain slow on Win98 no matter how you do it.
The attached doesn't play any gross speed tricks.
Content-type: text/plain; name=walk.py
Content-disposition: attachment; filename=walk.py
def walk(top, topdown=True):
"""Directory tree generator.
For each directory in the directory tree rooted at top (including top
itself, but excluding '.' and '..'), yields a 3-tuple
dirpath, dirnames, filenames
dirpath is a string, the path to the directory. dirnames is a list of
the names of the subdirectories in dirpath (excluding '.' and '..').
filenames is a list of the names of the non-directory files in dirpath.
If optional arg 'topdown' is true or not specified, the triple for a
directory is generated before the triples for any of its subdirectories
(directories are generated top down). If topdown is false, the triple
for a directory is generated after the triples for all of its
subdirectories (directories are generated bottom up).
When topdown is true, the caller can modify the dirnames list in-place
(e.g. via del or slice assignment), and walk will only recurse into the
subdirectories whose names remain in dirnames; this can be used to prune
the search, or to impose a specific order of visiting. Modifying
dirnames when topdown is false is ineffective, since the directories in
dirnames have already been generated by the time dirnames itself is
Caution: if you pass a relative pathname for top, don't change the
current working directory between resumptions of walk.
from os.path import join, getsize
for root, dirs, files in walk('python/Lib/email'):
print root, "consumes",
print sum([getsize(join(root, name)) for name in files]),
print "bytes in", len(files), "non-directory files"
if 'CVS' in dirs:
dirs.remove('CVS') # don't visit CVS directories
from os.path import join, isdir
names = os.listdir(top)
exceptions = ('.', '..')
dirs, nondirs = , 
for name in names:
if name not in exceptions:
if isdir(join(top, name)):
yield top, dirs, nondirs
for name in dirs:
for x in walk(join(top, name), topdown):
if not topdown:
yield top, dirs, nondirs