[Python-Dev] os.path.walk() lacks 'depth first' option

Mon, 21 Apr 2003 19:33:03 -0400

[Noah Spurrier]
> I like your version; although I used a different name to avoid
> confusion with os.path.walk.

Who's confused <wink>?  I agree it needs some other name if something like
this gets checked in.

> Note that os.listdir does not include the special entries '.' and '..'
> even if they are present in the directory, so there is no need
> to remove them.

Oops -- that's right!  This is a code divergence problem.  There's more than
one implementation of os.path.walk in the core, and the version in ntpath.py
(which I started from) still special-cases '.' and '..'.  I don't think it
needs to.

> Tim Peters>    for path in walk(root):
> Tim Peters>Or it could look like
> Tim Peters>    for top, names in walk(root):
> Tim Peters>or
> Tim Peters>    for top, dirnames, nondirnames in walk(root):
>
> I like the idea of yielding (top, dirs, nondirs), but
> often I want to perform the same operations on both
> dirs and nondirs so separating them doesn't help that case.

I think (a) that's unusual, and (b) it doesn't hurt that case either.  You
can do, e.g.,

    for root, dirs, files in walk(...):
        for name in dirs + files:

to squash them together again.

> This seems to be a situation where there is no typical case,
> so my preference is for the simpler interface.
> It also eliminates the need to build two new lists from
> the list you get from os.listdir()...

Sorry, I'm unmovable on this point.  My typical uses for this function do
have to separates dirs from non-dirs, walk() has to make the distinction
*anyway* (for its internal use), and it's expensive for the client to do the
join() and isdir() bits all over again (isdir() is a filesystem op, and at
least on my box repeated isdir() is overwhelmingly more costly than
partitioning or joining a Python list).

> In fact, I prefer your first suggestion (for path in walk(root)), but
> that would require building a new list by prepending the
> basepath to each element of children because os.listdir does not
> return full path.

What about that worries you?  I don't like it because I have some
directories with many thousands of files, and stuffing a long redundant path
at the start of each is wasteful in the abstract.  I'm not sure it really
matters, though -- e.g., 10K files in a directory * 200 redundant chars each
= a measly 2 megabytes wasted <wink>.

> So finally in this example, I just went with returning the basepath
> and the children (both files and directories).
>
> Following Tom Good's example I added an option to
> ignore symbolic links.

Not all Python platforms have symlinks, of course.  The traditional answer
to this one was that if a user wanted to avoid chasing those on a platform
that supports them, they should prune the symlink names out of the fnames
list passed to walk's func callback.  The same kind of trick is still
available in the generator version, although it was and remains painful.
Separating the dirs from the non-dirs for the caller at least reduces the
expense of it.

> It would be better to detect cycles or at least prevent going
> higher up in the tree.
> ...
> This example still allows you to prune the search
> in Breadth first mode by removing elements from
> the children list. That is cool.
>      for top, children in walk('C:/code/python', depthfirst=False):
>          print top, children
>          if 'CVS' in children:
>              children.remove('CVS')

I'm finding you too hard to follow here, becuase your use of "depthfirst"
and "breadthfirst" doesn't follow normal usage of the terms.  Here's normal
usage:  consider this tree (A's kids are B, C, D; B's kids are E, F; C's are
G, H, I; D's are J, K):

                   A
          B        C         D
         E F     G H I      J K

A depth-first left-to-right traversal is what you get out of a natural
recursive routine.  It sees the nodes internally in this order:

   A B E F C G H I D J K

In a preorder DFS (depth first search), you deliver ("do something with" --
print it, yield it, whatever) the node before delivering its children.
Preorder DFS in the tree above delivers the nodes in order

   A B E F C G H I D J K

which is the same order in which nodes are first seen.  This is what I
called "top down".  In a postorder DFS, you deliver the node *after*
delivering its children, although you still first see nodes in the same
order.  Postorder left-to-right DFS in the tree above delivers nodes in this
order:

   E F B G H I C J K D A

This is what I called "bottom up".

A breadth-first search can't be done naturally using recursion; you need to
maintain an explicit queue for that (or write convoluted recursive code).  A
BFS on the tree above would see the nodes in this order:

   A B C D E F G H I J K

It can be programmed like so, given a suitable queue implementation:

     queue = SuitableQueueImplementation()
     queue.enqueue(root)
     while queue:
          node = queue.dequeue()
          for child in node.children():
              queue.enqueue(child)

Nobody has written a breadth-first traverser in this thread.  If someone
wants to, there are again preorder and postorder variations, although only
preorder BFS falls naturally out of the code just above.

The current os.path.walk() delivers directories in preorder depth-first
left-to-right order, BTW.

>      for name in children:
>          fullpath = os.path.join(basepath, name)
>          if os.path.isdir (fullpath) and not (ignorelinks and
> os.path.islink(fullpath)):

Despte what I said above <wink>, I expect the ignorelinks argument is a good
idea.

>              for (next_basepath, next_children) in walktree
> (fullpath, depthfirst, ignorelinks):
>                  yield next_basepath, next_children

Note that there's no need to pull apart 2-tuples and paste them together
again here;

    for x in walktree(...):
        yield x

does the same thing.