[Python-ideas] BetterWalk, a better and faster os.walk() for Python

Andrew Barnert abarnert at yahoo.com
Mon Nov 26 08:52:39 CET 2012


From: John Mulligan <phlogistonjohn at asynchrono.us>
Sent: Sun, November 25, 2012 4:54:45 AM

> Agreed, keeping things separate might be a better approach. I wanted to point 
> out the usefulness of an enhanced listdir/iterdir as its own beast in addition 
> to improving os.walk. 

Agreed. I would have uses for an iterdir_stat on its own.

> There is one thing that is advantageous about creating an ideal enhanced 
> os.walk. People would only have to change the module walk is getting imported 
> from, no changes would have to be made anywhere else even if that code is 
> using features like the ability to modify dirnames (when topdown=True).

Sure, there's a whole lot of existing code, and existing knowledge in people's 
heads, so, if it turns out to be easy, we might as well provide a drop-in 
replacement. 

But if it's not, I'd be happy enough with something like, "You can use fts.walk 
as a faster replacement for os.walk if you don't modify dirnames. If you do need 
to modify dirnames, either stick with os.walk, or rewrite your code around fts."
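
For anyone not familiar with the dirnames feature being discussed, here is a 
minimal sketch of the kind of existing code a drop-in replacement would have to 
keep working. It uses only the real os.walk; the helper name walk_pruned and 
the default skip list are made up for illustration.

```python
import os

def walk_pruned(top, skip=(".git", "__pycache__")):
    """Yield (dirpath, filenames), never descending into directories in skip.

    Hypothetical helper: it relies on os.walk(topdown=True) re-reading the
    dirnames list after each yield to decide where to recurse.
    """
    for dirpath, dirnames, filenames in os.walk(top, topdown=True):
        # Slice assignment edits the list in place, so os.walk sees the change;
        # rebinding the name (dirnames = [...]) would have no effect.
        dirnames[:] = [d for d in dirnames if d not in skip]
        yield dirpath, filenames
```

Any replacement that hands the caller a copy of the directory list, rather than 
the list it later recurses over, would silently break code like this.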

Anyway, the code below is the obvious way to implement os.walk on top of fts, 
but I'd need to think it through and test it to see if it handles everything 
properly:

def walk(path, topdown=True, onerror=None, followlinks=False):
    level = None
    dirents, filenames = [], []
    dirpath = path
    with open([path], 
              ((LOGICAL if followlinks else PHYSICAL) | 
               NOSTAT | NOCHDIR | COMFOLLOW)) as f:
        for ent in f:
            if ent.level != level:
                dirnames = [dirent.name for dirent in dirents]
                yield dirpath, dirnames, filenames
                for dirent in dirents:
                    if dirent.name not in dirnames:
                        f.skip(dirent)
                level = ent.level
                dirents, filenames = [], []
                path = os.path.join(path, ent.path)
            else:
                if ent.info in (D, DC):
                    if topdown:
                        dirents.append(ent)
                elif ent.info == DP:
                    if not topdown:
                        dirents.append(ent)
                elif ent.info in (DNR, ERR):
                    if onerror:
                        # make OSError with filename member
                        onerror(None) 
                elif ent.info == DOT:
                    pass
                else:
                    filenames.append(ent.name)
    dirnames = [dirent.name for dirent in dirents]
    yield dirpath, dirnames, filenames

> I am not sure if fts or other platform specific API could be wrangled into an 
> exact drop in replacement.

My goal isn't to use fts as a platform-specific API to re-implement os.walk, but 
to replace os.walk with a better API which is just a pythonized version of the 
fts API (and is available on every platform, as efficiently as possible).

If that also gives us a drop-in replacement for os.walk, that's gravy.
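
To make "a pythonized version of the fts API" a bit more concrete, here is a 
rough pure-Python sketch of the traversal order fts produces: each directory is 
reported once before its contents and once after. This is only an illustration 
built on os.listdir, not the proposed implementation; fts_like and the string 
codes are made-up stand-ins for the D, DP and file info values used in the walk 
sketch, and it lumps every non-directory together as F.

```python
import os

# Stand-ins for the fts info codes: pre-order dir, post-order dir, other.
D, DP, F = "D", "DP", "F"

def fts_like(root, level=0):
    """Yield (info, path, level) tuples in fts-style order.

    Hypothetical sketch: a directory yields D, then its children (sorted by
    name here for determinism), then DP. Symlinks are never followed.
    """
    yield D, root, level
    for name in sorted(os.listdir(root)):
        child = os.path.join(root, name)
        if os.path.isdir(child) and not os.path.islink(child):
            # Recurse: the subdirectory reports its own D ... DP pair.
            yield from fts_like(child, level + 1)
        else:
            yield F, child, level + 1
    yield DP, root, level
```

A topdown walk would act on the D events and a bottom-up walk on the DP events, 
which is why the sketch above checks both.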


> > From: John Mulligan <phlogistonjohn at asynchrono.us>
> > Sent: Fri, November 23, 2012 8:13:22 AM
> > 
> > > I like returning the d_type directly because in the unix style APIs the
> > > dirent structure doesn't provide the same stuff as the stat result and I
> > > don't want to trick myself into thinking I have all the information
> > > available from the readdir call. I also like to have my Python functions
> > > map pretty closely to the C calls.
> > 
> > Of course that means that implementing the same interface on Windows means
> > faking d_type from the stat result, and making the functions map less
> > closely to the C calls…
> 
> I agree, I don't know if it would be better to simply have platform dependent 
> fields/values in the struct or if it is better to abstract things in this 
> case. 
>
> Anyway, the betterwalk code is already converting constants from the Windows 
> API to mode values. Something similar might be possible for d_type values as 
> well.

I was just bringing up the point that, in your quest for mapping Python to C as 
thinly as possible on POSIX, you're inherently making the mapping a little 
thicker on Windows. That isn't necessarily a problem (the same thing is true for 
much of the os module today, after all), just something to keep in mind.

Either way, I would want to have some way of knowing "is this entry a directory" 
without having to figure out which of two values I need to check based on my 
platform, if at all possible.
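
As a sketch of the interface I mean (not of an efficient implementation), the 
caller should get a plain boolean and never see d_type or Windows attribute 
constants. The name iterdir_isdir is made up, and this version pays for an 
lstat per entry, which is exactly the cost a real iterdir_stat would avoid.

```python
import os
import stat

def iterdir_isdir(path):
    """Yield (name, is_dir) for each entry in path.

    Hypothetical helper: the platform-specific detail is hidden behind one
    boolean. Uses lstat, so a symlink to a directory reports False.
    """
    for name in os.listdir(path):
        mode = os.lstat(os.path.join(path, name)).st_mode
        yield name, stat.S_ISDIR(mode)
```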

> > > In addition I have a fditerdir call that supports a directory file
> > > descriptor as the first argument. This is handy because I also have a
> > > wrapper for fstatat (this was all created for Python 2 and before 3.3
> > > was released).
> > 
> > This can only be implemented on platforms that support the *at functions. I
> > believe that means just Linux and OpenBSD right now, other *BSD (including
> > OS X) at some unspecified point in the future. Putting something like that
> > in the stdlib would probably require also adding another function like
> > os_supports_at (similar to supports_fd, supports_dirfd, etc.), but that's
> > not a big deal.
> 
> I agree that this requires supporting platforms. (I've run this on FreeBSD as 
> well.) I didn't mean to imply that this should be required for a better walk 
> function. I wanted to provide some color about the value of exposing alternate 
> listdir-type functions themselves and not just as a stepping stone on the way 
> to enhancing walk.


This also raises the point that there is no "ffts" or "ftsat" on any platform I 
know of, and in fact implementing the former wouldn't be totally trivial, 
because fts remembers the root paths… So, if we wanted an fwalk, it might be a 
bit trickier than an fiterdir.
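
For comparison, the fiterdir half is roughly this simple on top of what Python 
3.3 already exposes. This is a sketch under assumptions, not John's 
implementation: fd_iterdir_stat is a made-up name, and it only works where 
os.listdir accepts a file descriptor and os.stat accepts dir_fd, which the 
os.supports_fd and os.supports_dir_fd sets report.

```python
import os
import stat

def fd_iterdir_stat(dirfd):
    """Yield (name, stat_result) for entries of the directory open as dirfd.

    Hypothetical helper: names are resolved relative to dirfd (fstatat-style),
    so no path string for the directory is ever needed.
    """
    if os.listdir not in os.supports_fd or os.stat not in os.supports_dir_fd:
        raise NotImplementedError("no fd-based listdir/stat on this platform")
    for name in os.listdir(dirfd):
        # follow_symlinks=False gives lstat semantics for each entry.
        yield name, os.stat(name, dir_fd=dirfd, follow_symlinks=False)
```

Nothing here has to remember a root path, which is the part that makes an 
fts-based fwalk harder.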


