[Python-ideas] BetterWalk, a better and faster os.walk() for Python
Andrew Barnert
abarnert at yahoo.com
Mon Nov 26 08:52:39 CET 2012
From: John Mulligan <phlogistonjohn at asynchrono.us>
Sent: Sun, November 25, 2012 4:54:45 AM
> Agreed, keeping things separate might be a better approach. I wanted to point
> out the usefulness of an enhanced listdir/iterdir as its own beast in addition
> to improving os.walk.
Agreed. I would have uses for an iterdir_stat on its own.
> There is one thing that is advantageous about creating an ideal enhanced
> os.walk. People would only have to change the module walk is getting imported
> from, no changes would have to be made anywhere else even if that code is
> using features like the ability to modify dirnames (when topdown=True).
Sure, there's a whole lot of existing code, and existing knowledge in people's
heads, so, if it turns out to be easy, we might as well provide a drop-in
replacement.
But if it's not, I'd be happy enough with something like, "You can use fts.walk
as a faster replacement for os.walk if you don't modify dirnames. If you do need
to modify dirnames, either stick with os.walk, or rewrite your code around fts."
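
For reference, this is the dirnames behavior that a drop-in replacement would have to preserve. It is a minimal sketch against the existing os.walk only; walk_pruned and demo are hypothetical names made up for illustration, not part of any proposed API.

```python
import os
import tempfile

def walk_pruned(top, skip_names):
    """Yield (dirpath, filenames) from os.walk without descending into skip_names."""
    for dirpath, dirnames, filenames in os.walk(top, topdown=True):
        # Slice assignment mutates the list that os.walk itself holds;
        # rebinding dirnames to a new list would not prune anything.
        dirnames[:] = sorted(d for d in dirnames if d not in skip_names)
        yield dirpath, sorted(filenames)

def demo():
    """Build a tiny tree and return the directories visited, relative to its root."""
    with tempfile.TemporaryDirectory() as top:
        os.makedirs(os.path.join(top, "keep"))
        os.makedirs(os.path.join(top, "skip", "deep"))
        open(os.path.join(top, "keep", "a.txt"), "w").close()
        open(os.path.join(top, "skip", "b.txt"), "w").close()
        return [os.path.relpath(d, top) for d, _ in walk_pruned(top, {"skip"})]
```

Pruning only takes effect with topdown=True, because with topdown=False the subdirectories have already been visited by the time the caller sees dirnames.
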
Anyway, the code below is the obvious way to implement os.walk on top of fts,
but I'd need to think it through and test it to see if it handles everything
properly:
    def walk(path, topdown=True, onerror=None, followlinks=False):
        level = None
        dirents, filenames = [], []
        dirpath = path
        with open([path],
                  ((LOGICAL if followlinks else PHYSICAL) |
                   NOSTAT | NOCHDIR | COMFOLLOW)) as f:
            for ent in f:
                if ent.level != level:
                    dirnames = [dirent.name for dirent in dirents]
                    yield dirpath, dirnames, filenames
                    for dirent in dirents:
                        if dirent.name not in dirnames:
                            f.skip(dirent)
                    level = ent.level
                    dirents, filenames = [], []
                    path = os.path.join(path, ent.path)
                else:
                    if ent.info in (D, DC):
                        if topdown:
                            dirents.append(ent)
                    elif ent.info == DP:
                        if not topdown:
                            dirents.append(ent)
                    elif ent.info in (DNR, ERR):
                        if onerror:
                            # make OSError with filename member
                            onerror(None)
                    elif ent.info == DOT:
                        pass
                    else:
                        filenames.append(ent.name)
            dirnames = [dirent.name for dirent in dirents]
            yield dirpath, dirnames, filenames
> I am not sure if fts or other platform specific API could be wrangled into an
> exact drop in replacement.
My goal isn't to use fts as a platform-specific API to re-implement os.walk, but
to replace os.walk with a better API which is just a pythonized version of the
fts API (and is available on every platform, as efficiently as possible).
If that also gives us a drop-in replacement for os.walk, that's gravy.
> > From: John Mulligan <phlogistonjohn at asynchrono.us>
> > Sent: Fri, November 23, 2012 8:13:22 AM
> >
> > > I like returning the d_type directly because in the unix style APIs the
> > > dirent structure doesn't provide the same stuff as the stat result and I
> > > don't want to trick myself into thinking I have all the information
> > > available from the readdir call. I also like to have my Python functions
> > > map pretty closely to the C calls.
> >
> > Of course that means that implementing the same interface on Windows means
> > faking d_type from the stat result, and making the functions map less
> > closely to the C calls…
>
> I agree, I don't know if it would be better to simply have platform dependent
> fields/values in the struct or if it is better to abstract things in this
> case.
>
> Anyway, the betterwalk code is already converting constants from the Windows
> API to mode values. Something similar might be possible for d_type values as
> well.
I was just bringing up the point that, in your quest for mapping Python to C as
thinly as possible on POSIX, you're inherently making the mapping a little
thicker on Windows. That isn't necessarily a problem (the same thing is true for
much of the os module today, after all), just something to keep in mind.
Either way, I would want to have some way of knowing "is this entry a directory"
without having to figure out which of two values I need to check based on my
platform, if at all possible.
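
A rough sketch of what I mean by a single check, assuming the entry carries an optional d_type and an optional st_mode. The function name entry_is_dir is hypothetical, and the DT_* numbers are the values from <dirent.h> on Linux and the BSDs, written out by hand here as an assumption rather than taken from any Python module.

```python
import stat

# Assumed values, matching <dirent.h> on Linux and the BSDs.
DT_UNKNOWN = 0
DT_DIR = 4

def entry_is_dir(d_type=None, st_mode=None):
    """Return True or False when the answer is known, None when a stat is needed."""
    # Prefer d_type when readdir filled it in; DT_UNKNOWN means "ask stat".
    if d_type is not None and d_type != DT_UNKNOWN:
        return d_type == DT_DIR
    # Otherwise fall back to the mode bits (e.g. on Windows, or DT_UNKNOWN).
    if st_mode is not None:
        return stat.S_ISDIR(st_mode)
    return None
```

With this shape, a symlink to a directory reports False from d_type alone, which matches lstat rather than stat semantics; a real API would have to pick one and document it.
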
> > > In addition I have a fditerdir call that supports a directory file
> > > descriptor as the first argument. This is handy because I also have a
> > > wrapper for fstatat (this was all created for Python 2 and before 3.3
> > > was released).
> >
> > This can only be implemented on platforms that support the *at functions. I
> > believe that means just linux and OpenBSD right now, other *BSD (including
> > OS X) at some unspecified point in the future. Putting something like that
> > in the stdlib would probably require also adding another function like
> > os_supports_at (similar to supports_fd, supports_dir_fd, etc.), but that's
> > not a big deal.
>
> I agree that this requires supporting platforms. (I've run this on FreeBSD as
> well.) I didn't mean to imply that this should be required for a better walk
> function. I wanted to provide some color about the value of exposing alternate
> listdir-type functions themselves and not just as a stepping stone on the way
> to enhancing walk.
> to enhancing walk.
This also raises the point that there is no "ffts" or "ftsat" on any platform I
know of, and in fact implementing the former wouldn't be totally trivial,
because fts remembers the root paths… So, if we wanted an fwalk, it might be a
bit trickier than an fiterdir.
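
For the fiterdir side, the per-entry stat relative to a directory descriptor can already be written with what 3.3 provides. This is only a sketch of that one step, not an fts-based fwalk; stat_in_dir and demo are hypothetical names, and the fallback branch is my assumption about what a portable wrapper would want to do.

```python
import os
import tempfile

def stat_in_dir(dirpath, name):
    """lstat `name` relative to an open directory fd where the platform allows it."""
    if os.stat in os.supports_dir_fd:
        # fstatat-style lookup: the name is resolved against the open
        # descriptor, not against a path string.
        fd = os.open(dirpath, os.O_RDONLY)
        try:
            return os.stat(name, dir_fd=fd, follow_symlinks=False)
        finally:
            os.close(fd)
    # No *at support: fall back to a plain path-based lstat.
    return os.lstat(os.path.join(dirpath, name))

def demo():
    """Create a 5-byte file in a temporary directory and return its reported size."""
    with tempfile.TemporaryDirectory() as top:
        with open(os.path.join(top, "f.txt"), "w") as fh:
            fh.write("hello")
        return stat_in_dir(top, "f.txt").st_size
```

The membership test against os.supports_dir_fd is the kind of capability check discussed above; code that needs the descriptor-relative behavior for correctness, rather than speed, should probably raise instead of falling back.
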