os.path.walk (was: Re: Optimizing code)

Fri Feb 25 10:42:57 EST 2000

Guido van Rossum <guido at cnri.reston.va.us> writes:
> François Pinard <pinard at iro.umontreal.ca> writes:

> > It would be nice if the Python library was maintaining a little cache for
> > `stat', and if there was a way for users to interface with it as wanted.

> Question-- what would happen to code that uses os.path.exists() or one
> of its friends repeatedly, waiting for a file to appear or disappear?
> This code would break unless you put a timeout in your little stat
> cache, which would probably reduce its speed.

I was suggesting a way for users to interface with the cache, which also
means there should be a way to bypass it.
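
A minimal sketch of what such an interface might look like.  The `StatCache'
class and its `use_cache' parameter are hypothetical names made up for
illustration, not anything in the standard library; only `os.stat' is real.
Passing use_cache=False always goes to the disk, which is what code polling
for a file to appear or disappear would need.

```python
import os


class StatCache:
    """Hypothetical per-path cache of os.stat() results (illustration only)."""

    def __init__(self):
        self._cache = {}

    def stat(self, path, use_cache=True):
        # With use_cache=False the cached entry is ignored and refreshed.
        if use_cache and path in self._cache:
            return self._cache[path]
        result = os.stat(path)
        self._cache[path] = result
        return result

    def exists(self, path, use_cache=True):
        try:
            self.stat(path, use_cache)
        except OSError:
            # Failures are never cached; drop any stale entry as well.
            self._cache.pop(path, None)
            return False
        return True

    def forget(self, path=None):
        # Drop one entry, or the whole cache when no path is given.
        if path is None:
            self._cache.clear()
        else:
            self._cache.pop(path, None)
```

A polling loop would call exists(path, use_cache=False), while a script that
examines the same files several times could leave the default alone.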

> > the `find' program has some optimisations to avoid calling `stat',

> "thus it behooves us to arrange directories so that subdirectories occur
> first" (or similar words).

I do not think most people go that far, in practice.  And yet, `find' is
a lot faster now than it used to be in its infancy.  Particularly useful
is the (frequent) case of a directory having no subdirectories at all.

> > The main trick is to save the number of links on the `.' entry [...]
> > When it reaches 2, we know that no directories remain [...]  I surely
> > often use `os.path.walk' in my own things, so any speed improvement
> > in that area would be welcome for me.

> I can imagine any number of reasons why the above rule might fail (mount
> points, NFS, Samba, symbolic links, automounter, race conditions).

And yet, `find' seems to work flawlessly and dependably, and has done so
for years.  I do not maintain it, however, and do not know its bug history.

> I'm reluctant to add hacks like this to the standard library -- a bug
> there multiplies by a million.

I surely understand your concern for solidity: something like Python could
not succeed if it were not dependable.  Yet speed is also something to
consider, especially when it comes to sparing spurious disk accesses: there you
may quickly lose the benefit of tremendous efforts at optimising the rest.

Of course, I could reprogram os.path.walk for my own things.  I thought it
would be nice if the standard walk in Python behaved as well as it could,
depending on the system it runs on.  When a Python script walks
the structure of a disk, this is often where most of the time is spent,
and for big file hierarchies, this time can become quite significant.
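
For illustration, here is a rough sketch of the link-count trick applied to
a directory walk.  It is only a sketch under stated assumptions, not how
`find' or the Python library actually does it.  It assumes the traditional
Unix convention that a directory has two links plus one per subdirectory,
and falls back to checking every entry when the count is below 2.  The name
`walk_dirs' is made up, the code uses generator syntax from later Python
versions, and it does nothing about the mount point, NFS, or race condition
cases raised above.

```python
import os
import stat


def walk_dirs(top):
    """Yield (dirpath, dirnames, filenames), top-down (hypothetical helper).

    Stops calling lstat() on entries once the link count of `top` says that
    no subdirectories remain to be found.
    """
    try:
        names = os.listdir(top)
        nlink = os.lstat(top).st_nlink
    except OSError:
        return
    # Assumption: nlink == 2 + number of subdirectories.  A count below 2
    # means the filesystem does not follow the convention, so trust nothing.
    remaining = nlink - 2 if nlink >= 2 else None
    dirs, files = [], []
    for name in names:
        if remaining == 0:
            # All subdirectories are accounted for: no lstat() needed.
            files.append(name)
            continue
        try:
            st = os.lstat(os.path.join(top, name))
        except OSError:
            files.append(name)
            continue
        if stat.S_ISDIR(st.st_mode):
            dirs.append(name)
            if remaining is not None:
                remaining -= 1
        else:
            files.append(name)
    yield top, dirs, files
    for name in dirs:
        yield from walk_dirs(os.path.join(top, name))
```

In a directory with no subdirectories at all, this makes a single lstat()
call for the directory itself and none for its entries.  Symbolic links are
not followed, since lstat() reports them as non-directories.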

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard