Optimizing tips for os.listdir

Nick Craig-Wood nick at craig-wood.com
Tue Sep 28 11:30:00 CEST 2004


Bengt Richter <bokr at oz.net> wrote:
>  On 27 Sep 2004 14:30:18 GMT, Nick Craig-Wood <nick at craig-wood.com> wrote:
> 
>  You ought to be able to gain a little by hoisting the os.path.xxx
>  attribute lookups for join and isdir out of the loop. E.g, (not tested)
> 
>      opj=os.path.join; oisd=os.path.isdir
>      [opj(path, p) for p in os.listdir(path) if oisd(opj(path, p))]
> 
>  But it seems like you are asking the os to chase through full paths at
>  every isdir operation, rather than just telling it to make its current working
>  directory the directory you are interested in and doing it there. E.g., (untested)
> 
>      savedir = os.getcwd()
>      os.chdir(path)
>      dirs = [opj(path, p) for p in os.listdir('.') if oisd(p)]
>      os.chdir(savedir)
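
The chdir idea quoted above could be wrapped up roughly as follows. This is only a sketch: the function name subdirs_via_chdir is made up for illustration, and the try/finally is an addition so the working directory is restored even if listdir fails.

```python
import os


def subdirs_via_chdir(path):
    """Return the full paths of the subdirectories of `path`.

    Hypothetical helper: each entry is tested relative to the current
    working directory instead of through the full path.
    """
    join, isdir = os.path.join, os.path.isdir  # hoist the attribute lookups
    savedir = os.getcwd()
    os.chdir(path)
    try:
        return [join(path, p) for p in os.listdir(".") if isdir(p)]
    finally:
        os.chdir(savedir)  # restore the cwd even if listdir() raised
```

Note that changing the working directory affects the whole process, so this is probably only reasonable in single-threaded code.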

Here are some timings, with 1000 files in the directory:

# 1) Original using '.'
/usr/lib/python2.3/timeit.py -s 'import os; path="."' \
'[os.path.join(path, p) for p in os.listdir(path) if os.path.isdir(os.path.join(path, p))]'
10 loops, best of 3: 2.69e+04 usec per loop

# 2) Original with long path
/usr/lib/python2.3/timeit.py -s 'import os;
path="/tmp/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z"' \
'[os.path.join(path, p) for p in os.listdir(path) if os.path.isdir(os.path.join(path, p))]'
10 loops, best of 3: 4.16e+04 usec per loop

# 3) Using cwd
/usr/lib/python2.3/timeit.py -s 'import os; path="."' \
'[os.path.join(path, p) for p in os.listdir(".") if os.path.isdir(p)]'
100 loops, best of 3: 1.85e+04 usec per loop
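
For anyone wanting to repeat the comparison without the command line, something like the following might do. It is a sketch under assumptions: the helper names (make_test_dir, original, using_cwd, best_usec, compare) are invented here, the test directory holds only empty files, and the numbers will of course differ from machine to machine.

```python
import os
import shutil
import tempfile
import timeit


def make_test_dir(n_files=1000):
    """Create a temporary directory holding n_files empty files."""
    d = tempfile.mkdtemp()
    for i in range(n_files):
        open(os.path.join(d, "f%04d" % i), "w").close()
    return d


def original(path):
    # Variant 1/2: isdir() is given the joined path for every entry.
    return [os.path.join(path, p) for p in os.listdir(path)
            if os.path.isdir(os.path.join(path, p))]


def using_cwd(path):
    # Variant 3: the caller must already have chdir'ed into `path`.
    return [os.path.join(path, p) for p in os.listdir(".")
            if os.path.isdir(p)]


def best_usec(func, path, number=10, repeat=3):
    """Best-of-`repeat` time per call, in microseconds."""
    timer = timeit.Timer(lambda: func(path))
    return min(timer.repeat(repeat=repeat, number=number)) / number * 1e6


def compare(n_files=1000):
    """Return (usec for original, usec for using_cwd) on a fresh directory."""
    d = make_test_dir(n_files)
    savedir = os.getcwd()
    try:
        t_orig = best_usec(original, d)
        os.chdir(d)
        t_cwd = best_usec(using_cwd, d)
    finally:
        os.chdir(savedir)
        shutil.rmtree(d)
    return t_orig, t_cwd
```

Calling compare() returns the two best-of-3 figures in usec per loop, in the same units as the timeit output above.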

> > The above timings ignore the effect of caching - will the directory
> > you are enumerating be hot in the cache?
>  Even if so, I doubt the os finds it via a hash of the full path instead
>  of checking that every element of the path exists and is a subdirectory.
>  IWT that could be a dangerous short cut

It is.  Linux will look up each path component in turn.  However, they
will all be hot in the dcache, so the lookups don't take much time,
hence the relatively small difference between 1) and 2).

I expect the main difference between 1) and 3) is that 3) makes one
fewer os.path.join() call per directory entry:

/usr/lib/python2.3/timeit.py -s 'import os;' 'os.path.join("a", "b")'
100000 loops, best of 3: 7.34 usec per loop

That extra join is executed 1000 times in 1), which comes to 7340 usec.
The difference between 1) and 3) is 8400 usec - pretty close!
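
The arithmetic can be checked quickly with the figures measured above. The variable names are just for illustration.

```python
per_join_usec = 7.34   # cost of one os.path.join(), from timeit
entries = 1000         # files in the test directory
t1_usec = 2.69e4       # timing 1): original using '.'
t3_usec = 1.85e4       # timing 3): using cwd

join_cost = per_join_usec * entries      # about 7340 usec
measured_gap = t1_usec - t3_usec         # 8400 usec
unexplained = measured_gap - join_cost   # about 1060 usec left over

print("join cost    %.0f usec" % join_cost)
print("measured gap %.0f usec" % measured_gap)
print("unexplained  %.0f usec" % unexplained)
```

So the extra join accounts for roughly 87% of the gap between 1) and 3).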

-- 
Nick Craig-Wood <nick at craig-wood.com> -- http://www.craig-wood.com/nick


