super fast walk

gangli at gangli at
Tue Sep 12 21:12:45 CEST 2000

When I was asked to develop a program to clean up a hard disk holding a
huge collection of directories and files (100K), I convinced my boss to
let me use Python for it.  I promised I would deliver the program 10
times faster than anybody doing it in C++, or 5 times faster than in
Java.  I did deliver it on time, but the program ran very slowly: in a
simple case it was 10 times slower than the NT GUI Find.  I used the
profile module to look into the problem.  os.path.isdir alone took 46%
of the CPU time inside os.path.walk!  I rewrote walk using a win32
function and sped my program up until it was as fast as NT Find.  See
the code below:

import sys, os, string, time, re

from win32api import FindFiles
from win32con import FILE_ATTRIBUTE_DIRECTORY
# FindFiles reports failures as pywintypes.error, not os.error
from pywintypes import error as win32_error
DIR_EXCLUDES = ('.', '..')
def win_walk(top, func, arg):
    """Directory tree walk with callback function.

    win_walk(top, func, arg) calls func(arg, d, f_objs, dirs) for each
    directory d in the tree rooted at top (including top itself); f_objs
    is a tuple of file attributes of all the files and subdirs in
    directory d, and dirs lists the subdirectories still to be walked.
    """
    try:
        # find all files under directory: top
        # the return value holds one file attributes tuple per entry,
        # in which item 0 is File Attributes, item 8 is name. (see win32api)
        f_objs = FindFiles(top+'/*')
        # sort out subdirs
        subdirs = []
        for f_obj in f_objs:
            if f_obj[0] & FILE_ATTRIBUTE_DIRECTORY and \
                f_obj[8] not in DIR_EXCLUDES:
                subdirs.append(f_obj[8])
    except (os.error, win32_error):
        # unreadable directory: skip it, as os.path.walk does
        return
    # call callback function
    func(arg, top, f_objs, subdirs)
    # do walking
    for dir in subdirs:
        name = top+'/'+dir
        win_walk(name, func, arg)

# remember current time
CUR_TIME = time.time()
# get time module compatible time from PyTime object
def wpy2time(pytime):
    f_time = int(pytime) # file last write time
    #fix win32 PyTime bug
    return f_time - time.altzone

# find all debug directories that are older than one week
debug_m = re.compile('abc.+\.debug', re.I).match
HOUR_24 = 24*3600
WEEK_1 = HOUR_24*7
def win_act(verbose, top, f_objs, dirs):
    if verbose > 1: print "checking directory:", top
    for f_obj in f_objs:
        dir = f_obj[8] # file name
        if dir not in dirs: # directory only
            continue
        if dir[-1] in ('g','G') and debug_m(dir):
            dirs.remove(dir) #stop looking into this
            f_time = wpy2time(f_obj[3]) # file last write time
            f_age = CUR_TIME - f_time
            if (f_age > WEEK_1): # file is older than one week
                path = top+'/'+dir
                print 'delete directory:', path

from os import listdir
from os.path import isdir, walk, getmtime

def act(verbose, top, names):
    if verbose > 1: print "checking directory:", top
    dirs = names[:]
    for dir in dirs:
        path = top+'/'+dir
        if not isdir(path): # directory only
            continue
        if dir[-1] in ('g','G') and debug_m(dir):
            names.remove(dir) #stop looking into this
            f_time = getmtime(path) # file last write time
            f_age = CUR_TIME - f_time
            if (f_age > WEEK_1): # file is older than one week
                print 'delete directory:', path

verbose = 0
top = "d:/projects"

tt = time.time()
os.path.walk(top, act, verbose)
print 'walk time spent:', time.time() - tt

tt = time.time()
win_walk(top, win_act, verbose)
print 'win_walk time spent:', time.time() - tt

# The End *****************************************
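
For anyone who wants to repeat the profiling step mentioned at the top
of this post, here is a rough sketch.  It is not the code I ran at the
time: it assumes a current Python 3 with cProfile and pstats, and
listdir_walk and profile_walk are made-up names.  listdir_walk only
imitates the listdir-plus-isdir pattern of the old os.path.walk so that
the isdir cost shows up in the report.

```python
import cProfile
import io
import os
import pstats


def listdir_walk(top, visit):
    """Walk a tree with os.listdir plus os.path.isdir (hypothetical helper)."""
    try:
        names = os.listdir(top)
    except OSError:
        return
    visit(top, names)
    for name in names:
        path = os.path.join(top, name)
        # isdir costs one extra stat call per directory entry
        if os.path.isdir(path) and not os.path.islink(path):
            listdir_walk(path, visit)


def profile_walk(top):
    """Profile listdir_walk on top and return the stats report as text."""
    profiler = cProfile.Profile()
    profiler.runcall(listdir_walk, top, lambda d, names: None)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats('cumulative').print_stats()
    return buf.getvalue()
```

Calling print(profile_walk("d:/projects")) prints one line per function,
sorted by cumulative time; the share taken by isdir will depend on the
platform and the Python version.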


If we change os.listdir to return a list of UserString-like objects
that can answer isdir and getmtime themselves, we can replace
os.path.walk and take advantage of NT to speed up the whole process.
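
A minimal sketch of that idea, assuming a Python version that provides
os.scandir (3.5 and later), whose entries can report is_dir and stat
results without a separate os.path call per name.  scandir_walk and
old_dirs are hypothetical names used only for illustration, and the
callback signature simply mirrors win_walk above.

```python
import os
import time

WEEK_1 = 7 * 24 * 3600


def scandir_walk(top, func, arg):
    """Call func(arg, top, entries, subdirs) for every directory under top."""
    try:
        with os.scandir(top) as it:
            entries = list(it)
    except OSError:
        return
    # each entry already knows whether it is a directory
    subdirs = [e for e in entries if e.is_dir(follow_symlinks=False)]
    func(arg, top, entries, subdirs)
    # func may prune the walk by removing entries from subdirs
    for entry in subdirs:
        scandir_walk(entry.path, func, arg)


def old_dirs(found, top, entries, subdirs):
    """Hypothetical callback: collect subdirectories older than one week."""
    now = time.time()
    for entry in subdirs:
        mtime = entry.stat(follow_symlinks=False).st_mtime
        if now - mtime > WEEK_1:
            found.append(entry.path)
```

Usage would look like found = []; scandir_walk("d:/projects", old_dirs,
found).  How much this saves compared with listdir plus isdir depends on
the platform, so it is worth timing on the real tree.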


More information about the Python-list mailing list