[Tutor] directory size [using os.path.walk]

Fri, 2 Aug 2002 14:12:00 -0700 (PDT)

On Fri, 2 Aug 2002, Troels Leth Petersen wrote:

> > A variation on this recursive way of finding disk usage can use the
> > os.path.walk() function, which does the tricky recursion stuff for
> > us.  If you'd like, we can give an example of how to use it.
>
> Well - I would like that, if that offer wasn't meant for Klaus only.

os.path.walk() is a strange creature compared to the other os.path
functions; let's take a look at it more closely.

Let's say that we have the following directory structure:

###
[dyoo@tesuque dyoo]$ find test_walk
test_walk
test_walk/subdir1
test_walk/subdir1/dead
test_walk/subdir1/people
test_walk/subdir2
test_walk/subdir2/.bash_profile
test_walk/subdir2/sample_file
###

So I have this sample directory called test_walk, which itself has two
subdirectories.  test_walk/subdir2 contains two files '.bash_profile' and
'sample_file'.  'dead' and 'people' are files within subdir1.

os.path.walk() is a slightly strange function because of what it takes in
as inputs.  If we look at its documentation:

###
>>> import os
>>> print os.path.walk.__doc__
walk(top,func,arg) calls func(arg, d, files) for each directory "d"
    in the tree  rooted at "top" (including "top" itself).  "files" is a list
    of all the files and subdirs in directory "d".
###

we'll see that it doesn't just take in a directory to dive through: it
also takes in a 'func' function!  What this means is that os.path.walk()
itself will call the function that we give it, once for every directory it
visits.  What we are doing when we send 'func' to os.path.walk() is giving
it a 'callback': we're trusting that os.path.walk() will call 'func' back
as it works through the directories.
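
To make the callback idea concrete, here is a minimal sketch of what a
walk()-style function does for us.  simple_walk() is a made-up name for
illustration, not the library's actual source, and the sorting is my own
addition so that the visiting order is predictable.  It is written so that
it also runs on Python 3, where os.path.walk() itself no longer exists.

```python
import os

def simple_walk(top, func, arg):
    """Call func(arg, dirname, names) for top and every directory below it.

    A simplified stand-in for os.path.walk(), for illustration only.
    """
    try:
        names = os.listdir(top)
    except OSError:
        return  # unreadable directory: silently skip it
    names.sort()  # assumption: sorted only to make the order predictable
    func(arg, top, names)  # this is the 'callback' step
    for name in names:
        path = os.path.join(top, name)
        # Recurse into real subdirectories, but do not follow symlinks.
        if os.path.isdir(path) and not os.path.islink(path):
            simple_walk(path, func, arg)

def just_print_the_directory(arg, d, files):
    """A callback that ignores 'arg' and reports what it was handed."""
    print("I'm in directory %s" % d)
    print("And I see %s" % files)
```

Calling simple_walk('/home/dyoo/test_walk', just_print_the_directory, None)
should visit test_walk first, then subdir1, then subdir2.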

Let's try using it.  I'll create a simple function that just prints out
the directory and the files arguments that os.path.walk() will feed it
later on:

###
>>> def justPrintTheDirectory(arg, d, files):
...     print "I'm in directory", d
...     print "And I see", files
...
>>> os.path.walk('/home/dyoo/test_walk', justPrintTheDirectory, ())
I'm in directory /home/dyoo/test_walk
And I see ['subdir1', 'subdir2']
I'm in directory /home/dyoo/test_walk/subdir1
And I see ['dead', 'people']
I'm in directory /home/dyoo/test_walk/subdir2
And I see ['.bash_profile', 'sample_file']
###

Now why in the world does os.path.walk() take in three arguments?  In the
example above, I just fed the empty tuple in there because I was lazy.
Why might we want to use that 'arg' parameter?

One reason is that we might want to accumulate some list or set of values
as we run through the directories.  For example, we can set 'arg' to a
list or other container, and fiddle with it in our function:

###
>>> def collect_all_filenames(list_of_filenames, directory, files):
...     for f in files:
...         list_of_filenames.append(os.path.join(directory, f))
...
>>> files = []
>>> os.path.walk('/home/dyoo/test_walk', collect_all_filenames, files)
>>> files
['/home/dyoo/test_walk/subdir1',
 '/home/dyoo/test_walk/subdir2',
 '/home/dyoo/test_walk/subdir1/dead',
 '/home/dyoo/test_walk/subdir1/people',
 '/home/dyoo/test_walk/subdir2/.bash_profile',
 '/home/dyoo/test_walk/subdir2/sample_file']
###

So we allow collect_all_filenames() here to incrementally fill in our
'files' list for us.  A little tricky, but useful.
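
Since this thread started with directory sizes, here is a rough sketch of
the same accumulation idea applied to disk usage.  It swaps in a different
tool: os.walk(), which was added in Python 2.3 and is the replacement for
os.path.walk() (removed in Python 3).  os.walk() hands back the directories
one at a time in a loop, so no callback or 'arg' is needed.
directory_size() is a hypothetical helper name, and skipping symlinks and
unreadable files is my own assumption about what "size" should mean.

```python
import os

def directory_size(top):
    """Return the total size, in bytes, of the regular files under top."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # assumption: don't count symlink targets
            try:
                total += os.path.getsize(path)
            except OSError:
                pass  # file vanished or is unreadable: skip it
    return total
```

Something like directory_size('/home/dyoo/test_walk') would then give the
combined size of the four files in the sample tree.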

To tell the truth, I've never liked os.path.walk() --- it doesn't feel
Pythonic to me because it is a bit complex to work with.  We can talk
about how we can wrap this in a class to make it easier to use if you'd
like.

Hope this helps!