Why is os.stat so slow?
rick.arnett at gmail.com
Tue Jun 16 00:07:22 CEST 2009
I'm fairly new to Python but have programmed in other languages (C, Java)
before. I was experimenting with a Python program that needs to take
a directory tree and compute the total disk usage of every file (and
subfolder) underneath it. This solution also has to run on Windows
Server 2003 for work, where it accesses a NAS shared via CIFS. A
sample folder I'm using contains about 26,000 subfolders and 435,000
files. The original solution I came up with was elegant but
extremely slow (compared to right-clicking the folder tree in Windows
Explorer and clicking Properties). It looked something like this:
import os

folder = r'Z:\foldertree'
folder_size = 0
# Walk the tree and add up the size of every file found.
for (path, dirs, files) in os.walk(folder):
    for file in files:
        folder_size += os.path.getsize(os.path.join(path, file))
I profiled the above code, and os.stat was taking up roughly 90% of the
time. After digging around, I found some code in another post that uses
win32api to call the Windows API directly and speed this up (if you are
interested, search for "supper fast walk"; yes, super is misspelled). To my
surprise, the average time is now about 1/7th of what it used to be.
I believe the problem is that my simple solution had to call os.stat
twice for every file in the tree (once inside os.walk, which has to check
whether each entry is a directory, and once when I call
os.path.getsize).
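
The profiling step mentioned above can be reproduced roughly as follows. This is only a sketch using the standard cProfile and pstats modules, not necessarily how the original measurement was taken; the names walk_size and profile_walk are made up for illustration.

```python
import cProfile
import io
import os
import pstats


def walk_size(folder):
    """Sum file sizes with os.walk + os.path.getsize (the slow approach)."""
    total = 0
    for path, dirs, files in os.walk(folder):
        for name in files:
            total += os.path.getsize(os.path.join(path, name))
    return total


def profile_walk(folder, top=10):
    """Run walk_size under cProfile; return (total_bytes, report_text)."""
    profiler = cProfile.Profile()
    total = profiler.runcall(walk_size, folder)

    # Render the 'top' most expensive entries, sorted by cumulative time.
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return total, buf.getvalue()


# Example (the path is the one from the post and is only a placeholder):
#   total, report = profile_walk(r'Z:\foldertree')
#   print(report)
```

In the report, the lines mentioning stat show how much of the run is spent asking the file system for metadata.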
I understand that os.stat is designed to work on any OS. However, the
overhead should not be that dramatic (in my opinion). Is there
an OS-agnostic way to get this functionality to work faster?
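
A minimal sketch of one portable approach, assuming a Python version that provides os.scandir (3.5+, with context-manager support from 3.6). On Windows the directory listing itself already carries the file type and size, so entry.is_dir() and entry.stat() normally need no extra system call per file. The function name tree_size is made up for illustration, and unlike os.path.getsize this version does not follow symbolic links.

```python
import os


def tree_size(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    stack = [root]  # directories still to be scanned
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    try:
                        if entry.is_dir(follow_symlinks=False):
                            stack.append(entry.path)
                        elif entry.is_file(follow_symlinks=False):
                            # Reuses data cached from the directory listing
                            # where the platform provides it.
                            total += entry.stat(follow_symlinks=False).st_size
                    except OSError:
                        # Entry vanished or is unreadable; skip it.
                        continue
        except OSError:
            # Directory could not be opened; skip it.
            continue
    return total


# Example (placeholder path taken from the post):
#   print(tree_size(r'Z:\foldertree'))
```

How much this helps on a CIFS share will depend on the server and network, so it is worth timing against the os.walk version on the real tree.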
Also, if I wanted to port this to Linux or some other OS, is os.stat
as expensive there? If so, are there other libraries (like win32api)
that help do these operations faster?