Processing huge datasets

Anders Søndergaard anders.soendergaard at nokia.com
Mon May 10 08:00:03 EDT 2004


Hi,

I'm trying to process a large filesystem (20+ million files) and keep the
directories along with summarized information about the files (sizes,
modification times, newest file and the like) in an instance hierarchy
in memory. I read the information from a Berkeley Database.

I'm keeping it in a Left-Child-Right-Sibling instance structure, that I
operate on recursively.
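
For reference, a stripped-down sketch of the kind of node I'm using
(field names simplified; the real class carries a few more summary
fields):

    class Node(object):
        # __slots__ suppresses the per-instance __dict__, which saves
        # a good deal of memory with millions of instances around.
        __slots__ = ('name', 'size', 'mtime', 'newest', 'child', 'sibling')

        def __init__(self, name):
            self.name = name
            self.size = 0        # total size of the files below
            self.mtime = 0       # latest modification time seen
            self.newest = None   # name of the newest file
            self.child = None    # first child (the "left child")
            self.sibling = None  # next sibling (the "right sibling")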

First I banged my head against the recursion limit, which luckily can
be raised.
Now I simply get a MemoryError.
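
Raising the limit was just a call to sys.setrecursionlimit. I realize
I could sidestep the limit altogether by walking the tree with an
explicit stack, along the lines of the sketch below, but that does
nothing for the memory problem:

    import sys
    sys.setrecursionlimit(100000)  # raised from the default of 1000

    def walk(root):
        # Depth-first traversal of a Left-Child-Right-Sibling tree
        # with an explicit stack, so the tree depth no longer matters.
        stack = [root]
        while stack:
            node = stack.pop()
            # ... summarize/process node here ...
            if node.sibling is not None:
                stack.append(node.sibling)
            if node.child is not None:
                stack.append(node.child)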

Is there a clever way of processing huge datasets in Python?
How would a smart Python programmer approach the problem?

I'm looking at rewriting the code to operate on one part of the
hierarchy at a time and to store the processed data structure in
another Berkeley DB that I can query afterwards. But I'd really prefer
to keep everything in memory, given the huge performance gain.
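
Roughly what I have in mind (summarize is a made-up placeholder here,
and the file name is just an example):

    import bsddb
    import cPickle

    out = bsddb.btopen('summaries.db', 'c')  # B-tree DB keyed by path

    def summarize(node):
        # Placeholder: collect whatever per-directory figures we keep.
        return (node.size, node.mtime, node.newest)

    def store_subtree(path, node):
        # Summarize one part of the hierarchy, pickle the result, and
        # store it so those nodes can be freed before the next part.
        out[path] = cPickle.dumps(summarize(node), 2)

    # Afterwards a summary can be fetched without rebuilding the tree:
    #     summary = cPickle.loads(out['/some/path'])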

Any pointers?

Cheers, Anders


