Writing huge Sets() to disk

Duncan Booth duncan.booth at invalid.invalid
Mon Jan 17 06:37:28 EST 2005


Martin MOKREJŠ wrote:

> Hi,
>   could someone tell me what does and what doesn't copy
> references in Python? I have found that my script, after reaching
> some state and taking say 600MB, pushes its internal dictionaries
> to hard disk. The for loop consumes another 300MB (as gathered
> by vmstat) to push the data to dictionaries, then releases a
> little bit less than 300MB, and the program starts to fill up
> its internal dictionaries again; when "full" it will do the
> flush again ...

Almost anything you do copies references.

>  
>   The point here is that this code takes a lot of extra memory.
> I believe it's the references problem, and I remember complaints
> of friends facing the same problem. I'm a newbie, yes, but I don't
> have this problem with Perl. OK, I want to improve my Python
> knowledge ... :-))
> 
> 
> 
> 
<long code extract snipped>
> 
> 
>    The above routine doesn't release the memory back when it
> exits.

That's probably because there isn't any memory it can reasonably be 
expected to release. What memory would *you* expect it to release?

The member variables are all still accessible as member variables until you 
run your loop at the end to clear them, so there is no way Python could 
release them.

Some hints:

When posting code, try to post complete examples which actually work. I 
don't know what type the self._dict_on_diskXX variables are supposed to be. 
It makes a big difference if they are dictionaries (so you are trying to 
hold everything in memory at one time) or shelve.Shelf objects which would 
store the values on disc in a reasonably efficient manner.
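
As a rough illustration of the difference (the filename here is made up):

import shelve

# A plain dictionary keeps every key and value in memory until the
# process exits or the dictionary is cleared.
in_memory = {}
in_memory['some_key'] = (1, 0, 0, 0)

# A Shelf pickles each value into the file as it is assigned, so the
# data lives on disc rather than in RAM.
on_disk = shelve.open('counts.db')
on_disk['some_key'] = (1, 0, 0, 0)
on_disk.close()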

Even if they are Shelf objects, I see no reason here why you have to 
process everything at once. Write a simple function which processes one 
tmpdict object into one dict_on_disk object and then closes the 
dict_on_disk object. If you want to compare results later then do that by 
reopening the dict_on_disk objects when you have deleted all the tmpdicts.
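
For the comparison step, something like this rough sketch would do (the 
function and file names are only placeholders):

import shelve

def compare(filenames):
    # Reopen the shelves read-only once all the tmpdicts are gone, and
    # walk over them without pulling everything back into memory.
    shelves = [shelve.open(name, 'r') for name in filenames]
    try:
        for key in shelves[0].keys():
            values = [db.get(key) for db in shelves]
            # ... whatever comparison you actually need ...
    finally:
        for db in shelves:
            db.close()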

Extract everything you want to do into a class which has at most one 
tmpdict and one dict_on_disk. That way your code will be a lot easier to 
read.
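
A sketch of the sort of shape I mean (the class and method names are only 
illustrative, and the stored value is simplified to a bare count):

import shelve

class CountTable:
    """One in-memory dictionary paired with one shelf on disc."""

    def __init__(self, filename):
        self.filename = filename
        self.tmpdict = {}

    def add(self, key, amount=1):
        # Accumulate counts in memory until flush() is called.
        self.tmpdict[key] = self.tmpdict.get(key, 0) + amount

    def flush(self):
        # Merge the in-memory counts into the shelf, then empty the dict.
        database = shelve.open(self.filename)
        try:
            for key, value in self.tmpdict.items():
                database[key] = database.get(key, 0) + value
        finally:
            database.close()
        self.tmpdict.clear()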

Make your code more legible by using fewer underscores.

What on earth is the point of an explicit call to __add__? If Guido had 
meant us to use __add__ he wouldn't have created '+'.
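For example, count + data[key] says exactly the same thing as 
count.__add__(data[key]) and is far easier to read.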

What is the purpose of dict_on_disk? Is it for humans to read the data? If 
not, then don't store everything as a string. Far better to just store a 
tuple of your values; then you don't have to use split or cast the strings 
to integers. If you do want humans to read some final output then produce 
that separately from the working data files.
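
For instance, instead of something like

    database[key] = "%s %s %s %s" % (count, a, b, expected)

which forces a split() and int() calls on every read, shelve will happily 
pickle a plain tuple for you:

    database[key] = (count, a, b, expected)

(the string format above is only a guess at what the original code does).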

You split out 4 values from dict_on_disk and set three of them to 0. Is 
that really what you meant, or should you be preserving the previous 
values?

Here is some (untested) code which might help you:

import shelve

def push_to_disc(data, filename):
    # Open (or create) the shelf file holding the accumulated results.
    database = shelve.open(filename)
    try:
        for key in data:
            if database.has_key(key):
                # Key already on disc: add the new count, keep the rest.
                count, a, b, expected = database[key]
                database[key] = count + data[key], a, b, expected
            else:
                # First time this key is seen: the other fields start at 0.
                database[key] = data[key], 0, 0, 0
    finally:
        database.close()

    # The in-memory dictionary has been written out, so empty it.
    data.clear()

Call that once for each input dictionary and your data will be written out 
to a disc file and the internal dictionary cleared without any great spike 
of memory use.
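
For example (the dictionary and file names here are only placeholders):

push_to_disc(tmpdict1, 'results1.db')
push_to_disc(tmpdict2, 'results2.db')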


