Writing huge Sets() to disk

Duncan Booth duncan.booth at invalid.invalid
Mon Jan 17 14:25:54 CET 2005

Martin MOKREJŠ wrote:

> Duncan Booth wrote:
>> Almost anything you do copies references.
> But what does this?:
> x = 'xxxxx'

Copies a reference to the existing string 'xxxxx' and stores the new 
reference in the variable x. If it's a global variable then this will 
involve creating a new reference to the existing string 'x' and storing 
both references in the globals() dictionary for the module.
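In CPython this reference traffic can be observed directly with sys.getrefcount; a small sketch (the exact counts are a CPython implementation detail, and getrefcount itself adds a temporary reference while it runs):

```python
import sys

x = 'xxxxx' * 100          # a string object
before = sys.getrefcount(x)

y = x                      # binding another name just adds a reference
after = sys.getrefcount(x)

print(after - before)      # one extra reference, no copy of the data
print(x is y)              # True: both names refer to the same object
```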

> a = x[2:]

Creates a new string by slicing the one referenced by x, then creates a new 
reference to the string and stores it in the variable 'a' (as above), then 
releases the first reference to the new string.

> b = z + len(x)

Creates a new reference to 'z'; creates a new reference to 'x', creates a 
new reference to 'len', calls the function (which involves creating and 
releasing a tuple), releases the references to 'len' and 'x', invokes the 
add operator, releases the reference to 'z', then creates a new reference 
to the result while storing it in the variable 'b'. The result of the 
addition could be a new object, or it may just be a new reference created 
to an existing object.
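That sequence of loads, the call, the add, and the store can be seen in the compiled bytecode; a quick way to inspect it (the exact opcode names vary between CPython versions, so none are claimed here):

```python
import dis

def f(z, x):
    b = z + len(x)
    return b

# Prints the loads of z, x and len, the call to len, the add, and the store to b
dis.dis(f)
```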

> dict[a] = b

If 'dict' is a variable of type 'dict' (and it's a bad idea to use the type 
name as the variable name), then this creates new references to each of 'a' 
and 'b' and stores them in the object 'dict'.
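That the dictionary stores references rather than copies is easy to demonstrate with a mutable value (the name 'd' is used here to avoid shadowing the built-in type):

```python
a = 'some_key'
b = [1, 2, 3]           # a mutable object makes the sharing visible

d = {}                  # not 'dict', to avoid shadowing the built-in
d[a] = b                # stores references to the existing objects

b.append(4)             # mutating through the original name...
print(d[a])             # ...is visible through the dict: no copy was made
print(d[a] is b)        # True
```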

As I said, almost everything you do involves creating and releasing 
references to existing objects.

>> The member variables are all still accessible as member variables
>> until you run your loop at the end to clear them, so no way could
>> Python release them.
> OK, I wanted to know if there's some assignment using a reference,
> which makes the internal garbage collector not recycle the memory,
> as, for example, the target dictionary still keeps reference to the
> temporary dictionary.

So long as the object is still accessible it won't be freed. You aren't 
making anything new reference the temporary dictionary, but so long as the 
old references are still there it can't free the memory.
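One way to watch this happen is a weak reference, which observes an object without keeping it alive. A sketch (plain dicts can't be weak-referenced, hence the trivial subclass; immediate reclamation on the last del is a CPython refcounting detail):

```python
import weakref

class Temp(dict):
    pass

tmp = Temp(a=1, b=2)
probe = weakref.ref(tmp)           # observe tmp without keeping it alive

target = {'snapshot': dict(tmp)}   # copies the *items*, holds no reference to tmp

del tmp                            # last strong reference gone
print(probe() is None)             # True in CPython: freed at once by refcounting
```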

>> Even if they are Shelf objects, I see no reason here why you have to 
> I gathered from previous discussion it's faster to use bsddb directly,
> so no shelve.

Until you know that the speed of accessing the file is a problem, I would go 
for the simple-to-use solution. At present there is no evidence that the 
speed of accessing the file is going to be a problem: your problems seem to 
stem from excessive memory use which means that page faults are likely to 
swamp everything else as far as execution time is concerned.
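For illustration, shelve handles the string conversion for you by pickling the values, so numbers and tuples store directly (modern Python shown; the context-manager form needs Python 3.4+, and the file path here is a throwaway):

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'words')   # throwaway location for the demo

# shelve pickles values behind the scenes: no manual str() conversion needed
with shelve.open(path) as db:
    db['aaa'] = 1
    db['bbb'] = (2, 3)

with shelve.open(path) as db:
    print(db['aaa'])       # the int comes back as an int
    print(db['bbb'])       # the tuple comes back as a tuple
```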

>> What on earth is the point of an explicit call to __add__? If Guido
> had meant us to use __add__ he wouldn't have created '+'.
> To invoke addition directly on the object. It's faster than letting
> Python figure out that I sum up int() plus int(). It definitely
> has helped a lot when using Decimal(a) + Decimal(b), where I got rid
> of thousands of Decimal(__new__), __init__ and I think few other
> methods of decimal as well - I think getcontext too. 

The same point as above: it is most unlikely that these additions are going 
to make a noticeable difference to the running time, so leave them as easy-
to-read additions until you have profiled the code and determined that they 
really are a hotspot that needs optimising. The less time you spend 
optimising prematurely, the sooner you get a program that meets your needs.
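Rather than guessing, timeit settles the question; a sketch comparing the two spellings on ints (no winner is claimed here, since the result depends on the interpreter and version):

```python
import timeit

setup = 'a, b = 12345, 67890'

plain  = timeit.timeit('a + b',         setup=setup, number=100_000)
dunder = timeit.timeit('a.__add__(b)',  setup=setup, number=100_000)

print(f'a + b        : {plain:.4f}s')
print(f'a.__add__(b) : {dunder:.4f}s')
```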

>> tuple of your values then you don't have to use split or cast the
>> strings 
> bsddb complains at me that I can store as a key or value only a string.
> I'd love to store tuple.
>>>> import bsddb
>>>> _words1 = bsddb.btopen('db1.db', 'c')
>>>> _words1['a'] = 1    
> Traceback (most recent call last):
>  File "<stdin>", line 1, in ?
>  File "/usr/lib/python2.3/bsddb/__init__.py", line 120, in __setitem__
>    self.db[key] = value
> TypeError: Key and Data values must be of type string or None.
> How can I record a number then?

Either by wrapping your database access in code which converts the values 
to/from string, or by using a package which already does that (such as 
shelve).
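The wrapping approach amounts to a pair of tiny conversion helpers at the database boundary. bsddb is long gone from the standard library, so this sketch uses dbm.dumb, which has the same strings/bytes-only restriction; put and get are made-up helper names:

```python
import dbm.dumb
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'counts')   # throwaway location for the demo

# dbm (like the old bsddb) only stores bytes, so convert at the boundary
def put(db, key, n):
    db[key.encode()] = str(n).encode()

def get(db, key):
    return int(db[key.encode()].decode())

with dbm.dumb.open(path, 'c') as db:
    put(db, 'a', 1)
    put(db, 'a', get(db, 'a') + 41)
    print(get(db, 'a'))        # the number round-trips through the string store
```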

> Can I use the mmap() feature on bsddb or any .db file? Most of the
> time I do updates, not inserts! I don't want to rewrite all the time
> 300MB file. I want to update it. What I do need for it? Know the
> maximal length of a string value keept in the .db file? Can I get rid
> of locking support in those huge files?

Don't worry about it, updates won't rewrite the entire file.

> Definitely I can improve my algorithm. But I believe I'll always have
> to work with those huge files.

The point of using something like a bsddb file is that you get a system 
which will do the hard work for you of figuring out which bits to keep in 
memory and which bits to leave on disc. The sort of processing you showed, 
which updates the database in a random order but probably doesn't access 
every record, should work well with a bsddb file.

The code you show is dumping the contents of the temporary dictionary into 
the database then clearing the dictionary. Why are you waiting until the 
dictionaries occupy 300MB before doing this? What happens to the runtime if 
you clear out the dictionaries when they occupy 30MB or 3MB?
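A sketch of that idea: flush a small in-memory buffer into the on-disk mapping whenever it reaches a size limit. A plain dict stands in for the bsddb/shelve object here, and the limit is a made-up number to be tuned by measurement:

```python
FLUSH_LIMIT = 50_000               # tune by measurement; far below the 300MB mark

def flush(buffer, db):
    """Merge the in-memory counts into the on-disk mapping, then empty the buffer."""
    for key, n in buffer.items():
        db[key] = db.get(key, 0) + n
    buffer.clear()

def count_words(words, db, limit=FLUSH_LIMIT):
    buffer = {}
    for w in words:
        buffer[w] = buffer.get(w, 0) + 1
        if len(buffer) >= limit:   # flush early, keeping the working set small
            flush(buffer, db)
    flush(buffer, db)              # final flush for whatever remains

db = {}                            # stand-in for the bsddb/shelve mapping
count_words(['a', 'b', 'a', 'c', 'a'], db, limit=2)
print(db)                          # {'a': 3, 'b': 1, 'c': 1}
```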

Questions about what will run fastest are almost impossible to answer 
without actually timing the results on some realistic data. Dumping the 
dictionaries frequently when they are comparatively small will result in 
higher runtime overheads since you will be updating some keys much more 
often, but you save by reducing the working set of your process. If your 
working set gets too large then expect massive increases in runtime, so it 
will be worth making sure that you aren't hitting this point before you 
consider any other optimisations.
