Writing huge Sets() to disk
duncan.booth at invalid.invalid
Mon Jan 17 14:25:54 CET 2005
Martin MOKREJ wrote:
> Duncan Booth wrote:
>> Almost anything you do copies references.
> But what does this?:
> x = 'xxxxx'
Copies a reference to the existing string 'xxxxx' and stores the new
reference in the variable x. If it's a global variable then this will
involve creating a new reference to the existing string 'x' (the variable's
name) and storing both references in the globals() dictionary for the module.
> a = x[2:]
Creates a new string by slicing the one referenced by x, then creates a new
reference to the new string and stores it in the variable 'a' (as above),
then releases the first reference to the new string.
> b = z + len(x)
Creates a new reference to 'z'; creates a new reference to 'x', creates a
new reference to 'len', calls the function (which involves creating and
releasing a tuple), releases the references to 'len' and 'x', invokes the
add operator, releases the reference to 'z', then creates a new reference
to the result while storing it in the variable 'b'. The result of the
addition could be a new object, or it may just be a new reference created
to an existing object.
> dict[a] = b
If 'dict' is a variable of type 'dict' (and it's a bad idea to use the type
name as the variable name), then this creates new references to each of 'a'
and 'b' and stores them in the object 'dict'.
As I said, almost everything you do involves creating and releasing
references to existing objects.
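You can watch this happening in CPython with sys.getrefcount, which reports an object's reference count (plus one for the temporary reference held by the call itself). A minimal sketch; the string is built at run time so interning doesn't confuse the numbers:

```python
import sys

x = 'x' * 5000               # built at run time, so not interned or shared
before = sys.getrefcount(x)  # includes +1 for the call's own temporary reference
y = x                        # assignment copies the reference, not the 5000 chars
after = sys.getrefcount(x)
assert after == before + 1   # one new reference, zero new strings

a = x[2:]                    # slicing, by contrast, builds a new string object
assert a is not x
```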
>> The member variables are all still accessible as member variables
>> until you run your loop at the end to clear them, so no way could
>> Python release them.
> OK, I wanted to know if there's some assignment using a reference,
> which makes the internal garbage collector not to recycle the memory,
> as, for example, the target dictionary still keeps reference to the
> temporary dictionary.
So long as the object is still accessible it won't be freed. You aren't
making anything new reference the temporary dictionary, but so long as the
old references are still there it can't free the memory.
>> Even if they are Shelf objects, I see no reason here why you have to
> I gathered from previous discussion it's faster to use bsddb directly,
> so no shelve.
Until you know that speed of accessing the file is a problem, I would go
for the simple-to-use solution. At present there is no evidence that the
speed of accessing the file is going to be a problem: your problems seem to
stem from excessive memory use which means that page faults are likely to
swamp everything else as far as execution time is concerned.
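For comparison, here is roughly what the shelve route looks like: Shelf objects pickle values on the way in and out, so no manual string conversion is needed (the file path here is only an illustration):

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'words.db')

with shelve.open(path) as words:
    words['python'] = 42        # an int, stored without any str() wrapping
    words['pair'] = (1, 2)      # tuples pickle transparently too

with shelve.open(path) as words: # reopen to show the values persisted
    assert words['python'] == 42
    assert words['pair'] == (1, 2)
```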
>> What on earth is the point of an explicit call to __add__? If Guido
>> had meant us to use __add__ he woudn't have created '+'.
> To invoke additon directly on the object. It's faster than letting
> python to figure out that I sum up int() plus int(). It definitely
> has helped a lot when using Decimal(a) + Decimal(b), where I got rid
> of thousands of Decimal(__new__), __init__ and I think few other
> methods of decimal as well - I think getcontext too.
The same point as above: it is most unlikely that these additions are going
to make a noticeable difference to the running time, so leave them as easy
to read additions until you have profiled the code and determined that they
really are a hotspot that needs optimising. The less time you spend
optimising prematurely, the sooner you get a program that meets your needs.
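On recent CPython versions the explicit __add__ call is, if anything, usually slower than '+', since each call pays for an attribute lookup on the instance; timeit makes it easy to check on your own machine instead of guessing:

```python
import timeit

# Time a million additions each way; setup binds a and b once per timer.
plain = timeit.timeit('a + b', setup='a, b = 1, 2', number=1_000_000)
dunder = timeit.timeit('a.__add__(b)', setup='a, b = 1, 2', number=1_000_000)

print(f"a + b       : {plain:.3f}s")
print(f"a.__add__(b): {dunder:.3f}s")
```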
>> tuple of your values then you don't have to use split or cast the
> bsddb complains that I can store only a string as a key or value.
> I'd love to store tuple.
>>>> import bsddb
>>>> _words1 = bsddb.btopen('db1.db', 'c')
>>>> _words1['a'] = 1
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> File "/usr/lib/python2.3/bsddb/__init__.py", line 120, in __setitem__
> self.db[key] = value
> TypeError: Key and Data values must be of type string or None.
> How can I record a number then?
Either by wrapping your database access in code which converts the values
to/from strings, or by using a package which already does that (such as
shelve).
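A sketch of the wrapping approach, using the standard library's dbm module (bsddb itself is no longer in the standard library) and a hypothetical helper name; the counts are stored as decimal strings and parsed back out:

```python
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'counts')
db = dbm.open(path, 'c')

def add_count(db, word, n):
    """Hypothetical helper: keep int counts as decimal strings in the db."""
    old = int(db[word]) if word in db else 0
    db[word] = str(old + n)      # dbm encodes str keys/values to bytes

add_count(db, 'spam', 3)
add_count(db, 'spam', 4)
assert int(db['spam']) == 7
db.close()
```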
> Can I use the mmap() feature on bsddb or any .db file? Most of the
> time I do updates, not inserts! I don't want to rewrite all the time
> 300MB file. I want to update it. What I do need for it? Know the
> maximal length of a string value keept in the .db file? Can I get rid
> of locking support in those huge files?
Don't worry about it; updates won't rewrite the entire file.
> Definitely I can improve my algorithm. But I believe I'll always have
> to work with those huge files.
The point of using something like a bsddb file is that you get a system
which will do the hard work for you of figuring out which bits to keep in
memory and which bits to leave on disc. The sort of processing you showed,
which updates the database in a random order but probably doesn't access
every record, should work well with a bsddb file.
The code you show is dumping the contents of the temporary dictionary into
the database and then clearing the dictionary. Why are you waiting until the
dictionaries occupy 300MB before doing this? What happens to the runtime if
you clear out the dictionaries when they occupy 30MB or 3MB?
Questions about what will run fastest are almost impossible to answer
without actually timing the results on some realistic data. Dumping the
dictionaries frequently when they are comparatively small will result in
higher runtime overheads since you will be updating some keys much more
often, but you save by reducing the working set of your process. If your
working set gets too large then expect massive increases in runtime, so it
will be worth making sure that you aren't hitting this point before you
consider any other optimisations.
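One way to experiment with the threshold is to put the dump-and-clear step behind a size check. Everything here is illustrative (the names, the 3MB figure, shelve standing in for direct bsddb access), and note that sys.getsizeof on a dict measures only the dict's own structure, not the keys and values it holds:

```python
import os
import shelve
import sys
import tempfile

FLUSH_AT = 3 * 1024 * 1024           # hypothetical 3MB threshold; tune by timing

path = os.path.join(tempfile.mkdtemp(), 'counts.db')
db = shelve.open(path)
pending = {}

def flush():
    # Merge the in-memory counts into the on-disk counts, then free the memory.
    for word, n in pending.items():
        db[word] = db.get(word, 0) + n
    pending.clear()

def note(word):
    pending[word] = pending.get(word, 0) + 1
    if sys.getsizeof(pending) > FLUSH_AT:   # rough estimate of dict size only
        flush()

for w in ['a', 'b', 'a', 'c', 'a']:
    note(w)
flush()                              # final flush of whatever is left pending

assert db['a'] == 3 and db['b'] == 1
db.close()
```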