Writing huge Sets() to disk

Martin MOKREJŠ mmokrejs at ribosome.natur.cuni.cz
Mon Jan 17 08:14:24 EST 2005


Steve Holden wrote:
> Martin MOKREJŠ wrote:
> 
>> Hi,
>>  could someone tell me what all does and what all doesn't copy
>> references in python. I have found my script after reaching some
>> state and taking say 600MB, pushes it's internal dictionaries
>> to hard disk. The for loop consumes another 300MB (as gathered
>> by vmstat) to push the data to dictionaries, then releases
>> little bit less than 300MB and the program start to fill-up
>> again it's internal dictionaries, when "full" will do the
>> flush again ...
>>
>>  The point here is that this code takes a lot of extra memory.
>> I believe it's the references problem, and I remember complaints
>> from friends facing the same problem. I'm a newbie, yes, but I don't
>> have this problem with Perl. OK, I want to improve my Python
>> knowledge ... :-))
>>
> Right ho! In fact I suspect you are still quite new to programming as a 
> whole, for reasons that may become clear as we proceed.
> 
>>
>>
>>
>>    def push_to_disk(self):
>>        _dict_on_disk_tuple = (None, self._dict_on_disk1, 
>> self._dict_on_disk2, self._dict_on_disk3, self._dict_on_disk4, 
>> self._dict_on_disk5, self._dict_on_disk6, self._dict_on_disk7, 
>> self._dict_on_disk8, self._dict_on_disk9, self._dict_on_disk10, 
>> self._dict_on_disk11, self._dict_on_disk12, self._dict_on_disk13, 
>> self._dict_on_disk14, self._dict_on_disk15, self._dict_on_disk16, 
>> self._dict_on_disk17, self._dict_on_disk18, self._dict_on_disk19, 
>> self._dict_on_disk20)

The None above is there just so that I don't have to evaluate an index
correction like "x+1" all the time: with the placeholder in slot 0, the
word size x indexes the tuple directly:

for x in range(1, len(_dict_on_disk_tuple)):
    key = ....
    _dict_on_disk_tuple[x][key] = ...

This None doesn't hurt me; x then contains exactly the value I need several
times in the loop ...
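
For example, the same trick in a toy form (just an illustration with made-up
data, not my real code):

disk_dicts = [None] + [{} for i in range(20)]    # slots 1..20, slot 0 unused

for word in ('a', 'to', 'the', 'word'):
    size = len(word)                             # 1..20
    disk_dicts[size][word] = disk_dicts[size].get(word, 0) + 1

The placeholder in slot 0 just keeps the word size and the index identical.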

> 
> 
> It's a bit unfortunate that all those instance variables are global to 
> the method, as it means we can't clearly see what you intend them to do. 
> However ...
> 
> Whenever I see such code, it makes me suspect that the approach to the 
> problem could be more subtle. It appears you have decided to partition 
> your data into twenty chunks somehow. The algorithm is clearly not coded 
> in a way that would make it easy to modify the number of chunks.

No, it's not, but that's not the speed problem, really. ;)

> 
> [Hint: by "easy" I mean modifying a statement that reads
> 
>     chunks = 20
> 
> to read
> 
>     chunks = 40
> 
> for example]. To avoid this, we might use (say) a list of temp dicts
> (the length of which could then easily be parameterized as mentioned).
> So where (my psychic powers tell me) your __init__() method
> currently contains
> 
>     self._dict_on_disk1 = something()
>     self._dict_on_disk2 = something()
>         ...
>     self._dict_on_disk20 = something()

Almost. They are bsddb dictionary files.

> 
> I would have written
> 
>     self._disk_dicts = []
>     for i in range(20):
>         self._disk_dicts.append(something)
> 
> Then again, I probably have an advantage over you. I'm such a crappy 
> typist I can guarantee I'd make at least six mistakes doing it your way :-)

Originally I had this; I just wanted to get rid of one small list. ;)
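
For the record, with the bsddb files it would be something like this inside
__init__() (untested sketch -- the file names are the ones from the listing
below, and whether btopen() or hashopen() is right depends on the format the
.db files already have):

import bsddb

# slot 0 stays the None placeholder, so the word size indexes directly
self._disk_dicts = [None]
for i in range(1, 21):
    self._disk_dicts.append(bsddb.btopen('diskdict%d.db' % i, 'c'))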

> 
>>        _size = 0
> 
> 
> What with all these leading underscores I presume it must be VERY 
> important to keep this object's instance variables private. Do you have 
> a particular reason for that, or just general Perl-induced paranoia? :-)

See below. ;)

> 
>>        #
>>        # sizes of these tmpdicts range from 10-10000 entries for each!
>>        for _tmpdict in (self._tmpdict1, self._tmpdict2, 
>> self._tmpdict3, self._tmpdict4, self._tmpdict5, self._tmpdict6, 
>> self._tmpdict7, self._tmpdict8, self._tmpdict9, self._tmpdict10, 
>> self._tmpdict11, self._tmpdict12, self._tmpdict13, self._tmpdict14, 
>> self._tmpdict15, self._tmpdict16, self._tmpdict17, self._tmpdict18, 
>> self._tmpdict19, self._tmpdict20):
>>            _size += 1
>>            if _tmpdict:
>>                _dict_on_disk = _dict_on_disk_tuple[_size]
>>                for _word, _value in _tmpdict.iteritems():
>>                    try:
>>                        _string = _dict_on_disk[_word]
>>                        # I discard _a and _b, maybe _string.find(' ') 
>> combined with slice would do better?
>>                        _abs_count, _a, _b, _expected_freq = 
>> _string.split()
>>                        _abs_count = int(_abs_count).__add__(_value)
>>                        _t = (str(_abs_count), '0', '0', '0')
>>                    except KeyError:
>>                        _t = (str(_value), '0', '0', '0')
>>
>>                    # this writes a copy to the dict, right?
>>                    _dict_on_disk[_word] = ' '.join(_t)
>>
>>        #
>>        # clear the temporary dictionaries in ourself
>>        # I think this works as expected and really does release memory
>>        #
>>        for _tmpdict in (self._tmpdict1, self._tmpdict2, 
>> self._tmpdict3, self._tmpdict4, self._tmpdict5, self._tmpdict6, 
>> self._tmpdict7, self._tmpdict8, self._tmpdict9, self._tmpdict10, 
>> self._tmpdict11, self._tmpdict12, self._tmpdict13, self._tmpdict14, 
>> self._tmpdict15, self._tmpdict16, self._tmpdict17, self._tmpdict18, 
>> self._tmpdict19, self._tmpdict20):
>>            _tmpdict.clear()
>>
> There you go again with that huge tuple. You just like typing, don't 
> you? You already wrote that one out just above. Couldn't you have 
> assigned it to a local variable?

Well, in this case I was looking at what's slow, and wanted to avoid one slice
on a referenced tuple. ;)
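
With the lists from above the whole flush would shrink to something like this
(untested sketch; it keeps the same "count 0 0 0" string format and discards
the last three fields exactly like the posted code does):

def push_to_disk(self):
    # self._tmpdicts and self._disk_dicts are lists with None in slot 0,
    # so the word size indexes them directly
    for size in range(1, len(self._tmpdicts)):
        tmpdict = self._tmpdicts[size]
        if not tmpdict:
            continue
        disk_dict = self._disk_dicts[size]
        for word, value in tmpdict.iteritems():
            try:
                # first field is the absolute count; add the new counts to it
                value += int(disk_dict[word].split()[0])
            except KeyError:
                pass
            disk_dict[word] = '%d 0 0 0' % value
        tmpdict.clear()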

> 
> By the way, remind me again of the reason for the leading None in the 
> _dict_on_disk_tuple, would you?

I was told to code this way - to make it clear that they are internal/local
to the function/method. Is it wrong or just too long? ;-)

> 
> The crucial misunderstanding here might be the meaning of "release 
> memory". While clearing the dictionary will indeed remove references to 
> the objects formerly contained therein, and thus (possibly) render those 
> items subject to garbage collection, that *won't* make the working set 
> (i.e. virtual memory pages allocated to your process's data storage) any 
> smaller. The garbage collector doesn't return memory to the operating 
> system, it merely aggregates it for use in storing new Python objects.

Well, that would be fine, but I thought Python returned the memory immediately.


>>   The above routine doesn't release of the memory back when it
>> exits.
>>
> And your evidence for this assertion is ...?

Well, the reserved swap space grows during that posted loop.
Tell me, there were about 5 local variables in that loop.
Their contents get constantly updated while the loop iterates, so they
shouldn't need more and more space, right? But they do! It's 300MB.
It seems to me that the variable always points to a new value, but the old
value still persists in memory, not only in the source tmpdictionary
but also somewhere else, as it used to be referenced by that local variable.
I'd say those 300MB account for those references.

Most of the space is returned to the system (really, swap gets freed)
when the loop gets to the point where the dictionaries are cleared.
I see more or less the same behaviour even without swap; it just
happened to be easily visible because of the swap problem.

> 
>>
>>   See, the loop takes 25 minutes already, and it keeps getting longer
>> as the program is only about 1/3 or 1/4 of the way through the total input.
>> The rest of my code is fast in contrast to this (below 1 minute).
>>
>> -rw-------  1 mmokrejs users 257376256 Jan 17 11:38 diskdict12.db
>> -rw-------  1 mmokrejs users 267157504 Jan 17 11:35 diskdict11.db
>> -rw-------  1 mmokrejs users 266534912 Jan 17 11:28 diskdict10.db
>> -rw-------  1 mmokrejs users 253149184 Jan 17 11:21 diskdict9.db
>> -rw-------  1 mmokrejs users 250232832 Jan 17 11:14 diskdict8.db
>> -rw-------  1 mmokrejs users 246349824 Jan 17 11:07 diskdict7.db
>> -rw-------  1 mmokrejs users 199999488 Jan 17 11:02 diskdict6.db
>> -rw-------  1 mmokrejs users  66584576 Jan 17 10:59 diskdict5.db
>> -rw-------  1 mmokrejs users   5750784 Jan 17 10:57 diskdict4.db
>> -rw-------  1 mmokrejs users    311296 Jan 17 10:57 diskdict3.db
>> -rw-------  1 mmokrejs users 295895040 Jan 17 10:56 diskdict20.db
>> -rw-------  1 mmokrejs users 293634048 Jan 17 10:49 diskdict19.db
>> -rw-------  1 mmokrejs users 299892736 Jan 17 10:43 diskdict18.db
>> -rw-------  1 mmokrejs users 272334848 Jan 17 10:36 diskdict17.db
>> -rw-------  1 mmokrejs users 274825216 Jan 17 10:30 diskdict16.db
>> -rw-------  1 mmokrejs users 273104896 Jan 17 10:23 diskdict15.db
>> -rw-------  1 mmokrejs users 272678912 Jan 17 10:18 diskdict14.db
>> -rw-------  1 mmokrejs users 260407296 Jan 17 10:13 diskdict13.db
>>
>>    Some spoke about mmapped files. Could I take advantage of that
>> with the bsddb module?
>>
> No.

Let me ask in a different way: how do you update huge files when it has
to happen, say, a thousand times?

> 
>>    Is gdbm better in some ways? Recently you said dictionary
>> operations are fast ... Once more: I want to turn off locking support.
>> I can make the values strings of fixed size, if mmap() were
>> available. The number of keys doesn't grow much over time; mostly
>> there are only updates.
>>
> Also (possibly because I come late to this thread) I don't really 
> understand your caching strategy. I presume at some stage you look in 
> one of the twenty temp dicts, and if you don't find something you read 
> it back in from disk?

I parse some input file and get one list containing mixed-in words of sizes
1 to 20. The *cache* here means that once some amount of data has accumulated
in the list, the list is converted to a dictionary in memory, as there are
many repeated items in the list. Then I copy the contents of that temporary
dictionary to the dictionaries located on disk (the code moving the contents
of the temporary dictionaries to those on disk is the one posted).
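
In code the *cache* step is roughly this (simplified sketch; words is the
parsed list, tmpdicts is the list of temporary dictionaries with the None
placeholder in slot 0):

def cache_words(words, tmpdicts):
    # repeated words are only counted in memory here, so each distinct
    # word touches the disk dictionary once per flush
    for word in words:
        tmpdict = tmpdicts[len(word)]
        tmpdict[word] = tmpdict.get(word, 0) + 1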

> 
> This whole thing seems a little disorganized. Perhaps if you started 
> with a small dataset your testing and development work would proceed 
> more quickly, and you'd be less intimidated by the clear need to 
> refactor your code.

I did, and the bottlenecks were in the code. Now the bottleneck is constantly
re-writing almost 20 files with sizes of about 200MB and up.

OK. I think I'll do it another way. I'll generate words of just one size
per pass, flush the results to disk, and loop back. With this scenario
I'll read the input 20 times, which seems to be less expensive than
having 20 huge dictionaries on disk constantly updated. Possibly I won't
be able to keep the dictionary in memory and flush it just once to disk,
as its size might be about a gigabyte ...? I don't know; I'll probably have
to keep the *cache* method mentioned above and update the files several
times.
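
Roughly like this (untested sketch; get_words() is a stand-in for the real
parser restricted to a single word size, and 'input.txt' is just a
placeholder name):

import bsddb

def one_pass(input_filename, size):
    counts = {}
    for word in get_words(input_filename, size):   # only words of this size
        counts[word] = counts.get(word, 0) + 1
    db = bsddb.btopen('diskdict%d.db' % size, 'c')
    for word, value in counts.iteritems():
        try:
            value += int(db[word].split()[0])      # merge with what's on disk
        except KeyError:
            pass
        db[word] = '%d 0 0 0' % value
    db.close()

for size in range(1, 21):
    one_pass('input.txt', size)                    # 20 passes over the input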

This is the part I can refactor, but then I'll overwrite those huge files
maybe only 20x or 200x instead of 20000 times. Still, I'll search for
a way to update them efficiently. Maybe even mmapped plaintext files could be
updated more efficiently than these .db files. Hmm, I'd probably have
to keep the position of each fixed-size value in such a file in memory.
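
Something along these lines, maybe (sketch only -- the record length and the
file name are made up, and the offsets dictionary would have to be kept in
memory or rebuilt on startup):

RECORD_LEN = 32                        # every value padded to this many bytes

def update_record(f, offsets, word, text):
    # f is a plain file opened with 'r+b'; offsets maps word -> byte offset
    data = text.ljust(RECORD_LEN)
    assert len(data) == RECORD_LEN
    if word in offsets:
        f.seek(offsets[word])          # overwrite the old record in place
    else:
        f.seek(0, 2)                   # append a new record at the end
        offsets[word] = f.tell()
    f.write(data)

# usage, for example: f = open('diskdict12.txt', 'r+b'); offsets = {}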

Thanks for help!
Martin


