Writing huge Sets() to disk

Steve Holden steve at holdenweb.com
Mon Jan 17 07:20:00 EST 2005


Martin MOKREJŠ wrote:

> Hi,
>  could someone tell me what does and what doesn't copy
> references in Python? I have found that my script, after reaching some
> state and taking say 600MB, pushes its internal dictionaries
> to hard disk. The for loop consumes another 300MB (as gathered
> by vmstat) to push the data to dictionaries, then releases
> a little less than 300MB, and the program starts to fill up
> its internal dictionaries again; when "full" it will do the
> flush again ...
> 
>  The point here is that this code takes a lot of extra memory.
> I believe it's the references problem, and I remember complaints
> from friends facing the same problem. I'm a newbie, yes, but I don't
> have this problem with Perl. OK, I want to improve my Python
> knowledge ... :-))
> 
Right ho! In fact I suspect you are still quite new to programming as a 
whole, for reasons that may become clear as we proceed.
> 
> 
> 
>    def push_to_disk(self):
>        _dict_on_disk_tuple = (None, self._dict_on_disk1,
>            self._dict_on_disk2, self._dict_on_disk3, self._dict_on_disk4,
>            self._dict_on_disk5, self._dict_on_disk6, self._dict_on_disk7,
>            self._dict_on_disk8, self._dict_on_disk9, self._dict_on_disk10,
>            self._dict_on_disk11, self._dict_on_disk12, self._dict_on_disk13,
>            self._dict_on_disk14, self._dict_on_disk15, self._dict_on_disk16,
>            self._dict_on_disk17, self._dict_on_disk18, self._dict_on_disk19,
>            self._dict_on_disk20)

It's a bit unfortunate that all those instance variables are global to 
the method, as it means we can't clearly see what you intend them to do. 
However ...

Whenever I see such code, it makes me suspect that the approach to the 
problem could be more subtle. It appears you have decided to partition 
your data into twenty chunks somehow. The algorithm is clearly not coded 
in a way that would make it easy to modify the number of chunks.

[Hint: by "easy" I mean modifying a statement that reads

     chunks = 20

to read

     chunks = 40

for example]. To avoid this, we might use (say) a list of temp dicts
(whose length could then easily be parameterized as mentioned). So where
(my psychic powers tell me) your __init__() method currently contains

     self._dict_on_disk1 = something()
     self._dict_on_disk2 = something()
         ...
     self._dict_on_disk20 = something()

I would have written

     self._disk_dicts = []
     for i in range(20):
         self._disk_dicts.append(something())

Then again, I probably have an advantage over you. I'm such a crappy 
typist I can guarantee I'd make at least six mistakes doing it your way :-)
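
To make the shape concrete, here's a minimal sketch of how the whole 
thing might look with the chunk count parameterized. The class name, 
the figure 20 and the placeholder something() are mine, standing in 
for whatever your real code uses to create each on-disk dictionary:

     def something():
         # stand-in for whatever creates one on-disk dictionary
         # (bsddb.btopen(...) or similar)
         return {}

     class Counter:
         def __init__(self, chunks=20):
             # one in-memory dict and one on-disk dict per chunk
             self._tmpdicts = [{} for i in range(chunks)]
             self._disk_dicts = [something() for i in range(chunks)]

         def push_to_disk(self):
             for tmpdict, disk_dict in zip(self._tmpdicts, self._disk_dicts):
                 for word, value in tmpdict.iteritems():
                     try:
                         abs_count = int(disk_dict[word].split()[0]) + value
                     except KeyError:
                         abs_count = value
                     disk_dict[word] = ' '.join((str(abs_count), '0', '0', '0'))
                 tmpdict.clear()

Changing the number of chunks is then a one-character edit, and 
push_to_disk() never has to name twenty attributes at all.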

>        _size = 0

What with all these leading underscores I presume it must be VERY 
important to keep this object's instance variables private. Do you have 
a particular reason for that, or just general Perl-induced paranoia? :-)

>        #
>        # sizes of these tmpdicts range from 10-10000 entries for each!
>        for _tmpdict in (self._tmpdict1, self._tmpdict2, self._tmpdict3,
>                self._tmpdict4, self._tmpdict5, self._tmpdict6, self._tmpdict7,
>                self._tmpdict8, self._tmpdict9, self._tmpdict10, self._tmpdict11,
>                self._tmpdict12, self._tmpdict13, self._tmpdict14, self._tmpdict15,
>                self._tmpdict16, self._tmpdict17, self._tmpdict18, self._tmpdict19,
>                self._tmpdict20):
>            _size += 1
>            if _tmpdict:
>                _dict_on_disk = _dict_on_disk_tuple[_size]
>                for _word, _value in _tmpdict.iteritems():
>                    try:
>                        _string = _dict_on_disk[_word]
>                        # I discard _a and _b, maybe _string.find(' ')
>                        # combined with slice would do better?
>                        _abs_count, _a, _b, _expected_freq = _string.split()
>                        _abs_count = int(_abs_count).__add__(_value)
>                        _t = (str(_abs_count), '0', '0', '0')
>                    except KeyError:
>                        _t = (str(_value), '0', '0', '0')
> 
>                    # this writes a copy to the dict, right?
>                    _dict_on_disk[_word] = ' '.join(_t)
> 
>        #
>        # clear the temporary dictionaries in ourself
>        # I think this works as expected and really does release memory
>        #
>        for _tmpdict in (self._tmpdict1, self._tmpdict2, self._tmpdict3,
>                self._tmpdict4, self._tmpdict5, self._tmpdict6, self._tmpdict7,
>                self._tmpdict8, self._tmpdict9, self._tmpdict10, self._tmpdict11,
>                self._tmpdict12, self._tmpdict13, self._tmpdict14, self._tmpdict15,
>                self._tmpdict16, self._tmpdict17, self._tmpdict18, self._tmpdict19,
>                self._tmpdict20):
>            _tmpdict.clear()
> 
There you go again with that huge tuple. You just like typing, don't 
you? You already wrote that one out just above. Couldn't you have 
assigned it to a local variable?

By the way, remind me again of the reason for the leading None in the 
_dict_on_disk_tuple, would you?

The crucial misunderstanding here might be the meaning of "release 
memory". While clearing the dictionary will indeed remove references to 
the objects formerly contained therein, and thus (possibly) render those 
items subject to garbage collection, that *won't* make the working set 
(i.e. virtual memory pages allocated to your process's data storage) any 
smaller. The garbage collector doesn't return memory to the operating 
system, it merely aggregates it for use in storing new Python objects.
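
You can watch this happen. Here's a rough, Linux-only illustration (it 
reads /proc/self/statm, so it won't run elsewhere, and the exact numbers 
will vary): the interpreter's resident set stays almost as large after 
the dictionary is cleared, because the freed objects go back to Python's 
internal allocator rather than to the operating system.

     import resource

     def resident_kb():
         # the second field of /proc/self/statm is resident pages
         pages = int(open('/proc/self/statm').read().split()[1])
         return pages * resource.getpagesize() / 1024

     d = {}
     for i in range(500000):
         d[str(i)] = ' '.join((str(i), '0', '0', '0'))
     print 'after filling: ', resident_kb(), 'kB'

     d.clear()
     print 'after clearing:', resident_kb(), 'kB'   # scarcely smaller

The memory isn't wasted; it will be reused the next time you fill the 
temporary dictionaries, but don't expect vmstat to show it being handed 
back to the system.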
> 
> 
> 
>   The above routine doesn't release the memory back when it
> exits.
> 
And your evidence for this assertion is ...?
> 
>   See, the loop already takes 25 minutes, and it's getting longer
> as the program is only about 1/3 or 1/4 of the way through the total
> input. The rest of my code is fast in contrast to this (below 1 minute).
> 
> -rw-------  1 mmokrejs users 257376256 Jan 17 11:38 diskdict12.db
> -rw-------  1 mmokrejs users 267157504 Jan 17 11:35 diskdict11.db
> -rw-------  1 mmokrejs users 266534912 Jan 17 11:28 diskdict10.db
> -rw-------  1 mmokrejs users 253149184 Jan 17 11:21 diskdict9.db
> -rw-------  1 mmokrejs users 250232832 Jan 17 11:14 diskdict8.db
> -rw-------  1 mmokrejs users 246349824 Jan 17 11:07 diskdict7.db
> -rw-------  1 mmokrejs users 199999488 Jan 17 11:02 diskdict6.db
> -rw-------  1 mmokrejs users  66584576 Jan 17 10:59 diskdict5.db
> -rw-------  1 mmokrejs users   5750784 Jan 17 10:57 diskdict4.db
> -rw-------  1 mmokrejs users    311296 Jan 17 10:57 diskdict3.db
> -rw-------  1 mmokrejs users 295895040 Jan 17 10:56 diskdict20.db
> -rw-------  1 mmokrejs users 293634048 Jan 17 10:49 diskdict19.db
> -rw-------  1 mmokrejs users 299892736 Jan 17 10:43 diskdict18.db
> -rw-------  1 mmokrejs users 272334848 Jan 17 10:36 diskdict17.db
> -rw-------  1 mmokrejs users 274825216 Jan 17 10:30 diskdict16.db
> -rw-------  1 mmokrejs users 273104896 Jan 17 10:23 diskdict15.db
> -rw-------  1 mmokrejs users 272678912 Jan 17 10:18 diskdict14.db
> -rw-------  1 mmokrejs users 260407296 Jan 17 10:13 diskdict13.db
> 
>    Some spoke about mmapped files. Could I take advantage of that
> with the bsddb module?
> 
No.

>    Is gdbm better in some ways? Recently you said dictionary
> operations are fast ... Once more: I want to turn off locking support.
> I could make the values strings of fixed size, if mmap() were
> available. The number of keys doesn't grow much over time; mostly
> there are only updates.
> 
Also (possibly because I come late to this thread) I don't really 
understand your caching strategy. I presume at some stage you look in 
one of the twenty temp dicts, and if you don't find something you read 
it back in from disk?
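
If that's roughly the idea, one possible shape for it is sketched below. 
This is purely my guess at your requirements; the bsddb.btopen() call, 
the value format and the flush threshold are assumptions, not anything 
taken from your code:

     import bsddb

     class CountCache:
         def __init__(self, filename, max_pending=100000):
             self._disk = bsddb.btopen(filename, 'c')  # persistent word -> string
             self._pending = {}                        # counts not yet flushed
             self._max_pending = max_pending

         def add(self, word, count=1):
             self._pending[word] = self._pending.get(word, 0) + count
             if len(self._pending) >= self._max_pending:
                 self.flush()

         def count(self, word):
             # pending increments plus whatever is already on disk
             return self._pending.get(word, 0) + self._on_disk(word)

         def _on_disk(self, word):
             try:
                 return int(self._disk[word].split()[0])
             except KeyError:
                 return 0

         def flush(self):
             for word, value in self._pending.iteritems():
                 total = self._on_disk(word) + value
                 self._disk[word] = ' '.join((str(total), '0', '0', '0'))
             self._pending.clear()

Once the policy (when to flush, where a lookup goes first) lives in one 
place like this, it's much easier to see where the memory and the 25 
minutes are actually going.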

This whole thing seems a little disorganized. Perhaps if you started 
with a small dataset your testing and development work would proceed 
more quickly, and you'd be less intimidated by the clear need to 
refactor your code.

regards
  Steve
-- 
Steve Holden               http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/
Holden Web LLC      +1 703 861 4237  +1 800 494 3119


