Writing huge Sets() to disk
Steve Holden
steve at holdenweb.com
Mon Jan 17 07:20:00 EST 2005
Martin MOKREJŠ wrote:
> Hi,
> could someone tell me what does and what doesn't copy
> references in Python? I have found that my script, after reaching
> some state and taking say 600MB, pushes its internal dictionaries
> to hard disk. The for loop consumes another 300MB (as gathered
> by vmstat) to push the data to dictionaries, then releases
> slightly less than 300MB, and the program starts to fill up
> its internal dictionaries again; when "full" it will do the
> flush again ...
>
> The point here is that this code takes a lot of extra memory.
> I believe it's the references problem, and I remember complaints
> of friends facing the same problem. I'm a newbie, yes, but I don't
> have this problem with Perl. OK, I want to improve my Python
> knowledge ... :-))
>
Right ho! In fact I suspect you are still quite new to programming as a
whole, for reasons that may become clear as we proceed.
>
> def push_to_disk(self):
>     _dict_on_disk_tuple = (None, self._dict_on_disk1,
>         self._dict_on_disk2, self._dict_on_disk3, self._dict_on_disk4,
>         self._dict_on_disk5, self._dict_on_disk6, self._dict_on_disk7,
>         self._dict_on_disk8, self._dict_on_disk9, self._dict_on_disk10,
>         self._dict_on_disk11, self._dict_on_disk12, self._dict_on_disk13,
>         self._dict_on_disk14, self._dict_on_disk15, self._dict_on_disk16,
>         self._dict_on_disk17, self._dict_on_disk18, self._dict_on_disk19,
>         self._dict_on_disk20)
It's a bit unfortunate that all those instance variables are global to
the method, as it means we can't clearly see what you intend them to do.
However ...
Whenever I see such code, it makes me suspect that the approach to the
problem could be more subtle. It appears you have decided to partition
your data into twenty chunks somehow. The algorithm is clearly not coded
in a way that would make it easy to modify the number of chunks.
[Hint: by "easy" I mean modifying a statement that reads
chunks = 20
to read
chunks = 40
for example]. To avoid this, we might use (say) a list of temp dicts,
whose length could then easily be parameterized as mentioned. So where
(my psychic powers tell me) your __init__() method currently contains
self._dict_on_disk1 = something()
self._dict_on_disk2 = something()
...
self._dict_on_disk20 = something()
I would have written
self._disk_dicts = []
for i in range(20):
    self._disk_dicts.append(something())
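In fact, taking that a step further, the whole setup could hang off a
single parameter. A rough sketch (the class name and the anydbm choice
are mine, for illustration -- substitute whatever you actually use):

    import anydbm

    class WordStore:
        def __init__(self, chunks=20):
            self.chunks = chunks
            # one in-memory dict and one on-disk dict per chunk
            self._tmpdicts = [{} for i in range(chunks)]
            self._disk_dicts = [anydbm.open('diskdict%d.db' % (i + 1), 'c')
                                for i in range(chunks)]

Change chunks=20 to chunks=40 and everything else follows.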
Then again, I probably have an advantage over you. I'm such a crappy
typist I can guarantee I'd make at least six mistakes doing it your way :-)
>     _size = 0
What with all these leading underscores I presume it must be VERY
important to keep this object's instance variables private. Do you have
a particular reason for that, or just general Perl-induced paranoia? :-)
>     #
>     # sizes of these tmpdicts range from 10-10000 entries for each!
>     for _tmpdict in (self._tmpdict1, self._tmpdict2, self._tmpdict3,
>             self._tmpdict4, self._tmpdict5, self._tmpdict6, self._tmpdict7,
>             self._tmpdict8, self._tmpdict9, self._tmpdict10, self._tmpdict11,
>             self._tmpdict12, self._tmpdict13, self._tmpdict14, self._tmpdict15,
>             self._tmpdict16, self._tmpdict17, self._tmpdict18, self._tmpdict19,
>             self._tmpdict20):
>         _size += 1
>         if _tmpdict:
>             _dict_on_disk = _dict_on_disk_tuple[_size]
>             for _word, _value in _tmpdict.iteritems():
>                 try:
>                     _string = _dict_on_disk[_word]
>                     # I discard _a and _b, maybe _string.find(' ')
>                     # combined with slice would do better?
>                     _abs_count, _a, _b, _expected_freq = _string.split()
>                     _abs_count = int(_abs_count).__add__(_value)
>                     _t = (str(_abs_count), '0', '0', '0')
>                 except KeyError:
>                     _t = (str(_value), '0', '0', '0')
>
>                 # this writes a copy to the dict, right?
>                 _dict_on_disk[_word] = ' '.join(_t)
>
>     #
>     # clear the temporary dictionaries in ourself
>     # I think this works as expected and really does release memory
>     #
>     for _tmpdict in (self._tmpdict1, self._tmpdict2, self._tmpdict3,
>             self._tmpdict4, self._tmpdict5, self._tmpdict6, self._tmpdict7,
>             self._tmpdict8, self._tmpdict9, self._tmpdict10, self._tmpdict11,
>             self._tmpdict12, self._tmpdict13, self._tmpdict14, self._tmpdict15,
>             self._tmpdict16, self._tmpdict17, self._tmpdict18, self._tmpdict19,
>             self._tmpdict20):
>         _tmpdict.clear()
>
There you go again with that huge tuple. You just like typing, don't
you? You already wrote that one out just above. Couldn't you have
assigned it to a local variable?
By the way, remind me again of the reason for the leading None in the
_dict_on_disk_tuple, would you?
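For what it's worth, with the two parallel lists suggested above the
flush loop needs neither the duplicated tuple nor the placeholder None.
A rough sketch, assuming the list-based attributes from my earlier
example:

    def push_to_disk(self):
        # zip() pairs each in-memory dict with its on-disk partner,
        # so there's no manual index to keep in step
        for tmpdict, disk_dict in zip(self._tmpdicts, self._disk_dicts):
            if not tmpdict:
                continue
            for word, value in tmpdict.iteritems():
                try:
                    abs_count = int(disk_dict[word].split()[0]) + value
                except KeyError:
                    abs_count = value
                disk_dict[word] = '%d 0 0 0' % abs_count
            tmpdict.clear()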
The crucial misunderstanding here might be the meaning of "release
memory". While clearing the dictionary will indeed remove references to
the objects formerly contained therein, and thus (possibly) render those
items subject to garbage collection, that *won't* make the working set
(i.e. virtual memory pages allocated to your process's data storage) any
smaller. The garbage collector doesn't return memory to the operating
system, it merely aggregates it for use in storing new Python objects.
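Don't just take my word for it; on Linux you can watch the process's
resident set directly. The rss_kb() helper below is my own (it just
reads /proc), not anything from the standard library:

    def rss_kb():
        # current resident set size in kB, from the Linux proc filesystem
        for line in open('/proc/self/status'):
            if line.startswith('VmRSS:'):
                return int(line.split()[1])

    d = {}
    for i in xrange(500000):
        d[i] = ' ' * 100
    print 'after fill:', rss_kb(), 'kB'
    d.clear()    # references dropped, objects collectable ...
    print 'after clear:', rss_kb(), 'kB'   # ... but RSS barely moves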
>
> The above routine doesn't release the memory back when it
> exits.
>
And your evidence for this assertion is ...?
>
> See, the loop takes 25 minutes already, and it keeps getting
> longer; the program is only about 1/3 or 1/4 of the way through
> the total input.
> The rest of my code is fast in contrast to this (below 1 minute).
>
> -rw------- 1 mmokrejs users 257376256 Jan 17 11:38 diskdict12.db
> -rw------- 1 mmokrejs users 267157504 Jan 17 11:35 diskdict11.db
> -rw------- 1 mmokrejs users 266534912 Jan 17 11:28 diskdict10.db
> -rw------- 1 mmokrejs users 253149184 Jan 17 11:21 diskdict9.db
> -rw------- 1 mmokrejs users 250232832 Jan 17 11:14 diskdict8.db
> -rw------- 1 mmokrejs users 246349824 Jan 17 11:07 diskdict7.db
> -rw------- 1 mmokrejs users 199999488 Jan 17 11:02 diskdict6.db
> -rw------- 1 mmokrejs users 66584576 Jan 17 10:59 diskdict5.db
> -rw------- 1 mmokrejs users 5750784 Jan 17 10:57 diskdict4.db
> -rw------- 1 mmokrejs users 311296 Jan 17 10:57 diskdict3.db
> -rw------- 1 mmokrejs users 295895040 Jan 17 10:56 diskdict20.db
> -rw------- 1 mmokrejs users 293634048 Jan 17 10:49 diskdict19.db
> -rw------- 1 mmokrejs users 299892736 Jan 17 10:43 diskdict18.db
> -rw------- 1 mmokrejs users 272334848 Jan 17 10:36 diskdict17.db
> -rw------- 1 mmokrejs users 274825216 Jan 17 10:30 diskdict16.db
> -rw------- 1 mmokrejs users 273104896 Jan 17 10:23 diskdict15.db
> -rw------- 1 mmokrejs users 272678912 Jan 17 10:18 diskdict14.db
> -rw------- 1 mmokrejs users 260407296 Jan 17 10:13 diskdict13.db
>
> Some spoke about mmapped files. Could I take advantage of that
> with the bsddb module?
>
No.
> Is gdbm better in some ways? Recently you have said dictionary
> operations are fast ... Once more: I want to turn off locking support.
> I can make the values strings of fixed size, if mmap() were
> available. The number of keys doesn't grow much over time; mostly
> there are only updates.
>
Also (possibly because I come late to this thread) I don't really
understand your caching strategy. I presume at some stage you look in
one of the twenty temp dicts, and if you don't find something you read
it back in from disk?
This whole thing seems a little disorganized. Perhaps if you started
with a small dataset your testing and development work would proceed
more quickly, and you'd be less intimidated by the clear need to
refactor your code.
regards
Steve
--
Steve Holden http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/
Holden Web LLC +1 703 861 4237 +1 800 494 3119