Size of dictionary

Pete Goodeve pete at jwgibbs.cchem.Berkeley.EDU
Thu May 29 15:43:48 EDT 2003


I'm seeing some rather surprising (to me) behaviour when I save a
dictionary to a file with cPickle.  When I remove a fraction of the
existing entries in the dictionary, the size of the saved file can
*increase* by something like 50%!

(To be specific, I have a Bayesian spam filter that maintains a table
of words and probabilities, and when the file starts to get too large
I want to reduce its size by dumping lower-probability words.  Having
it become *larger* doesn't help!)

In fact, size seems to be surprisingly unrelated to content.  I first
noticed it when I tried to trim 500 words out of 5500; the original
file was 80K -- it became 120K after 'trimming'!  Working with a test
file of 3963 bytes, I added one word ('snork', to be precise...) and
the result was *3910* bytes.  I removed it again, and the file dropped
in size *again*, to 3895...
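
For anyone who wants to try the experiment, here is a minimal sketch of
the kind of measurement described above.  It is not the actual filter
code: the table contents, the helper names (pickled_size, trim) and the
0.1 threshold are all made up for illustration, and it uses the pickle
module of current Python rather than cPickle, so the byte counts it
prints need not match the figures quoted here.

```python
import pickle
import random


def pickled_size(table, protocol=0):
    """Return the number of bytes pickle produces for `table`."""
    return len(pickle.dumps(table, protocol))


def trim(table, threshold):
    """Return a copy of `table` without entries below `threshold`."""
    return {word: p for word, p in table.items() if p >= threshold}


# Hypothetical stand-in for the word/probability table (not real data).
rng = random.Random(0)
table = {"word%d" % i: rng.random() for i in range(5500)}
trimmed = trim(table, 0.1)

size_before = pickled_size(table)
size_after = pickled_size(trimmed)
print(len(table), "entries:", size_before, "bytes")
print(len(trimmed), "entries:", size_after, "bytes")
```

Since len(pickle.dumps(...)) is the number of bytes that would be
written to the file, this compares saved sizes without touching the disk.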

I realize this is due to the hashing used, but it still seems a bit
odd.  My question really is: is there any way to minimize the size
of the file (I assume it corresponds to the dictionary itself)?
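
One thing that can be measured when looking for a smaller file is the
pickle protocol.  The sketch below is only an assumption about what
might help, not a confirmed fix for the behaviour above: it pickles
the same made-up table with the text protocol (0) and the binary
protocols (1 and 2) and prints each size.  The table is hypothetical,
and which protocol wins can depend on the data (short float values,
for instance, take few characters as text).

```python
import pickle
import random

# Hypothetical word/probability table, not real filter data.
rng = random.Random(0)
table = {"word%d" % i: rng.random() for i in range(5500)}

# Pickled size in bytes for the text protocol and two binary protocols.
sizes = {protocol: len(pickle.dumps(table, protocol)) for protocol in (0, 1, 2)}

for protocol, size in sorted(sizes.items()):
    print("protocol", protocol, "->", size, "bytes")
```

The protocol is the second argument to dumps() and dump(); loading
works the same way regardless of which one was used to save.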

Ta,
					-- Pete --




