[Spambayes] Database reduction
Neale Pickett
neale@woozle.org
Mon Nov 4 17:58:06 2002
So then, Skip Montanaro <skip@pobox.com> is all like:
> Neale> When pickling a Bayes object, the pickler is smart enough not to
> Neale> repeatedly say "this is a wordinfo object" but rather, I assume,
> Neale> "this is of type 2", only having to name type 2 once. However,
> Neale> hammie pickles each wordinfo individually, keyed by a string.
> Neale> This makes for fast lookups, but giant databases.
>
> You can always define your own __getstate__ and __setstate__ methods for the
> WordInfo class which process a more compact form of the object's state.
> Or am I misunderstanding what you said?
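For the record, Skip's suggestion would look something like this sketch. The field names on the stand-in WordInfo here are illustrative, not necessarily the real ones from classifier.py:

```python
import pickle

# Hypothetical stand-in for classifier.WordInfo; field names are
# illustrative, not necessarily the real ones.
class WordInfo:
    def __init__(self, word, spamcount=0, hamcount=0, spamprob=0, atime=0):
        self.word = word
        self.spamcount = spamcount
        self.hamcount = hamcount
        self.spamprob = spamprob
        self.atime = atime

    def __getstate__(self):
        # Pickle a bare tuple of values instead of the attribute dict,
        # so the attribute names aren't stored with every instance.
        return (self.word, self.spamcount, self.hamcount,
                self.spamprob, self.atime)

    def __setstate__(self, state):
        (self.word, self.spamcount, self.hamcount,
         self.spamprob, self.atime) = state

w = WordInfo('aoeu', atime=2)
restored = pickle.loads(pickle.dumps(w, 1))
```

This shrinks the per-instance state, but as the transcript below shows, most of the cost is elsewhere.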
Perhaps a picture would be worth 1K words:
>>> import classifier
>>> w = classifier.WordInfo('aoeu', 2)
>>> import pickle
>>> w
WordInfo"('aoeu', 0, 0, 0, 2)"
>>> pickle.dumps(w, 1)
'ccopy_reg\n_reconstructor\nq\x00(cclassifier\nWordInfo\nq\x01c__builtin__\nobject\nq\x02Ntq\x03R(U\x04aoeuq\x04K\x00K\x00K\x00K\x02tq\x05bq\x06.'
In case it isn't obvious yet, here's the problem:
>>> len(pickle.dumps(w, 1))
102
>>> len(`w`)
30
So, at least for hammie, you can get roughly a 70% reduction (30 bytes
versus 102) in database size
by *not* pickling WordInfo types. Tim calls this "administrative pickle
bloat", which is the coolest jargon term I've heard all year.
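To make the comparison concrete, here's a sketch with a stand-in WordInfo (hammie's real storage code differs). Each separate dumps() of an instance repeats the module and class bookkeeping, whereas storing the repr of a plain tuple keeps only the values:

```python
import pickle

# Hypothetical stand-in for classifier.WordInfo, just to show the size
# difference; not the actual hammie code.
class WordInfo:
    def __init__(self, word, spamcount=0, hamcount=0, spamprob=0, atime=0):
        self.word, self.spamcount, self.hamcount = word, spamcount, hamcount
        self.spamprob, self.atime = spamprob, atime

    def astuple(self):
        return (self.word, self.spamcount, self.hamcount,
                self.spamprob, self.atime)

w = WordInfo('aoeu', atime=2)

# Pickling the instance repeats the module and class names in every
# separate dumps() call -- the "administrative pickle bloat" -- while
# the repr of a plain tuple carries only the values.
bloated = len(pickle.dumps(w, 1))
compact = len(repr(w.astuple()))
```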
As I understand it, things which pickle the whole Bayes object avoid this
overhead thanks to pickler optimizations along the lines of "if we've
already seen this type, just give it a number and stop referring to it
by name." Thus, I suppose the proper way to get this reduction in
hammie would be to extend the pickler to recognize WordInfo types,
right? If so, I'll add that code in.
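If so, here's roughly the shape it might take: a sketch using the standard pickle persistent_id/persistent_load hooks, with a stand-in WordInfo, so only the values hit the database. Not the actual hammie code, just the mechanism:

```python
import io
import pickle

# Hypothetical stand-in for classifier.WordInfo.
class WordInfo:
    def __init__(self, word, atime=0):
        self.word, self.atime = word, atime

class CompactPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Recognize WordInfo and emit only a tagged tuple of its values;
        # everything else pickles normally (return None).
        if isinstance(obj, WordInfo):
            return ('WordInfo', obj.word, obj.atime)
        return None

class CompactUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        tag, word, atime = pid
        if tag != 'WordInfo':
            raise pickle.UnpicklingError('unknown persistent id %r' % (pid,))
        return WordInfo(word, atime)

buf = io.BytesIO()
CompactPickler(buf, 1).dump(WordInfo('aoeu', 2))
restored = CompactUnpickler(io.BytesIO(buf.getvalue())).load()
```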
Neale