[Spambayes] Database reduction

Neale Pickett neale@woozle.org
Mon Nov 4 17:58:06 2002


So then, Skip Montanaro <skip@pobox.com> is all like:

>     Neale> When pickling a Bayes object, the pickler is smart enough not to
>     Neale> repeatedly say "this is a wordinfo object" but rather, I assume,
>     Neale> "this is of type 2", only having to name type 2 once.  However,
>     Neale> hammie pickles each wordinfo individually, keyed by a string.
>     Neale> This makes for fast lookups, but giant databases.
> 
> You can always define your own __getstate__ and __setstate__ methods for the
> WordInfo class which process a more compact form of the object's state.
> Or am I misunderstanding what you said?

Perhaps a picture would be worth 1K words:

    >>> import classifier
    >>> w = classifier.WordInfo('aoeu', 2)
    >>> import pickle
    >>> w
    WordInfo"('aoeu', 0, 0, 0, 2)"
    >>> pickle.dumps(w, 1)
    'ccopy_reg\n_reconstructor\nq\x00(cclassifier\nWordInfo\nq\x01c__builtin__\nobject\nq\x02Ntq\x03R(U\x04aoeuq\x04K\x00K\x00K\x00K\x02tq\x05bq\x06.'

In case it isn't obvious yet, here's the problem:

    >>> len(pickle.dumps(w, 1))
    102
    >>> len(`w`)
    30

So, at least for hammie, you can cut each record from 102 bytes down to
30, a roughly 70% reduction in database size, by *not* pickling WordInfo
types.  Tim calls this "administrative pickle bloat", which is the
coolest jargon term I've heard all year.

As I understand it, things which pickle the whole Bayes object in one go
avoid this overhead thanks to a pickler optimization along the lines of
"once we've seen this class, give it a memo number and stop referring to
it by name."  Thus, I suppose the proper way to get this reduction in
hammie would be to extend the pickler to recognize WordInfo types,
right?  If so, I'll add that code in.
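
For the curious, here's a quick way to see that optimization at work
(again untested, just a sketch): pickle a thousand WordInfo objects both
ways and compare.

    import pickle
    import classifier

    words = {}
    for i in range(1000):
        words['word%d' % i] = classifier.WordInfo('aoeu', 2)

    # One big pickle: the WordInfo class is named once, then referenced
    # by memo number for the other 999 records.
    one_big = len(pickle.dumps(words, 1))

    # One pickle per record, as hammie does now: the class name and
    # reconstructor machinery are repeated in every single value.
    many_small = 0
    for w in words.values():
        many_small = many_small + len(pickle.dumps(w, 1))

    print one_big, many_small   # one_big should come out much smaller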

Neale