[Spambayes] Database reduction

Mon Nov 4 20:44:35 2002

> In case it isn't obvious yet, here's the problem:
> 
>     >>> len(pickle.dumps(w, 1))
>     102
>     >>> len(`w`)
>     30
> 
> So, at least for hammie, you can get a 66% reduction in database size
> by *not* pickling WordInfo types.  Tim calls this "administrative pickle
> bloat", which is the coolest jargon term I've heard all year.
> 
> As I understand it, things which pickle the Bayes object avoid this
> overhead from some pickler optimizations along the lines of "if we've
> already seen this type, just give it a number and stop referring to it
> by name."  Thus, I suppose the proper way to get this reduction in
> hammie would be to extend the pickler to recognize WordInfo types,
> right?  If so, I'll add that code in.

I'm aware that pickling new-style class instances is inefficient, due
to the gross hack employed.  I'll try to find time to do something
about this in Python 2.3.

You could also experiment with adding a custom __reduce__ method
and/or custom __getstate__ and __setstate__ methods.  Or pickle tuples
instead of WordInfo instances.  Or make WordInfo a classic class
(classic class instances are pickled more efficiently).
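
A rough sketch of the __getstate__/__setstate__ and tuple ideas, written
for a current Python rather than 2.2.  The WordInfo below is a
hypothetical stand-in, and its field names are assumptions, not the real
Spambayes class.  Exact sizes vary by Python version and pickle protocol,
so measure on your own data before committing to either approach.

```python
import pickle


class WordInfo(object):
    """Hypothetical stand-in for the real class; field names are assumed."""

    def __init__(self, atime, spamcount=0, hamcount=0, killcount=0):
        self.atime = atime
        self.spamcount = spamcount
        self.hamcount = hamcount
        self.killcount = killcount

    def __getstate__(self):
        # Store a bare tuple instead of the instance __dict__, so the
        # attribute names are not repeated in every pickle.
        return (self.atime, self.spamcount, self.hamcount, self.killcount)

    def __setstate__(self, state):
        # Rebuild the attributes from the tuple written by __getstate__.
        self.atime, self.spamcount, self.hamcount, self.killcount = state


w = WordInfo(1036442675, 3, 1, 0)

# Pickling the instance still records how to find the class by name.
as_instance = pickle.dumps(w, 1)

# Pickling only the state tuple drops the class reference entirely; the
# caller then has to know that the record is a WordInfo.
as_tuple = pickle.dumps(w.__getstate__(), 1)

restored = pickle.loads(as_instance)
print(len(as_instance), len(as_tuple))
```

The custom state removes the per-attribute names, but only the tuple
variant gets rid of the per-record class reference, which is the part
that matters when every word is pickled on its own.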

--Guido van Rossum (home page: http://www.python.org/~guido/)