[Spambayes] Database reduction
Guido van Rossum
Mon Nov 4 20:44:35 2002
> In case it isn't obvious yet, here's the problem:
> >>> len(pickle.dumps(w, 1))
> >>> len(`w`)
> So, at least for hammie, you can get a 66% reduction in database size
> by *not* pickling WordInfo types. Tim calls this "administrative pickle
> bloat", which is the coolest jargon term I've heard all year.
> As I understand it, things which pickle the Bayes object avoid this
> overhead thanks to a pickler optimization along the lines of "if we've
> already seen this type, just give it a number and stop referring to it
> by name." Thus, I suppose the proper way to get this reduction in
> hammie would be to extend the pickler to recognize WordInfo types,
> right? If so, I'll add that code in.
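The size difference described above is easy to reproduce. Below is a minimal sketch using a hypothetical stand-in for spambayes' WordInfo (a small two-field record); the exact byte counts will differ from the elided ones in the quote, but the instance pickle is reliably much larger than a bare tuple of the same data:

```python
import pickle

# Hypothetical stand-in for spambayes' WordInfo: a small record type.
class WordInfo:
    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount

w = WordInfo(3, 5)

# Pickling the instance carries class-identification and attribute-name
# overhead ("administrative pickle bloat") on top of the actual data...
as_instance = len(pickle.dumps(w, 1))

# ...while pickling a bare tuple of the same counts does not.
as_tuple = len(pickle.dumps((w.spamcount, w.hamcount), 1))

print(as_instance, as_tuple)
```

Running this shows the instance pickle several times the size of the tuple pickle, which is where the quoted 66% figure comes from.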
I'm aware that pickling new-style class instances is inefficient, due
to the gross hack employed. I'll try to find time to do something
about this in Python 2.3.
You could also experiment with adding a custom __reduce__ method
and/or custom __getstate__ and __setstate__ methods. Or pickle tuples
instead of WordInfo instances. Or make WordInfo a classic class
(classic class instances are pickled more efficiently).
--Guido van Rossum (home page: http://www.python.org/~guido/)