[Spambayes] using binary pickles makes for much smaller databases

Richie Hindle richie at entrian.com
Sun Dec 8 15:25:06 EST 2002


> modify Python's shelve.py and Spambayes' storage.py to allow and use binary
> pickles.

Good plan!  (I had no idea that shelve used text pickles.)  Below is an
alternative implementation that avoids the need to change shelve.py, though
it's a slight hack in that a future version of shelve could potentially
break it by not keeping its pickler in a module global called 'Pickler'.
This goes at the top of storage.py:

---------------------------------------------------------------------------

# Make shelve use binary pickles by default.
import shelve
oldShelvePickler = shelve.Pickler
def binaryDefaultPickler(f, binary=1):
    # shelve calls Pickler(f); defaulting binary to 1 selects binary pickles.
    return oldShelvePickler(f, binary)
shelve.Pickler = binaryDefaultPickler

---------------------------------------------------------------------------

With binary pickles I get a 335,872-byte database in 21 seconds, vs. a
679,936-byte database in 26 seconds with text pickles.  These are wall-clock
times on an otherwise-idle Win98 box for training on 200 messages.
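
For anyone who wants to see where the saving comes from without touching a
real database, here is a rough sketch.  It assumes a modern Python 3
interpreter (not the Python this thread was written against), and the
records are made-up (spam count, ham count) tuples, not real Spambayes
data.  It just pickles each value on its own, the way shelve does, once
with the text protocol (0) and once with the original binary protocol (1),
and adds up the sizes:

```python
import pickle

# Hypothetical per-word records: (spam count, ham count).  Illustrative only.
records = {"word%d" % i: (i * 7 % 1000, i * 13 % 1000) for i in range(200)}

def total_pickled_size(values, protocol):
    """Sum the pickled size of each value, pickling values one at a time."""
    return sum(len(pickle.dumps(v, protocol)) for v in values)

text_size = total_pickled_size(records.values(), 0)    # protocol 0: text pickles
binary_size = total_pickled_size(records.values(), 1)  # protocol 1: binary pickles

print("text: %d bytes, binary: %d bytes" % (text_size, binary_size))
```

The on-disk numbers will differ, since the DBM adds its own overhead and
page rounding; this only compares the pickled payloads.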

This is backwards-compatible too - I can still use my existing database
with no problems.  Can anyone see a problem with this code (or is anyone
offended by grubbing around with shelve.Pickler)?  What if, for instance,
one of the DBMs supported by anydbm doesn't support values with embedded
NUL characters?  (Seems unlikely.)
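
Both points can be poked at in isolation.  The sketch below again assumes
a modern Python 3 interpreter, and the record is a made-up value.  It
checks that the unpickler reads text and binary pickles alike (which is
why an existing database keeps working), and that a binary pickle really
can contain a NUL byte where the text one does not:

```python
import pickle

value = (0, 42)  # hypothetical (spam count, ham count) record

text_pickle = pickle.dumps(value, 0)    # protocol 0: text
binary_pickle = pickle.dumps(value, 1)  # protocol 1: binary

# loads() works out the format from the data itself, so old text entries
# and new binary entries can sit side by side in one database.
from_text = pickle.loads(text_pickle)
from_binary = pickle.loads(binary_pickle)

# Binary pickles may contain NUL bytes; the text pickle of this value does not.
has_nul_text = b"\x00" in text_pickle
has_nul_binary = b"\x00" in binary_pickle

print(from_text == from_binary, has_nul_text, has_nul_binary)
```

Whether a particular DBM copes with those NUL bytes is a separate question
that this doesn't answer; it only shows they can turn up in the values.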

Skip, your patch to shelve.py looks like a good candidate for inclusion
into Python itself, assuming there really is no problem using binary
pickles via shelve/anydbm.

-- 
Richie Hindle
richie@entrian.com



