[spambayes-dev] improving dumbdbm's survival chances...

Mon Jul 14 01:25:28 EDT 2003

[Tim]
>> BTW, this code in the spambayes storage.py is revolting (having one
>> module change the documented default behavior of another module is
>> almost always indefensible -- I can't see any reason for this abuse
>> in spambayes):
>>
>> """
>> # Make shelve use binary pickles by default.
>> oldShelvePickler = shelve.Pickler
>> def binaryDefaultPickler(f, binary=1):
>>     return oldShelvePickler(f, binary)
>> shelve.Pickler = binaryDefaultPickler
>> """

[Richie]
> I'll hold up my hand and admit to writing that code, in the interests
> of education.

Ah!  Education is a noble pursuit, so you're forgiven <wink>.

> I know it's bad, but I'd like to know the 'right' way
> to do this.
> In 2.2, shelve used text pickles and had no interface for changing
> that (it's been enhanced in 2.3, but this code predates 2.3).

Maybe I'm missing something:  why do you feel we *need* to use binary
pickles (protocol 1 pickles, in current terminology)?  Text pickles
(protocol 0 pickles, in current terminology) work fine.  Protocol 1 pickles
are probably smaller (given how spambayes uses them), but that's it.  In
Python 2.3, protocol 2 pickles would be smaller still (because proto 2 adds
a dedicated opcode for building 2-tuples, of which spambayes uses a ton --
all but one value in the database), but the code as-is will still force
spambayes to use proto 1.
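
To make the opcode difference concrete, here is a small sketch that disassembles one 2-tuple under each protocol. It is written against a current Python 3 pickle module (where cPickle has been folded into pickle), not the 2.2/2.3 modules under discussion, and the (3, 5) pair and the opcode_names helper are only illustrative stand-ins, not anything taken from spambayes.

```python
import pickle
import pickletools

def opcode_names(value, protocol):
    """Return the names of the pickle opcodes used to serialize value."""
    data = pickle.dumps(value, protocol)
    return [op.name for op, arg, pos in pickletools.genops(data)]

pair = (3, 5)  # hypothetical stand-in for a 2-tuple database value

for proto in (0, 1, 2):
    print(proto, opcode_names(pair, proto))
```

Protocols 0 and 1 build the tuple with a MARK ... TUPLE pair, while protocol 2 uses the single TUPLE2 opcode and starts the pickle with a PROTO opcode.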

So we're forcing 2.2 installations to do something they weren't intended to
do, and preventing 2.3 installations from doing as well as they could do.
It's also a mystery to me whether it's *intended* that the Shelf constructed
by message.py's MessageInfoDB.__init__ will use a protocol 0 or protocol 1
pickle, depending on whether storage.py happens to get imported by an app
before MessageInfoDB.__init__() gets called.  I suspect that's an unintended
accident, and "stuff like that" (uncontrolled side effects on shared
objects, of which modules are one) is a prime source of excruciatingly
subtle mysteries.
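
A toy illustration of that kind of order dependence, using made-up names throughout: lib stands in for a shared library module, import_storage for importing a module that patches it, and make_store for code that trusts the library's default. None of this is the real shelve or spambayes code; it only sketches the mechanism.

```python
import types

# A stand-in for a shared library module (hypothetical, not the real shelve).
lib = types.ModuleType("lib")
lib.default_binary = 0

def make_store():
    """Stand-in for code that relies on whatever lib's default is right now."""
    return {"binary": lib.default_binary}

def import_storage():
    """Stand-in for importing a module that patches lib as a side effect."""
    lib.default_binary = 1

before = make_store()   # built before the patching module was imported
import_storage()
after = make_store()    # same call, different behavior
print(before, after)
```

The two calls to make_store() are textually identical, yet they produce different results purely because of what happened to run in between.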

> Subclassing shelve.Shelf and re-implementing the pieces that pickle
> things would break when a new version of shelve is released - as
> happened between 2.2 and 2.3.

We use such a small subset of Shelf's functionality that it may be easier
overall to implement the bit we need directly than to bother subclassing it.
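
As a rough sketch of what "directly" might look like, assuming all that is needed is pickling values on the way into a dict-like store and unpickling them on the way out: the PickledDict class below is hypothetical, is written for a current Python 3 pickle module, and uses a plain dict where a real dbm object would go.

```python
import pickle

class PickledDict:
    """Hypothetical minimal wrapper: pickle values into any dict-like store."""

    def __init__(self, store, protocol=-1):
        self.store = store
        self.protocol = protocol   # -1 asks for the highest protocol available

    def __getitem__(self, key):
        return pickle.loads(self.store[key])

    def __setitem__(self, key, value):
        self.store[key] = pickle.dumps(value, self.protocol)

    def __delitem__(self, key):
        del self.store[key]

    def __contains__(self, key):
        return key in self.store

db = PickledDict({})       # a plain dict stands in for the dbm object
db["word"] = (3, 5)        # hypothetical 2-tuple value
print(db["word"])
```

The pickle protocol is an explicit argument here, so nothing outside the class has to be patched to change it.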

> 2.3's new writeback feature would have quietly failed

I don't understand:  AFAICT, spambayes doesn't use writeback (and has no use
for it -- none of the spambayes database values are mutable).  If that's so,
how could writeback fail?

> had I re-implemented the 2.2 version of __setitem__ like this:
>
> class OurShelf(shelve.Shelf):
>     def __setitem__(self, key, value):
>         f = StringIO()
>         p = Pickler(f, binary=1)   # This line modified from
>                                    # shelve.Shelf
>         p.dump(value)
>         self.dict[key] = f.getvalue()

An appropriate override for spambayes would be more like

    def __setitem__(self, key, value):
        self.dict[key] = cPickle.dumps(value, -1)

We don't care about writeback, and passing -1 uses the most efficient pickle
format the Python version supports (proto 1 in 2.2, proto 2 in 2.3).  But
once __setitem__ has been reduced to a dirt simple 1-liner, I have to
question the overhead and obscurity involved in an extra layer of
indirection to "hide" it.

[... Hmm.  Proto 2 also adds a 2-byte protocol identifier to the start of
each pickle, and that may more than wipe out the space savings from the new
2-tuple opcode.  Maybe proto 1 is still most compact! ...]
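
One way to settle the question is simply to measure. The sketch below does that on a current Python 3 pickle module with a hypothetical (3, 5) pair; the byte counts it prints depend on the values pickled and on the Python version, so they are no substitute for measuring a real database.

```python
import pickle

pair = (3, 5)  # hypothetical stand-in for a 2-tuple database value

# Pickle size in bytes for each protocol discussed above.
sizes = {proto: len(pickle.dumps(pair, proto)) for proto in (0, 1, 2)}

# The first two bytes of a proto 2 pickle are the protocol identifier.
header = pickle.dumps(pair, 2)[:2]

print(sizes, header)
```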