[spambayes-dev] A new and altogether different bsddb breakage

Tim Peters tim.one at comcast.net
Wed Dec 17 21:47:29 EST 2003


[Tim]
>> The token statistics database now is a single (but large) mapping
>> from short 8-bit strings to 2-tuples of little integers.  The
>> strings are usually less than 16 characters, and never a lot longer
>> than that (the tokenizer truncates very long strings, synthesizing
>> short "skip" tokens as proxies).

[Barry Warsaw]
> The raw bsddb interface wants keys and values to be strings and for
> btree access methods, the length doesn't really matter.  You could
> pickle the 2-tuples or just do something easily splittable like
> '%s|%s' % two_tuple.

We already pickle this stuff, but it goes through the shelve module, so it
pretends to be transparent.  I want to get shelve out of it anyway, because
shelve adds little value at high cost (there are too many layers of
indirection through Python-level methods now -- slooooow).  There are very
few textual sites where pickle<->unpickle dances are needed (that's already
been cleanly factored out).
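
A minimal sketch of what "shelve out of it" could look like, keeping the
string<->tuple conversion in two small functions.  Everything here is
illustrative: the names (TokenStats, encode_counts, and so on) are
hypothetical and not taken from the spambayes code, and a plain dict stands
in for the raw BDB table, which is assumed only to map byte strings to byte
strings.  Both encodings Barry mentions are shown.

```python
import pickle


def encode_counts(counts):
    """Pack a 2-tuple of small ints as b'a|b' (the easily splittable form)."""
    return ("%d|%d" % tuple(counts)).encode("ascii")


def decode_counts(raw):
    """Inverse of encode_counts: b'3|17' -> (3, 17)."""
    a, b = raw.split(b"|")
    return int(a), int(b)


def encode_counts_pickle(counts):
    """Alternative encoding: pickle the tuple directly."""
    return pickle.dumps(tuple(counts), protocol=2)


def decode_counts_pickle(raw):
    """Inverse of encode_counts_pickle."""
    return pickle.loads(raw)


class TokenStats:
    """Hypothetical thin wrapper over a bytes->bytes mapping.

    All conversion happens in get() and put(), so no shelve layer is
    needed between the caller and the underlying table.
    """

    def __init__(self, db, encode=encode_counts, decode=decode_counts):
        self.db = db          # assumption: any mapping of bytes to bytes
        self.encode = encode
        self.decode = decode

    def get(self, token, default=(0, 0)):
        raw = self.db.get(token)
        return default if raw is None else self.decode(raw)

    def put(self, token, counts):
        self.db[token] = self.encode(counts)


if __name__ == "__main__":
    stats = TokenStats({})    # a dict stands in for the BDB table
    stats.put(b"viagra", (3, 17))
    print(stats.get(b"viagra"))
```

Swapping in the pickle pair is a matter of passing
encode=encode_counts_pickle, decode=decode_counts_pickle to the constructor;
the callers do not change either way.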

> Sounds like one BTree table would do the trick there.

Yup.  We're using BDB hash now.  I don't know that this was a conscious
decision.  I'd ask whether BDB hash or BDB BTree would be faster, but I
don't want to put you on the spot <wink>.

>> It would be nice to have other mappings too, like forward and
>> inverse msgid <-> bag_of_tokens maps.  A little-integer timestamp
>> may get added to the 2-tuples.

> Each of those would be a separate table, of course.  bag_of_token maps
> sounds like you'd want to pickle the data value.

They would be very much like the indices we build for full search in
ZCTextIndex.  This is easy to do with ZODB's IO and OO flavors of BTree,
because BTree values can also be BTrees (etc), and all the pieces are
automagically cut down to reasonably small storage chunks then.  I'd ask
whether BDB supports something similar, but ... <heh>.
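
For comparison, here is a rough sketch of the flat alternative Barry
describes: two separate tables, with the bag of tokens pickled as the data
value.  The function names are hypothetical, and plain dicts again stand in
for the tables (assumed to map byte strings to byte strings).  Note that each
update rewrites a whole pickled value, which is the cost the nested-BTree
approach avoids by splitting values into small chunks.

```python
import pickle


def add_message(forward, inverse, msgid, tokens):
    """Record msgid -> tokens in `forward` and token -> msgids in `inverse`."""
    bag = frozenset(tokens)
    forward[msgid] = pickle.dumps(bag, protocol=2)
    for tok in bag:
        raw = inverse.get(tok)
        ids = pickle.loads(raw) if raw is not None else set()
        ids.add(msgid)
        # The whole set is re-pickled and rewritten on every update.
        inverse[tok] = pickle.dumps(ids, protocol=2)


def tokens_for(forward, msgid):
    """Return the bag of tokens stored for msgid (empty if unknown)."""
    raw = forward.get(msgid)
    return pickle.loads(raw) if raw is not None else frozenset()


def messages_with(inverse, token):
    """Return the set of msgids whose bag contains token (empty if none)."""
    raw = inverse.get(token)
    return pickle.loads(raw) if raw is not None else set()


if __name__ == "__main__":
    forward, inverse = {}, {}   # dicts stand in for two separate tables
    add_message(forward, inverse, b"<msg1@example>", [b"free", b"money"])
    add_message(forward, inverse, b"<msg2@example>", [b"free", b"lunch"])
    print(sorted(messages_with(inverse, b"free")))
```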

the-best-thing-to-do-with-consultants-is-shame-them-into-writing-the-
    code-ly y'rs  - tim



