[spambayes-dev] A new and altogether different bsddb breakage

Barry Warsaw barry at python.org
Wed Dec 17 21:17:49 EST 2003


On Wed, 2003-12-17 at 20:36, Tim Peters wrote:

> The token statistics database now is a single (but large) mapping from short
> 8-bit strings to 2-tuples of little integers.  The strings are usually less
> than 16 characters, and never a lot longer than that (the tokenizer
> truncates very long strings, synthesizing short "skip" tokens as proxies).

The raw bsddb interface wants keys and values to be strings, and for
the btree access method the length doesn't really matter.  You could
pickle the 2-tuples, or just store something easily splittable like
'%s|%s' % two_tuple.

Sounds like one BTree table would do the trick there.
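
A minimal sketch of the two value encodings mentioned above, assuming
the 2-tuples hold plain integers.  The helper names are hypothetical,
and a plain dict stands in for the btree table so that no real bsddb
calls are involved; keys and values are byte strings throughout.

```python
import pickle

def encode_counts(two_tuple):
    """Render a 2-tuple of small integers as b'a|b' (hypothetical helper)."""
    # two_tuple must be a real tuple for the % operator to unpack it.
    return ("%s|%s" % two_tuple).encode("ascii")

def decode_counts(raw):
    """Split b'a|b' back into a tuple of two ints."""
    first, second = raw.decode("ascii").split("|")
    return int(first), int(second)

def encode_counts_pickled(two_tuple):
    """Alternative encoding: pickle the tuple as-is."""
    return pickle.dumps(two_tuple, protocol=2)

def decode_counts_pickled(raw):
    return pickle.loads(raw)

# A dict stands in for the single btree table (assumption, not bsddb).
table = {}
table[b"free"] = encode_counts((12, 3))
counts = decode_counts(table[b"free"])
```

The split form keeps values short and readable in a raw dump of the
table; pickling costs a few more bytes per value but would absorb an
extra field, such as a timestamp, without changing the decoder.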

> It would be nice to have other mappings too, like forward and inverse msgid
> <-> bag_of_tokens maps.  A little-integer timestamp may get added to the
> 2-tuples.

Each of those would be a separate table, of course.  The bag_of_tokens
maps sound like a case where you'd want to pickle the data value.
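
A rough sketch of what the forward and inverse maps could look like as
two separate tables with pickled values.  Everything here is an
assumption for illustration: the function names are made up, the bag
of tokens is treated as a set of unique tokens, and two dicts stand in
for the two tables.

```python
import pickle

# Two dicts stand in for two separate tables (assumption, not bsddb).
msgid_to_tokens = {}   # forward map: msgid -> pickled sorted list of tokens
token_to_msgids = {}   # inverse map: token -> pickled set of msgids

def add_message(msgid, tokens):
    """Record a message in both maps (hypothetical helper)."""
    bag = sorted(set(tokens))
    msgid_to_tokens[msgid] = pickle.dumps(bag)
    for token in bag:
        raw = token_to_msgids.get(token)
        ids = pickle.loads(raw) if raw is not None else set()
        ids.add(msgid)
        token_to_msgids[token] = pickle.dumps(ids)

def tokens_for(msgid):
    """Forward lookup: the tokens recorded for one message."""
    return pickle.loads(msgid_to_tokens[msgid])

def msgids_for(token):
    """Inverse lookup: the messages that contained a token."""
    raw = token_to_msgids.get(token)
    return pickle.loads(raw) if raw is not None else set()

add_message(b"<1@example>", [b"free", b"money", b"free"])
add_message(b"<2@example>", [b"free"])
```

Note that the inverse map rewrites a whole pickled set on every update,
which may get expensive for very common tokens.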

-Barry




