[Spambayes] Moving between i686 and x86_64 'systems'

Tim Peters tim.peters at gmail.com
Thu Apr 20 12:57:38 CEST 2006


[Tim]
>> That too:  ZODB uses Python pickles to store object state, and
>> inherits platform-independence from that.  ZODB's FileStorage format
>> is also platform-independent.  If we're using "I" flavors of ZODB
>> BTrees,

[Tony]
> I don't think we are.  wordinfo, which I believe is the only BTree,
> is an OOBTree [token to (hamcount, spamcount)].  The ZODB storage is
> fairly rough, since I'm really only just learning how to properly use
> ZODB.

SpamBayes was originally designed with ZODB in mind, so there are
secret pressures ensuring that you'll succeed :-)

> Would two OIBTrees (token to count) be more efficient than a
> single OOBTree?  (And do we care if it would be?)

Only way to know for sure is to try both and measure, but I wouldn't
bother.  `wordinfo` _conceptually_ maps a string to a pair of
integers, so an OOBTree is the clearest implementation.  It's also
_probably_ the fastest:  two lookup operations per string potentially
means twice as much disk I/O to traverse two distinct BTree
structures.
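
To make the two layouts concrete, here is a minimal sketch, not
SpamBayes's actual code.  The tokens, counts, and helper names
(`counts_single`, `counts_split`) are made up for illustration, and the
fallback to plain dicts is only there so the sketch runs where the
BTrees package isn't installed.

```python
try:
    from BTrees.OOBTree import OOBTree
    from BTrees.OIBTree import OIBTree
except ImportError:
    # Assumption for illustration: plain dicts stand in for the trees
    # when BTrees is unavailable; they share the mapping calls used here.
    OOBTree = OIBTree = dict

SAMPLE = [("viagra", (1, 250)), ("python", (300, 2))]  # made-up counts

# Layout 1: a single tree, token -> (hamcount, spamcount).
wordinfo = OOBTree()
for token, pair in SAMPLE:
    wordinfo[token] = pair

def counts_single(token):
    # One lookup returns both counts.
    return wordinfo.get(token, (0, 0))

# Layout 2: two trees, each token -> count.
hamcounts = OIBTree()
spamcounts = OIBTree()
for token, (ham, spam) in SAMPLE:
    hamcounts[token] = ham
    spamcounts[token] = spam

def counts_split(token):
    # Two lookups, one per tree, to get the same pair.
    return hamcounts.get(token, 0), spamcounts.get(token, 0)
```

Both helpers return the same pair for a given token; the split layout
just needs a second tree traversal to get there.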

BTW, ZODB has an in-memory object cache, to avoid disk I/O for objects
already fetched from disk.  The default cache size is something like
400 (objects), which is much smaller than SpamBayes could make good
use of.  Specifying cache_size=10000 would be a better starting point
("the objects" stored by SB are relatively tiny, while the default is
geared more to use in Zope, where "an object" is typically much
larger).
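
A sketch of what that might look like when opening the database.  The
helper name `open_wordinfo_db` and the idea of passing a file path are
assumptions for illustration, not how SpamBayes necessarily opens its
storage; `cache_size` is the per-connection object-count target that
`ZODB.DB` accepts.

```python
def open_wordinfo_db(path, cache_size=10000):
    """Open a FileStorage-backed ZODB database with a larger object cache.

    `path` and this helper are hypothetical; only the cache_size keyword
    comes from the discussion above.
    """
    # Imported lazily so merely defining the helper doesn't require ZODB.
    from ZODB import DB
    from ZODB.FileStorage import FileStorage

    storage = FileStorage(path)
    # cache_size counts objects, not bytes.
    return DB(storage, cache_size=cache_size)
```

Whether 10000 is the right number is still something to measure; it is
only a more plausible starting point than the default for lots of tiny
objects.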

>> a bug in those makes it possible to lose information silently
>> on a 64-bit box when storing integers that don't actually fit in 32
>> bits:
>>
>>     http://www.zope.org/Collectors/Zope/1592
>>
>> I strongly doubt SpamBayes tickles that bug, though.

> Well, it could when someone has 4.3 billion ham or spam trained,
> right? ;)

Yes ;-)  If we use an OOBTree, that bug doesn't arise, and there's no
inherent limit on integer sizes then (if they overflow to Python
unbounded ints, an OOBTree is just as happy with those as with
"little" ints).
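
The effect can be imitated with the standard library alone.  This is
only an illustration of silent 32-bit truncation using `ctypes`, not the
BTrees code path from the collector issue; the count is a made-up value
just past 2**32.

```python
import ctypes
import pickle

count = 4300000000  # hypothetical training count, a bit more than 2**32

# What survives if the value is squeezed into a 32-bit C int without a
# range check: the high bits are dropped and nothing complains.
stored = ctypes.c_int32(count).value

# An object-valued tree stores values as pickled Python objects, and a
# pickle round trip keeps the full integer whatever its size.
pair = pickle.loads(pickle.dumps((count, count)))

print(count, stored, pair)
```

The truncated value is simply wrong rather than an error, which is what
makes the "I" flavor bug easy to miss.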

