[Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.7,1.8

Tim Peters tim_one@email.msn.com
Wed, 21 Aug 2002 00:50:39 -0400

[Eric S. Raymond]
> VM, dude, VM is your friend.  I thought this through carefully.  The
> size of bogofilter's working set isn't limited by core.

Do you expect that to be an issue?  When I built a database from 20,000
messages, the whole thing fit in a Python dict consuming about 10MB.  That's
all.  As is always the case in this kind of thing, about half the tokens are
utterly useless since they appear only once in only one message (think of
things like misspellings and message ids here -- about half the tokens
generated will be "obviously useless", although the useless won't become
obvious until, e.g., a month passes and you never seen the token again).  In
addition, this is a statistical-sampling method, and 20,000 messages is a
very large sample.  I concluded that, in practice, and since we do have ways
to identify and purge useless tokens, 5MB is a reasonable upper bound on the
size of this thing.  It doesn't fit in my L2 cache, but I'd need a
magnifying glass to find it in my RAM.

> And because it's a B-tree variant, the access frequency will be
> to log2 of the wordlist size

I don't believe Judy is faster than Python string-keyed dicts at the sizes
I'm expecting (I've corresponded with Douglas about that too, and his timing
data has a hard time disagreeing <wink>).

> and the patterns will be spatially bursty.

Why?  Do you sort tokens before looking them up?  Else I don't see a reason
to expect that, from one lookup to the next, the paths from root to leaf
will enjoy spatial overlap beyond the root node.

> This is a memory access pattern that plays nice with an LRU pager.

Well, as I said before, all the evidence I've seen says the scoring time for
a message is near-trivial (including the lookup times), even in pure Python.
It's only the parsing that I'm still worried about, and I may yet confess a
bad case of Flex envy.

> I'm working on a simpler solution, one which might have a Pythonic
> Stay tuned.

I figure this means something simpler than a BTree under ZODB.  If so, you
should set yourself a tougher challenge <wink>.