[Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.7,1.8

Eric S. Raymond esr@thyrsus.com
Wed, 21 Aug 2002 01:35:56 -0400


Tim Peters <tim_one@email.msn.com>:
> [Eric S. Raymond]
> > VM, dude, VM is your friend.  I thought this through carefully.  The
> > size of bogofilter's working set isn't limited by core.
> 
> Do you expect that to be an issue?  When I built a database from 20,000
> messages, the whole thing fit in a Python dict consuming about 10MB.

Hm, that's a bit smaller than I would have thought, but the order of 
magnitude I was expecting.

>                                                                     That's
> all.  As is always the case in this kind of thing, about half the tokens are
> utterly useless since they appear only once in only one message (think of
> things like misspellings and message ids here -- about half the tokens
> generated will be "obviously useless", although the useless won't become
> obvious until, e.g., a month passes and you never seen the token again). 

Recognition features should age!  Wow!  That's a good point!  With the age
counter being reset when they're recognized.

> > and the patterns will be spatially bursty.
> 
> Why?  Do you sort tokens before looking them up? 

I thought part of the point of the method was that you get sorting for free
because of the way elements are inserted.

>                                             Else I don't see a reason
> to expect that, from one lookup to the next, the paths from root to leaf
> will enjoy spatial overlap beyond the root node.

No, but think about how the pointer in a binary search moves.  It's
spatially bursty, Memory accesses frequencies for repeated binary
searches will be a sum of bursty signals, analogous to the way
network traffic volumes look in the time domain.  In fact the
graph of memory adress vs. number of accesses is gonna win up
looking an awful lot like 1/f noise, I think.  *Not* evenly
distributed; something there for LRU to weork with.

> > I'm working on a simpler solution, one which might have a Pythonic spinoff.
> > Stay tuned.
> 
> I figure this means something simpler than a BTree under ZODB.  If so, you
> should set yourself a tougher challenge <wink>.

What I'm starting to test now is a refactoring of the program where it
spawn a daemon version of itself first time it's called.  The daemon
eats the wordlists and stays in core fielding requests from subsequent
program runs.  Basically an answer to "how you call bogofilter 1K
times a day from procmail without bringing your disks to their knees"
problem" -- persistence on the cheap.

Thing is that the solution to this problem is very generic.  Might
turn into a Python framework.
-- 
		<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>