[Python-Dev] Re: [Python-checkins]
Tue, 20 Aug 2002 19:57:04 -0400
[Eric S. Raymond]
> I'm in the process of speed-tuning this now. I intend for it to be
> blazingly fast, usable for sites that process 100K mails a day, and I
> think I know how to do that. This is not a natural application for
> Python :-).
I'm not sure about that. The all-Python version I checked in added 20,000
Python-Dev messages to the database in 2 wall-clock minutes. The time for
computing the statistics, and for scoring, is simply trivial (this wouldn't
be true of a "normal" Bayesian classifier (NBC), but Graham skips most of
the work an NBC does, in particular favoring fast classification time over
fast model-update time).
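[For readers who haven't seen it: the scoring step being called trivial is a minimal sketch like the following, per Graham's "A Plan for Spam" -- the function name, arguments, and count dictionaries here are my own illustration, not the checked-in code. Each token gets a spam probability from simple frequency ratios, and only the 15 most extreme tokens are combined, which is why classification is so cheap.]

```python
from math import prod

def spam_probability(tokens, spam_counts, ham_counts, nspam, nham):
    # Per-token spam probability, roughly as Graham describes:
    # p(t) = (bad/nspam) / (good/nham + bad/nspam), clamped to [0.01, 0.99].
    probs = []
    for t in set(tokens):
        b = spam_counts.get(t, 0) / max(nspam, 1)
        g = 2 * ham_counts.get(t, 0) / max(nham, 1)  # Graham doubles good counts
        if b + g:
            p = min(0.99, max(0.01, b / (b + g)))
        else:
            p = 0.4  # Graham's default for never-seen tokens
        probs.append(p)
    # Combine only the 15 tokens farthest from neutral (0.5):
    # the "interesting" tokens dominate, so scanning stops there.
    interesting = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:15]
    num = prod(interesting)
    return num / (num + prod(1 - p for p in interesting))
```

Note that updating the model is just incrementing counts; the per-token probabilities can be derived on demand, which is the classification-time/update-time tradeoff mentioned above.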
What we anticipate is that the vast bulk of the time will end up getting
spent on better tokenization, such as decoding base64 portions, and giving
special care to header fields and URLs. I also *suspect* (based on a
previous life in speech recognition) that experiments will show that a
mixture of character n-grams and word bigrams is significantly more
effective than a "naive" tokenizer that just looks for US ASCII alphanumeric
runs.
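[To make the comparison concrete, here is a toy sketch of what mixing those feature types might look like -- this is my illustration of the idea, not the tokenizer under discussion, and the n-gram width is an arbitrary choice. The naive baseline is just the alphanumeric-run scan; the richer version adds word bigrams and raw character n-grams, the latter of which survive obfuscations like "v.i.a.g.r.a" that defeat word-level tokenizers.]

```python
import re

def tokenize(text, n=5):
    # Baseline "naive" tokenizer: US ASCII alphanumeric runs.
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    tokens = list(words)
    # Word bigrams: adjacent word pairs as single features,
    # so "human growth" carries evidence "human" alone does not.
    tokens += [f"{a} {b}" for a, b in zip(words, words[1:])]
    # Character n-grams over the (whitespace-squashed) raw text.
    squashed = re.sub(r"\s+", " ", text.lower())
    tokens += [squashed[i:i + n] for i in range(len(squashed) - n + 1)]
    return tokens
```

Whether the extra features actually pay for their cost in database size and tokenization time is exactly the kind of thing the experiments would have to show.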
>> """My finding is that it is _nowhere_ near sufficient to have two
>> populations, "spam" versus "not spam."
> Well, except it seems to work quite well. The Nigerian trigger-word
> population is distinct from the penis-enlargement population, but they
> both show up under Bayesian analysis.
In fact, I'm going to say "Nigerian" and "penis enlargement" one more time
each here, just to demonstrate that *this* message won't be a false positive
when the smoke settles <wink>. Human Growth Hormone too, while I'm at it.