
[Eric S. Raymond]
I'm in the process of speed-tuning this now. I intend for it to be blazingly fast, usable for sites that process 100K mails a day, and I think I know how to do that. This is not a natural application for Python :-).
I'm not sure about that. The all-Python version I checked in added 20,000 Python-Dev messages to the database in 2 wall-clock minutes. The time for computing the statistics, and for scoring, is simply trivial (this wouldn't be true of a "normal" Bayesian classifier (NBC), but Graham skips most of the work an NBC does, in particular favoring fast classification time over fast model-update time).

What we anticipate is that the vast bulk of the time will end up getting spent on better tokenization, such as decoding base64 portions and giving special care to header fields and URLs. I also *suspect* (based on a previous life in speech recognition) that experiments will show that a mixture of character n-grams and word bigrams is significantly more effective than a "naive" tokenizer that just looks for US ASCII alphanumeric runs.
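To make the tokenizer distinction concrete, here's roughly the split I have in mind. This is only a sketch with invented names, not what's checked in:

    import re

    # The "naive" tokenizer: US ASCII alphanumeric runs, nothing else.
    _word_re = re.compile(r"[A-Za-z0-9]+")

    def naive_tokenize(text):
        return _word_re.findall(text.lower())

    def mixed_tokenize(text, n=3):
        # Sketch of the mixture I suspect will do better: character
        # n-grams within each word, plus word bigrams across words.
        words = naive_tokenize(text)
        tokens = []
        for w in words:
            # Character n-grams catch simple obfuscations ("v1agra").
            for i in range(max(1, len(w) - n + 1)):
                tokens.append(w[i:i + n])
        # Word bigrams catch multi-word cues ("act now", "free offer").
        for w1, w2 in zip(words, words[1:]):
            tokens.append(w1 + " " + w2)
        return tokens

Whether the extra tokens actually buy accuracy is exactly what the experiments have to settle; they certainly multiply the work per message, which is another reason to expect tokenization to dominate the runtime.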
"""My finding is that it is _nowhere_ near sufficient to have two populations, "spam" versus "not spam."
Well, except it seems to work quite well. The Nigerian trigger-word population is distinct from the penis-enlargement population, but they both show up under Bayesian analysis.
In fact, I'm going to say "Nigerian" and "penis enlargement" one more time each here, just to demonstrate that *this* message won't be a false positive when the smoke settles <wink>. Human Growth Hormone too, while I'm at it.
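For anyone wondering why two buckets hold up, it may help to look at what the scoring step does with the per-token spamprobs. This is just a toy illustration of Graham-style combining, with made-up numbers, not the code in the checkin:

    def graham_combine(probs):
        # Graham-style combining of per-token spam probabilities:
        # P = prod(p) / (prod(p) + prod(1 - p)), taken over the most
        # extreme tokens found in the message.
        p = q = 1.0
        for x in probs:
            p *= x
            q *= 1.0 - x
        return p / (p + q)

    # Made-up spamprobs for each population's trigger words.
    nigerian = [0.99, 0.98, 0.97]
    enlargement = [0.99, 0.99, 0.95]
    print(graham_combine(nigerian))     # ~1.0
    print(graham_combine(enlargement))  # ~1.0

Either vocabulary alone drives the combined score to the spam end; the scheme doesn't need the two populations to share a single word.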