speed of spambayes?
tim.one at comcast.net
Mon Dec 1 02:27:48 CET 2003
> Can someone using spambayes tell me about how fast it runs?
Well, there are many ways to use spambayes. In most of them, the bottleneck
now appears to be the speed of the specific database you use. In early
large-scale tests (using classifiers trained on tens of thousands of ham
and spam), we used a giant in-memory Python dict to hold the statistics,
and it scored about 80 messages/second (wall-clock time, including
producing histograms and statistical analyses). I'm sure it's slower now,
as layers of indirection were introduced to allow using a disk-based
database instead, and the bsddb3 backend appears especially slow
(although spambayes still keeps an in-memory dict cache on top of that,
which should help a lot if you're scoring many messages in a single run).
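
To illustrate why a dict cache helps when scoring many messages in one run,
here is a rough sketch. It is not the actual spambayes code: the class name
CachedStore, the use of shelve as the disk-based backend, and the token
counts are all assumptions made up for the example. The point is only that
repeated lookups of the same token hit the slow backend once.

```python
import os
import shelve
import tempfile


class CachedStore:
    """In-memory dict cache over a slower mapping (hypothetical sketch)."""

    def __init__(self, backend):
        self.backend = backend      # any mapping with a .get() method
        self.cache = {}             # token -> value already fetched
        self.backend_reads = 0      # how often we touched the backend

    def get(self, token):
        # Serve repeated lookups from memory.
        if token in self.cache:
            return self.cache[token]
        # First sighting of this token: go to the backend once.
        # A missing token is cached as None, so it is not re-read either.
        self.backend_reads += 1
        value = self.backend.get(token)
        self.cache[token] = value
        return value


def demo():
    """Look up four tokens (two distinct) and count backend reads."""
    with tempfile.TemporaryDirectory() as tmp:
        with shelve.open(os.path.join(tmp, "stats")) as db:
            # Made-up (ham_count, spam_count) pairs, for illustration only.
            db["python"] = (300, 2)
            db["viagra"] = (1, 250)
            store = CachedStore(db)
            for token in ["python", "viagra", "python", "python"]:
                store.get(token)
            return store.backend_reads


if __name__ == "__main__":
    print(demo())
```

Common tokens show up in message after message, so the longer the run, the
larger the share of lookups that never reach the disk.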
> I'm using Spamassassin right now but it takes around 1.5 seconds to
> process a message on a 2 GHz Athlon.
The "80 msgs/sec" above was on an 866 MHz P3.
> I believe part of that time is spent doing network lookups to check the
> source addresses against various spam blacklists.
Probably, but you can disable those (or so I've been told by SpamAssassin
folks -- never used it myself).
> I want to crunch through several gigabytes of spam folders to see if
> any legitimate messages got trapped, so need a fast classifier with a
> low false negative rate (it's ok if the false positive rate isn't so
> low, since almost all the messages in these folders are already spam).
SpamBayes knows nothing about ham and spam out of the box: the only things
it knows are what you teach it, so the quality of results depends very much
on the quality of training. Strive to train on an approximately equal
number of ham and spam, to avoid straining the assumptions underlying the
math. Then you should get good results after training on (just) several
hundred of each. The third graph on
shows a typical score distribution. Sort your messages by score, and look
through those scoring up to about 0.60 for misclassified ham. Note that
SpamBayes is a 3-way classifier: Ham, Spam, and "based on how I've been
trained, this message presents such contradictory evidence I refuse to
guess". As the graph shows, the last group of messages tends to score
around 0.5.
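
The review procedure described above could be sketched roughly as follows.
This is not spambayes code: the function names are hypothetical, the input
is assumed to be (message_id, score) pairs you have already produced with
whatever front end you use, and the ham/spam cutoffs of 0.20 and 0.90 are
assumed values for illustration. Only the 0.60 review limit comes from the
advice above.

```python
HAM_CUTOFF = 0.20     # assumed: scores below this are called ham
SPAM_CUTOFF = 0.90    # assumed: scores at or above this are called spam
REVIEW_LIMIT = 0.60   # from the advice above: inspect scores up to here


def classify(score):
    """Map a score in [0.0, 1.0] to one of the three outcomes."""
    if score < HAM_CUTOFF:
        return "ham"
    if score >= SPAM_CUTOFF:
        return "spam"
    return "unsure"


def review_candidates(scored, limit=REVIEW_LIMIT):
    """Return (message_id, score) pairs scoring <= limit, lowest first.

    These are the messages in a spam folder worth checking by hand
    for misclassified ham.
    """
    keep = [item for item in scored if item[1] <= limit]
    return sorted(keep, key=lambda item: item[1])


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    scored = [("a", 0.95), ("b", 0.03), ("c", 0.52), ("d", 0.61), ("e", 0.60)]
    for message_id, score in review_candidates(scored):
        print(message_id, score, classify(score))
```

Sorting lowest-first puts the most ham-like messages at the top of the
list, so if you run out of patience you have at least checked the likeliest
mistakes.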