[Spambayes] Filtering

Tue Jul 13 18:11:44 CEST 2004

Stuart Droker wrote:
> Sometimes, my inbox gets cluttered with Spam.  Most of the time the filter
> seems to work well.  In plain language, do I use a higher number to tighten
> the controls?  What #'s do you recommend for Spam and for Good?

The default cutoffs are set to what we "recommend" as a good starting point,
but adjusting the numbers really depends on your personal mix of e-mail.  I
have my thresholds set to 15 for unsure and 60 for spam, but that doesn't
mean I would recommend those numbers for everyone.

If you want more spam to be filtered to your Spam folder then you need to
use a *lower* number for the Certain Spam score.  The tradeoff is that you
*increase* the chance that a good message gets filtered to spam and not just
unsure.
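
To make the direction of the adjustment concrete, here is a rough Python
sketch of how the two cutoffs divide up the scores.  It is only an
illustration, not the actual SpamBayes code: the classify function is
hypothetical, the default cutoffs are just my own 15 and 60 from above, and
I'm assuming a score exactly equal to a cutoff lands in the higher category.

```python
def classify(score, unsure_cutoff=15, spam_cutoff=60):
    """Map a 0-100 spam score to a category (hypothetical helper)."""
    # Assumption: a score equal to a cutoff falls in the higher category.
    if score >= spam_cutoff:
        return "spam"
    if score >= unsure_cutoff:
        return "unsure"
    return "good"


# Made-up scores, purely for illustration.
scores = [5, 20, 65, 72, 95]

# A high Certain Spam cutoff sends fewer messages to the Spam folder...
strict = [classify(s, spam_cutoff=90) for s in scores]

# ...while a lower cutoff sends more of them there.
loose = [classify(s, spam_cutoff=60) for s in scores]

print(strict)
print(loose)
```

With the spam cutoff at 90, only the message scoring 95 goes to Spam.
Lowering it to 60 also catches the ones scoring 65 and 72.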

To figure out the best setting for your thresholds, you really need to look
at the spam scores that your messages are getting.  You can use "Show spam
clues" to get all the details about a single message, or you can display the
score in your Outlook view as described in Help / About SpamBayes.  Always
look at the score before you train on the message, because training will
cause the score to change from how SpamBayes saw it when the message first
arrived.

First look for the lowest scores given to spam messages and adjust the
Certain Spam score so that only a few anomalies fall into Unsure.  For
example, if the majority of your spam scores above 70 with only a few
messages below that, set your cutoff to 70.  Then look at the highest scores
given to good messages and make sure that the new cutoff you have chosen
gives you plenty of margin for error to prevent false positives.
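
If you would rather work this out from a list of scores than by eye,
something like the following sketch could do it.  The function names, the
sample scores, and the idea of tolerating a fixed number of low-scoring spam
as "anomalies" are all my own assumptions for the example; nothing here is
built into SpamBayes.

```python
def suggest_spam_cutoff(spam_scores, allowed_below=2):
    """Return the lowest spam score once a few anomalies are set aside.

    allowed_below is the number of low-scoring spam messages you are
    willing to let fall into Unsure (an assumed knob, not a SpamBayes one).
    """
    if not spam_scores:
        raise ValueError("need at least one spam score")
    ordered = sorted(spam_scores)
    # Skip the lowest few scores, but never run past the end of the list.
    index = min(allowed_below, len(ordered) - 1)
    return ordered[index]


def margin_for_good(good_scores, spam_cutoff):
    """Gap between the cutoff and the highest-scoring good message."""
    if not good_scores:
        raise ValueError("need at least one good score")
    return spam_cutoff - max(good_scores)


# Made-up scores, noted before training on the messages.
spam_scores = [35, 48, 70, 74, 81, 88, 93, 97, 99, 100]
good_scores = [0, 1, 3, 8, 22]

cutoff = suggest_spam_cutoff(spam_scores, allowed_below=2)
margin = margin_for_good(good_scores, cutoff)
print(cutoff, margin)
```

Here the two lowest spam scores are treated as anomalies, so the suggested
cutoff is 70, which leaves a margin of 48 points above the highest-scoring
good message.  A small or negative margin would be a sign to pick a higher
cutoff.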

If you want, you can also adjust the Unsure cutoff similarly.  Reducing the
Unsure cutoff will cause more low-scoring messages to be pushed into the
Unsure category instead of being treated as good.

Above all, continue to train on the messages that aren't classified
correctly and over time the accuracy of the SpamBayes filter should get
better and better.  If you get a lot more spam than you do good messages,
just be careful not to let your training get too "unbalanced".  Training on
many times more spam messages than good messages can make the accuracy
worse.  We're working on ideas for how the filter can help keep your
training in balance, but for now it is a manual process.
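
Until then, a manual check can be as simple as comparing the two training
counts.  This is just a sketch: the function names are hypothetical, and the
limit of 4 spam per good message is an arbitrary number I picked for the
example, not an official recommendation.

```python
def training_ratio(num_spam, num_good):
    """Spam messages trained per good message trained."""
    if num_good == 0:
        # No good messages yet: any spam training is maximally lopsided.
        return float("inf") if num_spam else 1.0
    return num_spam / num_good


def is_unbalanced(num_spam, num_good, max_ratio=4.0):
    """True when spam training outweighs good training by more than
    max_ratio (an assumed limit; choose whatever suits your mail)."""
    return training_ratio(num_spam, num_good) > max_ratio


print(is_unbalanced(1000, 100))
print(is_unbalanced(300, 100))
```

The first call reports an imbalance (10 spam per good message) and the
second does not (3 per good message).  If the check trips, train on some
more good messages before adding more spam.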

If SpamBayes ever fails to identify a message when it seems the correct
classification should be obvious, just post a copy of the "Show spam clues"
output to the mailing list.  Someone can probably help identify aspects of
your training data that may be throwing SpamBayes off.

-- 
Kenny Pitt