[Spambayes] More ham than spam?

Kenny Pitt kennypitt at hotmail.com
Mon Aug 30 18:39:31 CEST 2004


Ferino Mardo wrote:
> The SPAMbayes manager complains that I have much more ham than spam.
> What should one do? Delete his good emails to make things even?

We hear this question a lot, but most people find that they have too much
*spam* and not enough ham.  Ham messages typically have a more consistent
set of senders, receivers, and topics, and therefore usually require less
training to identify correctly than spam messages.

Did you have SpamBayes train itself on some of your existing messages when
you first configured it?  If so, you probably had a lot more ham messages
than spam in your initial training set.

If you are getting acceptable accuracy from SpamBayes then don't worry too
much about the warning.  It's only a guideline, and how much effect the
imbalance has will depend on how severe it is as well as on your specific
mixture of e-mails.

On the other hand, if your accuracy is poor then I would recommend deleting
your training data and retraining SpamBayes from scratch with no initial
training data.  Instead, just train manually on any Unsure messages as well
as messages that SpamBayes identifies incorrectly (ham classified as spam or
vice versa).  We usually refer to this training strategy as "Train on Errors
and Unsures", and you can read more about it on the SpamBayes wiki:

http://entrian.com/sbwiki/TrainOnErrorsAndUnsures
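
To make the idea concrete, here is a rough Python sketch of the loop that
strategy describes.  It is only an illustration under stated assumptions:
ToyClassifier is a made-up stand-in and not the SpamBayes API or its scoring
method, and the 0.20 / 0.90 cutoffs are assumed values, so check your own
configuration.

```python
import math
from collections import Counter

# Assumed cutoffs for this sketch; your configuration may differ.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


class ToyClassifier:
    """Hypothetical stand-in for a real filter: a smoothed word-count scorer."""

    def __init__(self):
        self.ham_words = Counter()
        self.spam_words = Counter()
        self.nham = 0
        self.nspam = 0

    def train(self, text, is_spam):
        # Count each distinct word once per message.
        words = set(text.lower().split())
        if is_spam:
            self.spam_words.update(words)
            self.nspam += 1
        else:
            self.ham_words.update(words)
            self.nham += 1

    def score(self, text):
        """Return a spam score in [0, 1]; 0.5 means no evidence either way."""
        log_odds = 0.0
        for w in set(text.lower().split()):
            if w not in self.spam_words and w not in self.ham_words:
                continue  # never-seen words carry no evidence
            p_spam = (self.spam_words[w] + 0.5) / (self.nspam + 1)
            p_ham = (self.ham_words[w] + 0.5) / (self.nham + 1)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-30.0, min(30.0, log_odds))  # keep exp() in range
        return 1.0 / (1.0 + math.exp(-log_odds))


def classify(score):
    """Map a score to 'ham', 'spam', or 'unsure' using the assumed cutoffs."""
    if score >= SPAM_CUTOFF:
        return "spam"
    if score <= HAM_CUTOFF:
        return "ham"
    return "unsure"


def train_on_errors_and_unsures(clf, labelled_messages):
    """Train only on messages the classifier got wrong or was unsure about.

    labelled_messages is an iterable of (text, is_spam) pairs, where is_spam
    is the user's own judgement.  Returns how many messages were trained on.
    """
    trained = 0
    for text, is_spam in labelled_messages:
        verdict = classify(clf.score(text))
        expected = "spam" if is_spam else "ham"
        if verdict != expected:  # covers both mistakes and unsures
            clf.train(text, is_spam)
            trained += 1
    return trained
```

Starting from an empty classifier, the first messages all score as unsure
and get trained; once similar messages are classified correctly they are
skipped, which is what keeps the training set small.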

You can also get more information about alternative training strategies
here:

http://entrian.com/sbwiki/TrainingIdeas

-- 
Kenny Pitt


