[Spambayes] Ham:Spam ratio
tameyer at ihug.co.nz
Thu Feb 19 17:56:44 EST 2004
> My questions are: Should I expect that 39% of my mail is
> unclassified, or should it be less (or more)?
If by "unclassified" you mean "classified as unsure", then no. 2-5% is
> If I retrain to get a ratio of 1:2 or better, will that
> decrease the amount of unclassified mail?
Having a ratio close to 1:1 (even 5:1 or 1:5 is probably ok - 1:121 is way
too much) will help classification, yes.
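
As a rough way to check where a training set stands, something like the sketch below would do. The function name and the `max_skew` parameter are made up for illustration and are not SpamBayes options; the 5.0 default is just the "5:1 or 1:5 is probably ok" rule of thumb from above.

```python
def ratio_is_balanced(n_ham, n_spam, max_skew=5.0):
    """Return True if the ham:spam training counts are within max_skew:1
    in either direction. The threshold is a rule of thumb, not a hard limit."""
    if n_ham == 0 or n_spam == 0:
        # With nothing trained on one side, the ratio is undefined.
        return False
    skew = max(n_ham, n_spam) / min(n_ham, n_spam)
    return skew <= max_skew
```

With 1 ham and 121 spam trained this returns False; with 100 of each, or with 5:1 either way, it returns True.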
> If I retrain, how often should I retrain to keep the ratio proper?
There isn't really an accepted training methodology as yet. The wiki
(http://entrian.com/sbwiki) has a *lot* of stuff about training, if you want
to read it. If you don't <wink>, then I'd suggest:
Training on mistakes only - i.e. get rid of your existing training data,
so that everything is unsure (or if you're using the plug-in and there's
still the 5+5 minimum, then go down to that).
Then train on all mail classified as unsure, all good mail classified as
spam, and all spam classified as good mail.
Your database will stay small, and you'll end up with pretty good results,
fairly quickly. For the most part, the ratio tends to stay roughly even
here, too, for various reasons.
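
If it helps to see the idea spelled out, here is a minimal sketch of train-on-mistakes. It does not use the SpamBayes code: `ToyClassifier` is a stand-in word-count scorer, the cutoff values are assumptions chosen for the example, and all the names are hypothetical. The only point is the loop at the bottom, which trains on a message only when it came out unsure or wrong.

```python
from collections import Counter

# Assumed thresholds for this example only.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


class ToyClassifier:
    """Stand-in scorer for illustration; NOT the SpamBayes classifier."""

    def __init__(self):
        self.ham = Counter()
        self.spam = Counter()

    def train(self, text, is_spam):
        # Count each distinct word once per trained message.
        target = self.spam if is_spam else self.ham
        target.update(set(text.lower().split()))

    def score(self, text):
        # Average per-word spam fraction; words never seen are ignored.
        fractions = []
        for word in set(text.lower().split()):
            s, h = self.spam[word], self.ham[word]
            if s + h:
                fractions.append(s / (s + h))
        # With no known words there is no evidence either way.
        return sum(fractions) / len(fractions) if fractions else 0.5


def label(score):
    if score < HAM_CUTOFF:
        return "ham"
    if score > SPAM_CUTOFF:
        return "spam"
    return "unsure"


def train_on_mistakes(clf, stream):
    """stream yields (text, is_spam) pairs in arrival order.
    Train only on messages that were unsure or misclassified;
    return how many messages were trained."""
    trained = 0
    for text, is_spam in stream:
        verdict = label(clf.score(text))
        correct = "spam" if is_spam else "ham"
        if verdict != correct:
            clf.train(text, is_spam)
            trained += 1
    return trained
```

Starting from an empty classifier, the first few messages all score as unsure and get trained; after that, only the ones it gets wrong are added, which is why the database stays small.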
> It's probably not the most elegant solution, but I just
> manually move some of my 'ham' (from the inbox, even though
> it was classified properly as ham) into the 'unsure' folder.
> Then highlight these good messages and click "Recover from Spam".
You'll probably find it's better to train on *less spam* than on *more ham*.
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.