[Spambayes] How to cure extreme disproportion of spam to ham??
tameyer at ihug.co.nz
Thu Feb 26 18:52:50 EST 2004
> Spambayes manager warns me that I have spam:ham
> disproportion, 454:3365 to be exact. I cannot figure out what
> to do about this.
Firstly, if you're still getting results that you're happy with, you don't
really have to do anything. However, if your results could be better, then
read on :)
> I get a ton of spam and the program puts most of
> it in the "Suspects" folder.
This does sound like a problem. Numbers will vary from email mix to email
mix, but having around 2-5% of your mail end up in the "Suspects" folder
would be reasonable. "Most" doesn't sound reasonable at all. This could
well be a result of the imbalance, though.
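To put rough numbers on this, here is a small Python sketch. It is only arithmetic for illustration: the helper names `training_ratio` and `unsure_rate` are made up here and are not part of SpamBayes.

```python
def training_ratio(n_spam, n_ham):
    """Return how many times larger the bigger training class is."""
    smaller, larger = sorted((n_spam, n_ham))
    if smaller == 0:
        return float("inf")
    return larger / smaller


def unsure_rate(n_unsure, n_total):
    """Fraction of all classified mail that landed in the unsure folder."""
    if n_total == 0:
        return 0.0
    return n_unsure / n_total


# The 454:3365 counts are the ones quoted above.
ratio = training_ratio(454, 3365)
print(f"training imbalance: about {ratio:.1f} to 1")
```

With the counts quoted above, the larger class is roughly seven times the size of the smaller one. `unsure_rate` can be compared against the 2-5% range in the same way.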
> How do I fix this imbalance?
Here are a couple of suggestions. There is a lot more about training on the
wiki <http://entrian.com/sbwiki>, and training is not at all an exact science:
1. Try running with a smaller database. Dump the one you have (perhaps
just rename it so that you can go back to it if you want to), train on 5
random hams and spams to get you going, and then train only on whatever ends
up in the "Suspects" folder, plus any ham in the spam folder and any spam in
the ham folder. If it starts getting imbalanced, then try training on just
*some* of the messages in the "Suspects" folder, and moving/deleting the rest
(or rescoring the "Suspects" folder, although this isn't as convenient as it
could be, and seeing if the remaining messages are now correctly
classified).
2. If you almost never see a ham in the "Suspects" folder, and never see
one over (e.g.) 80%, then lower the spam threshold. I think a lot of people
run with it at 80% anyway, although IIRC the default is still 90%.
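The two suggestions above can be sketched in Python. This is an illustration only, not SpamBayes code: the names `classify`, `safe_to_lower`, and `pick_for_training`, the 0.20 ham cutoff, the handling of scores exactly on a cutoff, and the `max_ratio` of 2 are all assumptions of mine. Scores are written as fractions, so 0.90 stands for the 90% threshold.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam score in [0, 1] to a folder (cutoff values are assumed)."""
    if score > spam_cutoff:
        return "spam"
    if score < ham_cutoff:
        return "ham"
    return "unsure"


def safe_to_lower(ham_scores, new_spam_cutoff):
    """True if no ham you have seen scored at or above the proposed cutoff."""
    return all(score < new_spam_cutoff for score in ham_scores)


def pick_for_training(labelled, n_ham, n_spam, max_ratio=2.0):
    """Pick hand-sorted unsure messages to train on without unbalancing.

    labelled is a list of (message_id, "ham" or "spam") pairs.
    A message is skipped when training on it would push its class past
    max_ratio times the size of the other class (max_ratio is a guess).
    """
    chosen = []
    for msg_id, label in labelled:
        mine, other = (n_spam, n_ham) if label == "spam" else (n_ham, n_spam)
        if mine + 1 > max_ratio * max(other, 1):
            continue  # leave this one untrained; move or delete it instead
        chosen.append(msg_id)
        if label == "spam":
            n_spam += 1
        else:
            n_ham += 1
    return chosen


# A message scoring 0.85 is unsure at 90% but spam at 80%.
print(classify(0.85), classify(0.85, spam_cutoff=0.80))
```

`safe_to_lower` is the check from suggestion 2: if the highest-scoring ham you have ever found in "Suspects" is well under 80%, then dropping the threshold to 80% should not cost you any ham.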
Hope this helps.
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.