[Spambayes] Re: [spambayes-bugs] Spambayes repeatedly classifies messages from mailing list as SPAM despite multiple (20+) recoveries from spam folder

Brian Schwarz brian at brightrock.com
Thu Sep 4 12:13:27 EDT 2003

Meyer, Tony wrote:

> Do you have really unbalanced numbers of ham & spam?  For example,
> "cannot" is in 171 ham messages, but only 1 spam message - it really
> shouldn't get a score of 0.64.
> Spambayes works best trained with roughly equal numbers of ham & spam;
> we're still trying to come up with a good method of working with
> unbalanced training data.  At the moment there is an option (defaults
> to 'on' in the Outlook plug-in) that adjusts the scores for unbalanced
> mail.  It looks like this is what is happening here - because of the
> imbalance, a perfectly hammy word like "cannot" is getting a 0.64
> score.
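
To make the arithmetic behind that remark concrete, here is a minimal sketch of a Robinson-style per-word score, normalized by the size of each training set. It is an illustration only, not code from SpamBayes or from this thread. The function name `word_spam_prob`, the constants `s=0.45` and `x=0.5`, and the totals of 1000 ham and 100 spam (taken from the rough figures in the reply below) are all assumptions. It deliberately leaves out the imbalance-adjustment option mentioned above, so it does not reproduce the 0.64 score.

```python
def word_spam_prob(ham_count, spam_count, n_ham, n_spam, s=0.45, x=0.5):
    """Estimate how spammy a word is, given per-class message counts.

    ham_count / spam_count: messages of each class containing the word.
    n_ham / n_spam: total messages trained in each class.
    s, x: assumed strength and prior for rarely seen words (hypothetical
    defaults chosen for illustration).
    """
    # Normalize by class size so a larger ham corpus does not dominate.
    ham_ratio = ham_count / n_ham if n_ham else 0.0
    spam_ratio = spam_count / n_spam if n_spam else 0.0

    if ham_ratio + spam_ratio == 0.0:
        # Word never seen in training: fall back to the prior.
        return x

    raw = spam_ratio / (ham_ratio + spam_ratio)

    # Pull the raw estimate toward the prior x when evidence is thin.
    n = ham_count + spam_count
    return (s * x + n * raw) / (s + n)


if __name__ == "__main__":
    # "cannot": 171 ham and 1 spam message, with roughly 1000 ham / 100 spam trained.
    print(word_spam_prob(171, 1, 1000, 100))
```

Under these assumptions the word comes out well below 0.5, that is, hammy, which is consistent with the point that a 0.64 score for "cannot" must come from the imbalance adjustment rather than from the raw counts.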

OK, that makes sense.  I have ~1000 ham and only ~100 spam messages.  When I
was doing the training, I assumed that more data was preferable, and I had a
lot more stored examples of the good stuff.  I'll try your suggestions.
Even with that hiccup, the program has done a pretty good job out of the



