[Spambayes] Training on unusual ham - revisited

Seth Goodman sethg at GoodmanAssociates.com
Thu Feb 9 00:37:50 CET 2006

On Thursday, February 02, 2006 10:35 PM -0600, Bob Posert wrote:

> Back in
> <http://mail.python.org/pipermail/spambayes/2006-January/018702.html>,
> Tim Peters and I had a dialog about training on unusual ham -
> monthly messages from http://www.boldtype.com.  I just got another
> one and it scored 50% on the spam scale.  The clues follow - I'd
> really appreciate any help. Thanks, Bob
> Combined Score: 50% (0.5)
> Internal ham score (*H*): 1
> Internal spam score (*S*): 1
> # ham trained on: 1229
> # spam trained on: 20331
> 150 Significant Tokens

I couldn't help but notice that the ratio of trained spam to trained ham
is very high: 20331 spam to 1229 ham, or roughly 16.5:1.  While the
statistics _should_ still work properly in these cases, a number of
people have observed difficulties when the numbers of trained ham and
spam are very different.  I don't think anyone has a good explanation as
to why, nor is there any guaranteed "safe" ratio.  As a start, I'd
suggest no more than 2:1 in either direction, with maybe 5:1 as an outer
bound, but that's just a SWAG (sophisticated wild-ass guess).
Unfortunately, to test this you'd have to retrain.  Save your current
databases first, so you can revert if you don't like the results.
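
To put numbers on that rule of thumb, here is a minimal Python sketch
that computes the imbalance from the two training counts and compares it
against the 2:1 and 5:1 figures above.  The function names and the
labels it returns are hypothetical, made up for illustration; nothing
here is part of SpamBayes, and the thresholds are only the guesses from
this message, not tested limits.

```python
def training_ratio(n_ham, n_spam):
    """Return the larger training count divided by the smaller (>= 1.0)."""
    if n_ham <= 0 or n_spam <= 0:
        raise ValueError("both counts must be positive")
    return max(n_ham, n_spam) / min(n_ham, n_spam)


def describe_balance(n_ham, n_spam, suggested=2.0, outer=5.0):
    """Label the imbalance using the rule-of-thumb thresholds.

    suggested and outer default to the 2:1 and 5:1 guesses from this
    message; they are assumptions, not validated limits.
    """
    ratio = training_ratio(n_ham, n_spam)
    if ratio <= suggested:
        return "balanced"
    if ratio <= outer:
        return "lopsided"
    return "badly lopsided"


# Counts taken from Bob's clue report quoted above.
ratio = training_ratio(1229, 20331)
print(f"{ratio:.1f}:1 -> {describe_balance(1229, 20331)}")
# prints "16.5:1 -> badly lopsided"
```

With Bob's counts the ratio is far outside even the 5:1 outer bound,
which is why retraining on a more even mix seems worth a try.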

Seth Goodman
