[Spambayes] Training on unusual ham - revisited

Seth Goodman sethg at GoodmanAssociates.com
Thu Feb 9 00:37:50 CET 2006


On Thursday, February 02, 2006 10:35 PM -0600, Bob Posert wrote:

> Back in
>  http://mail.python.org/pipermail/spambayes/2006-January/018702.html
>  , Tim Peters and I had a dialog about training on unusual ham -
> monthly messages from http://www.boldtype.com.  I just got another
> one and it scored 50% on the spam scale.  The clues follow - I'd
> really appreciate any help. Thanks, Bob
>
>  Combined Score: 50% (0.5)
>  Internal ham score (*H*): 1
>  Internal spam score (*S*): 1
>
>  # ham trained on: 1229
>  # spam trained on: 20331
>  150 Significant Tokens

I couldn't help but notice that the ratio of trained spam to trained ham
is very high (20331 to 1229, roughly 16.5:1).  While the statistics
_should_ still work properly in such cases, a number of people have
observed difficulties when the numbers of trained ham and trained spam
are very different.  I don't think anyone has a
good explanation as to why, nor is there any guaranteed "safe" ratio.
As a start, I'd suggest no more than 2:1 in either direction, with maybe
5:1 as an outer bound, but that's just a SWAG (sophisticated wild-ass
guess).  For you to test this, you'd have to retrain, unfortunately.
Save your current databases first, so you can revert if you don't like
the results.
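
To put numbers on the ratio, here is a minimal sketch in Python.  The
function names and the three labels are made up for illustration and are
not part of SpamBayes; the 2:1 and 5:1 thresholds are just the guesses
above, not tested limits.

```python
def imbalance_ratio(nham, nspam):
    """Return the larger training count divided by the smaller one."""
    if nham <= 0 or nspam <= 0:
        raise ValueError("need at least one trained message of each kind")
    return max(nham, nspam) / min(nham, nspam)


def classify_imbalance(nham, nspam, target=2.0, outer=5.0):
    """Label a training set using the rough 2:1 / 5:1 guesses (assumptions)."""
    ratio = imbalance_ratio(nham, nspam)
    if ratio <= target:
        return "ok"
    if ratio <= outer:
        return "borderline"
    return "imbalanced"


# Counts taken from the clues quoted above.
nham, nspam = 1229, 20331
print(f"{imbalance_ratio(nham, nspam):.1f}:1 -> {classify_imbalance(nham, nspam)}")
# prints "16.5:1 -> imbalanced"
```

By that yardstick, with 1229 ham you would want somewhere around 2500
trained spam, and probably no more than about 6000.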

--
Seth Goodman


