[spambayes-dev] Spam Clues: <>< STOP! Looking for anti christianchristians

Kenny Pitt kennypitt at hotmail.com
Fri Jul 2 15:50:52 EDT 2004


The most likely cause is the imbalance in your training data. SpamBayes
is most accurate when you train on approximately the same number of ham
messages as spam messages. A ratio of up to 5 to 1 or so is probably
fine, but your ratio is currently about 44 to 1 towards spam (6168 spam
versus 140 ham), which will heavily bias all your results towards ham.
 
For example, the token "christianity" appears in 10 ham messages and 7
spam messages, roughly the same number. However, the spam probability of
that token is only 0.028, because the most basic component of the
statistics on which SpamBayes is based is the percentage of messages that
contain the token, not the raw count. This token appears in 10 out of 140
ham messages for a ham percentage of 7.14%, and it appears in 7 out of
6168 spam messages for a spam percentage of only 0.11%. The ham percentage
is almost 63 times larger than the spam percentage.
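
To make the arithmetic above concrete, here is a rough sketch of how a
per-token spam probability could be computed from those percentages. It is
not taken from the SpamBayes source: the function name is made up for
illustration, and the prior strength (0.45) and prior probability (0.5)
are assumptions about the default settings. With those assumptions the
result lines up with the 0.0281306 shown in the clues table.

```python
def token_spamprob(ham_count, spam_count, n_ham, n_spam,
                   prior_strength=0.45, prior_prob=0.5):
    """Estimate a token's spam probability from message counts.

    prior_strength and prior_prob are assumed values, not confirmed
    against any particular SpamBayes configuration.
    """
    n = ham_count + spam_count
    if n == 0:
        # A token never seen in training just gets the prior.
        return prior_prob

    # Fraction of ham / spam messages that contain the token.
    ham_ratio = ham_count / n_ham
    spam_ratio = spam_count / n_spam

    # Raw estimate based only on the two fractions.
    raw = spam_ratio / (ham_ratio + spam_ratio)

    # Pull the raw estimate towards the prior; the fewer messages the
    # token has been seen in, the stronger the pull.
    return (prior_strength * prior_prob + n * raw) / (prior_strength + n)


# 'christianity': 10 of 140 ham messages, 7 of 6168 spam messages.
print(round(token_spamprob(10, 7, 140, 6168), 7))

# The same counts with a balanced training set land close to neutral.
print(round(token_spamprob(10, 7, 140, 140), 3))
```

The second call shows why the ratio matters: with equal numbers of ham and
spam messages, the same 10-versus-7 counts would give a probability much
closer to 0.5 instead of a strong ham clue.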
 
With an imbalance this large, your best bet is probably to delete your
training data and train again from scratch. Try starting out without
feeding SpamBayes any existing messages for initial training, and then
train only on mistakes and unsures. If you see several similar-looking
spam messages in your unsure folder, train on only one of them and delete
the rest, so that you avoid training on too many spams.
 
-- 
Kenny Pitt
 


  _____  

From: spambayes-dev-bounces at python.org
[mailto:spambayes-dev-bounces at python.org] On Behalf Of G. Waleed Kavalec
Sent: Friday, July 02, 2004 1:51 PM
To: spambayes-dev at python.org
Subject: [spambayes-dev] Spam Clues: <>< STOP! Looking for anti
christianchristians



This thing won't die.

It doesn't even go to 'maybe'.

"What's up with that?"


 


Combined Score: 0% (3.16545e-005)


Internal ham score (*H*): 1
Internal spam score (*S*): 6.3309e-005

# ham trained on: 140
# spam trained on: 6168


150 Significant Tokens

token                               spamprob         #ham  #spam
'religions'                         0.027636            9      5
'christianity'                      0.0281306          10      7
'jesus,'                            0.0281306          10      7
'religion,'                         0.0282139          12     10
 