[spambayes-dev] Spam Clues: <>< STOP! Looking for anti
christianchristians
Kenny Pitt
kennypitt at hotmail.com
Fri Jul 2 15:50:52 EDT 2004
The most likely cause is the imbalance in your training data. SpamBayes
is most accurate when you train on approximately the same number of ham
messages as spam messages. A ratio of up to 5 to 1 or so is probably
fine, but your ratio is currently about 44 to 1 towards spam (6168 spam
vs. 140 ham), which will heavily bias all your results towards ham.
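As a quick sanity check on the numbers above, here is a minimal sketch
that computes the training ratio from the two message counts. The
function name training_ratio and the 5-to-1 threshold constant are my
own illustration (the threshold is just the rule of thumb mentioned
above), not anything from the SpamBayes code.

```python
# Rule-of-thumb limit from the discussion above; not a SpamBayes setting.
MAX_REASONABLE_RATIO = 5.0


def training_ratio(n_ham, n_spam):
    """Return (ratio, majority), where ratio >= 1 and majority is the
    class that has been trained on more often."""
    if n_ham <= 0 or n_spam <= 0:
        raise ValueError("need at least one ham and one spam message")
    if n_spam >= n_ham:
        return n_spam / n_ham, "spam"
    return n_ham / n_spam, "ham"


# Counts taken from the clues report quoted below.
ratio, majority = training_ratio(140, 6168)
print(f"{ratio:.1f} to 1 towards {majority}")
if ratio > MAX_REASONABLE_RATIO:
    print("training data is badly imbalanced")
```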
For example, the token "christianity" appears 10 times in ham and 7 times in
spam, roughly the same number of times. However, the spam probability of
that token is only .028 because the most basic component of the statistics
on which SpamBayes is based is the percentage of messages that contain the
token. This token appears in 10 out of 140 ham messages for a ham percentage
of 7.14%, and it appears in 7 out of 6168 spam messages for a spam
percentage of only 0.11%. The ham percentage is almost 63x larger than the
spam percentage.
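To make the arithmetic concrete, here is a small sketch of how a
per-token spam probability can be derived from those percentages. It
assumes a Robinson-style adjustment with an unknown-word strength of
0.45 and an unknown-word probability of 0.5; I believe those are the
SpamBayes defaults, but treat them (and the function name spamprob) as
assumptions rather than a copy of the actual classifier code.

```python
def spamprob(ham_count, spam_count, n_ham, n_spam, s=0.45, x=0.5):
    """Estimate the spam probability of one token.

    ham_count / spam_count: trained messages of each kind containing the token
    n_ham / n_spam:         total trained messages of each kind
    s, x:                   assumed unknown-word strength and probability
    """
    # Fraction of each class that contains the token.
    ham_ratio = ham_count / n_ham
    spam_ratio = spam_count / n_spam

    # Raw probability based only on the two fractions.
    prob = spam_ratio / (ham_ratio + spam_ratio)

    # Pull the estimate towards x when the token has been seen only rarely.
    n = ham_count + spam_count
    return (s * x + n * prob) / (s + n)


# Tokens and counts from the clues report quoted below.
for token, ham, spam in [("religions", 9, 5),
                         ("christianity", 10, 7),
                         ("religion,", 12, 10)]:
    print(f"{token!r:16} {spamprob(ham, spam, 140, 6168):.6f}")
```

Under these assumptions the results agree with the spamprob column in
the report below, and rerunning with balanced totals (say 6168 ham and
6168 spam) shows how much of the low score comes from the imbalance
alone.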
With an imbalance this large, your best bet is probably to delete your
training data and train again from scratch. Try starting out without feeding
SpamBayes any existing messages for initial training, and then train only on
mistakes and unsures. If you see several spam messages in your unsure folder
that look similar, try training on only one of them and deleting the rest to
avoid training on too many spams.
--
Kenny Pitt
_____
From: spambayes-dev-bounces at python.org
[mailto:spambayes-dev-bounces at python.org] On Behalf Of G. Waleed Kavalec
Sent: Friday, July 02, 2004 1:51 PM
To: spambayes-dev at python.org
Subject: [spambayes-dev] Spam Clues: <>< STOP! Looking for anti
christianchristians
This thing won't die.
It doesn't even go to 'maybe'.
"What's up with that?"
Combined Score: 0% (3.16545e-005)
Internal ham score (*H*): 1
Internal spam score (*S*): 6.3309e-005
# ham trained on: 140
# spam trained on: 6168
150 Significant Tokens

token            spamprob    #ham  #spam
'religions'      0.027636       9      5
'christianity'   0.0281306     10      7
'jesus,'         0.0281306     10      7
'religion,'      0.0282139     12     10