[spambayes-dev] The naive bayes classifier algorithm in
spambayes doesn't take in frequency?
kennypitt at hotmail.com
Tue Aug 31 23:36:18 CEST 2004
Austine Jane wrote:
> I have a question on the naive bayes classifier algorithm used in
> spambayes.
> I suppose if word1 appeared in three ham mail, the probability of
> word1 being in ham mail would be greater than when it appeared in one
> ham mail:
That depends. The statistics are based on the fraction of ham mail that the
word appeared in, not just the absolute number of times it has occurred. A
word that appeared in 1 ham message out of 1 would have the same probability
as a word that appeared in 3 messages out of 3. A word that appeared in 1
message out of 3 would have a lower probability.
> >>> def tok(s): return s.split()
> As you see the spam probability declines. So far so good.
> And word1 also appeared in one spam mail, but it appeared in three
> ham mail before.
> Hm... Sounds not very right.
> Stays still.
> This doesn't sound intuitive.
Assuming that you started from a clean database file and the training shown
in your example is the only training you've done, then this is exactly
right. If you go on to train the word as ham 1000 more times, you'll still
get 0.5. Here's why:
The base probability for a word is based on ratios:
p = spamratio / (spamratio + hamratio)
where spamratio is the number of spam messages that contained the word
divided by the total number of spam messages, and hamratio is the same but
using only the ham messages.
After training the word 3 times as ham, you had a hamratio of 3 / 3 = 1.0.
You had no spam messages, so your spam ratio was 0. This leads to:
p = 0 / (0 + 1) = 0
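That ratio arithmetic can be sketched in a few lines of Python (the function name `base_prob` is mine for illustration, not an actual SpamBayes function):

```python
def base_prob(spam_with_word, spam_total, ham_with_word, ham_total):
    """Base spam probability: spamratio / (spamratio + hamratio).

    Each ratio is the fraction of messages in that class that contained
    the word, not the total number of occurrences.
    """
    spamratio = spam_with_word / spam_total if spam_total else 0.0
    hamratio = ham_with_word / ham_total if ham_total else 0.0
    return spamratio / (spamratio + hamratio)

# Word seen in 3 of 3 ham messages, no spam trained yet:
print(base_prob(0, 0, 3, 3))  # -> 0.0

# Same word after it shows up in 1 of 1 spam messages:
print(base_prob(1, 1, 3, 3))  # -> 0.5
```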
Because a word that has been seen only a few times is not a good predictor,
an adjustment is made to the base probability based on the total number of
messages that contained the word:
n = hamcount + spamcount = 3 + 0 = 3
adj_p = ((S * X) + (n * p)) / (S + n)
where S and X are constants. S is the "unknown word strength" with a
default value of 0.45, and X is the "unknown word probability" with a
default value of 0.5 (these are configurable in SpamBayes). When you apply
this adjustment you can see how p = 0 becomes the 0.0652 that you saw, and
also why the value was slightly higher when you had 1 and 2 messages instead
of 3. You can also see that as n approaches infinity, the constant factors
of (S * X) and S become irrelevant, the n terms on top and bottom cancel
out, and you are left with p.
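You can check this adjustment numerically; a minimal sketch using the default S and X values mentioned above (`adjusted_prob` is my name for it, not SpamBayes API):

```python
S = 0.45  # "unknown word strength" (SpamBayes default)
X = 0.5   # "unknown word probability" (SpamBayes default)

def adjusted_prob(p, n, S=0.45, X=0.5):
    """Bayesian adjustment toward X for words seen in only n messages."""
    return (S * X + n * p) / (S + n)

# Base p = 0 after 3 ham trainings yields the 0.0652 from the example:
print(round(adjusted_prob(0.0, 3), 4))  # -> 0.0652

# With fewer messages the pull toward X = 0.5 is stronger:
print(round(adjusted_prob(0.0, 1), 4))  # higher than the n = 3 value
print(round(adjusted_prob(0.0, 2), 4))  # between the n = 1 and n = 3 values
```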
Now as soon as you trained the first instance of the word as spam, your
spamratio became 1 / 1 = 1 also, so your base p becomes:
p = 1.0 / (1.0 + 1.0) = 0.5
From this point forward, as long as you train only this one word, both your
hamratio and spamratio will always be 1 and p will always be 0.5. If you
train some different words and then calculate the spamprob of this word
again, then you will see it start to change from 0.5.
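A toy calculation makes the "pinned at 0.5" behavior concrete (my own sketch of the ratio math, not SpamBayes internals):

```python
def p_for(ham_msgs_with_word, total_ham):
    """Base p when the word is in 1 of 1 spams and the given share of hams."""
    spamratio = 1 / 1  # the word appeared in the only spam trained
    hamratio = ham_msgs_with_word / total_ham
    return spamratio / (spamratio + hamratio)

# If every ham message trained contains the word, hamratio stays 1.0,
# so p stays 0.5 no matter how many more hams you train:
for hams in (1, 10, 1000):
    print(p_for(hams, hams))  # -> 0.5 every time
```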
> For example, word1 occurred in 1000
> spam email and occurred in 1 ham mail. What is the probability of one
> mail that contains word1 being spam mail? Half and half?
Yes, the probability is 0.5 as long as it appeared in 1000 out of 1000 spam
mails and 1 out of 1 ham mail as in your example above. If, on the other
hand, the word appeared in 1000 out of 1000 spams and 1 out of 1000 hams
then the spam probability would be very different, approximately 0.999.
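Plugging both scenarios into the ratio formula shows the difference (again a sketch; `base_prob` is my name, not SpamBayes code):

```python
def base_prob(spam_with, spam_total, ham_with, ham_total):
    spamratio = spam_with / spam_total
    hamratio = ham_with / ham_total
    return spamratio / (spamratio + hamratio)

# 1000 of 1000 spams, 1 of 1 hams: both ratios are 1.0, so p = 0.5.
print(base_prob(1000, 1000, 1, 1))  # -> 0.5

# 1000 of 1000 spams, 1 of 1000 hams: hamratio is tiny, so p is near 1.
print(round(base_prob(1000, 1000, 1, 1000), 3))  # -> 0.999
```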
> Doesn't it take in the number
> of occurrences (it does seem to take in the number of distinct tokens
> though)? It seems like the concepts of the number of occurrences and
> the number of distinct tokens are mixed in spambayes' classifier.
No, it only counts each word once in a single mail message. The original
Paul Graham scheme (http://www.paulgraham.com/spam.html) from which
SpamBayes evolved counted the total number of occurrences of the word, but
early testing of SpamBayes showed that accuracy was better if we considered
only the number of messages that contained the word and not the total number
of times that the word appeared.
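The "count each word once per message" rule amounts to de-duplicating tokens within a message before updating the counts. A minimal illustration (my own sketch using whitespace splitting, not the actual SpamBayes tokenizer):

```python
from collections import Counter

def message_counts(messages):
    """Per-token count of how many MESSAGES contain it, not total occurrences."""
    counts = Counter()
    for msg in messages:
        # set() collapses repeated tokens within one message to a single hit
        counts.update(set(msg.split()))
    return counts

msgs = ["free free free offer", "free offer now"]
print(message_counts(msgs)["free"])   # -> 2 (messages), not 4 (occurrences)
print(message_counts(msgs)["offer"])  # -> 2
```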