[spambayes-dev] Bayesian classifier that uses Bayes factors
Olav
o.laudy at fss.uu.nl
Fri Feb 24 14:00:52 CET 2006
Hello all,
I have a simple idea for the implementation of a Bayesian classifier that uses Bayes factors.
Suppose we have the word "viagra" in the following situation:
The word was found in 10 ham mails and was not seen in 20 ham mails (30 ham mails in total).
The word was found in 50 spam mails and was not seen in 30 spam mails (80 spam mails in total).
The current procedure is to calculate
g(w) = 10/(10+20)
b(w) = 50/(50+30)
and then
p(w)=b(w)/(b(w)+g(w))
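For reference, the arithmetic of the current procedure can be sketched in a few lines of Python. The variable names are illustrative only and are not taken from the spambayes code.

```python
# Counts for the word "viagra", taken from the example above.
ham_with, ham_without = 10, 20
spam_with, spam_without = 50, 30

g = ham_with / (ham_with + ham_without)      # fraction of ham mails containing the word
b = spam_with / (spam_with + spam_without)   # fraction of spam mails containing the word
p = b / (b + g)                              # p(w) as defined above

print(f"g(w)={g:.4f}  b(w)={b:.4f}  p(w)={p:.4f}")
```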
I suggest the following calculation: first add a prior value of 1 to each cell (so that words not yet observed cause no problem), then calculate the log odds:
LogOdds = log((11*31) / (21*51))
The standard deviation is given by stdev = sqrt(1/11 + 1/21 + 1/51 + 1/31)
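A minimal sketch of this step, assuming the four cells are ordered as (ham with word, ham without word, spam with word, spam without word) and that the prior of 1 is simply added to each cell. The function name and the `prior` parameter are hypothetical.

```python
import math

def log_odds_and_stdev(ham_with, ham_without, spam_with, spam_without, prior=1.0):
    """Log odds and its standard deviation for a 2x2 word/class table."""
    # Add the prior to every cell so that no cell is zero.
    a = ham_with + prior
    b = ham_without + prior
    c = spam_with + prior
    d = spam_without + prior
    # Same orientation as in the text: (11*31) / (21*51) for the example counts.
    log_odds = math.log((a * d) / (b * c))
    stdev = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_odds, stdev

log_odds, stdev = log_odds_and_stdev(10, 20, 50, 30)
print(f"LogOdds={log_odds:.4f}  stdev={stdev:.4f}")
```

With this orientation a negative LogOdds means the word is relatively more common in spam than in ham.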
The next step is to calculate the Bayes factor for the word being a spam indicator versus not being a spam indicator:
help = pNorm(0, LogOdds, stdev)
where pNorm is, in the words of Gary, "the inverse normal function, used to derive a p-value from a normal-distributed random variable".
The Bayes factor is given by
BF=help/(1-help)
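A sketch of this step, under the assumption that pNorm(x, mean, sd) denotes the cumulative normal distribution function evaluated at x (as R's pnorm does). Here it is taken from Python's standard-library statistics.NormalDist; the function name bayes_factor is hypothetical.

```python
import math
from statistics import NormalDist

def bayes_factor(log_odds, stdev):
    """BF = help / (1 - help), with help = P(X < 0) for X ~ Normal(log_odds, stdev)."""
    # Assumption: pNorm(0, LogOdds, stdev) is the normal CDF at 0.
    help_ = NormalDist(mu=log_odds, sigma=stdev).cdf(0.0)
    return help_ / (1.0 - help_)

# Example counts from the text, with the prior of 1 already added to each cell.
log_odds = math.log((11 * 31) / (21 * 51))
stdev = math.sqrt(1 / 11 + 1 / 21 + 1 / 51 + 1 / 31)
bf = bayes_factor(log_odds, stdev)
print(f"BF={bf:.1f}")
```

Note that for very strong evidence help rounds to exactly 1.0 in floating point and the division fails, so a practical version would probably need to clamp help or work with logarithms.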
The interpretation is simple: if the value is larger than 1, the word is
more likely to indicate spam. The number can be given a richer
interpretation, but for the moment the criterion is: larger than 1 = spam,
smaller than 1 = ham.
For Bayes factors, the product rule applies: the total Bayes factor is the
product of the Bayes factors of the individual words in the email to be
classified.
BF_total=BF(word_1) * BF(word_2) *...* BF(word_n)
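The product rule could be sketched as follows. Summing logarithms instead of multiplying directly is a choice made here to avoid overflow or underflow with many words; it is not something the proposal itself specifies. The function name is hypothetical, and the per-word values are taken from the single-word table for illustration.

```python
import math

def log_total_bayes_factor(word_bfs):
    """Sum of log Bayes factors, which equals log(BF_total)."""
    return sum(math.log(bf) for bf in word_bfs)

# Illustrative per-word Bayes factors for one email.
word_bfs = [1.0, 1.5, 4.3]
log_total = log_total_bayes_factor(word_bfs)
bf_total = math.exp(log_total)

# BF_total > 1 is equivalent to log(BF_total) > 0.
verdict = "spam" if log_total > 0 else "ham"
print(f"BF_total={bf_total:.2f} -> {verdict}")
```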
Some values for a single word:
H: 10/10     S: 50/50     BF=1
H: 100/100   S: 500/500   BF=1
-----------------------------------
H: 1/2       S: 3/4       BF=1.5
H: 10/20     S: 30/40     BF=4.3
-----------------------------------
H: 3/10      S: 50/10     BF=very small
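The table rows can be checked with a short script. It rests on two assumptions that the post does not state: each pair is read as found/not found, and no prior is added to the cells. Under that reading the first four rows come out at about 1, 1, 1.6 and 4.3. The function name and the `prior` parameter are hypothetical.

```python
import math
from statistics import NormalDist

def word_bayes_factor(ham_with, ham_without, spam_with, spam_without, prior=0.0):
    """Bayes factor for one word from its 2x2 count table."""
    a, b, c, d = (n + prior for n in (ham_with, ham_without, spam_with, spam_without))
    log_odds = math.log((a * d) / (b * c))
    stdev = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # Assumption: pNorm(0, LogOdds, stdev) is the normal CDF at 0.
    help_ = NormalDist(mu=log_odds, sigma=stdev).cdf(0.0)
    return help_ / (1.0 - help_)

# Rows as (ham found, ham not found, spam found, spam not found).
rows = [(10, 10, 50, 50), (100, 100, 500, 500), (1, 2, 3, 4), (10, 20, 30, 40)]
for row in rows:
    print(row, round(word_bayes_factor(*row), 2))
```

Under the same reading the last row (3/10, 50/10) gives a value far above 1 rather than a very small one, so that row may follow a different convention.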
Any suggestions?
All the best,
Olav Laudy