[spambayes-dev] Bayesian classifier that uses Bayes factors

Fri Feb 24 14:00:52 CET 2006

Hello all,

I have a simple idea for the implementation of a Bayesian classifier that uses Bayes factors.

Suppose we have the word "viagra" in the following situation:

The word was found in 10 ham mails and was not seen in 20 ham mails (30 ham mails in total).
The word was found in 50 spam mails and was not seen in 30 spam mails (80 spam mails in total).

The current procedure is to calculate

g(w) = 10/(10+20)
b(w) = 50/(50+30)

and then

p(w) = b(w)/(b(w)+g(w))
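
As a quick check of the arithmetic, here is a minimal sketch of the current per-word score for the counts above. The variable names are only illustrative and are not taken from the spambayes code.

```python
# Counts for the word "viagra" from the example above.
ham_found, ham_absent = 10, 20
spam_found, spam_absent = 50, 30

# Fraction of ham mails and of spam mails that contain the word.
g = ham_found / (ham_found + ham_absent)
b = spam_found / (spam_found + spam_absent)

# Per-word spam score as defined above.
p = b / (b + g)

print(f"g(w)={g:.3f}  b(w)={b:.3f}  p(w)={p:.3f}")
```

For these counts p(w) comes out at about 0.65.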

I suggest the following calculation: first add a prior value of 1 to each cell (so that words not yet observed cause no problems), then calculate the log odds ratio:

LogOdds = log((11*31) / (21*51))

The standard deviation of LogOdds is given by stdev = sqrt(1/11 + 1/21 + 1/51 + 1/31)
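
A minimal sketch of this step, assuming the four cells are ordered as ham found, ham not found, spam found, spam not found. The function name and the `prior` parameter are hypothetical and only serve the illustration.

```python
import math

def log_odds_and_stdev(ham_found, ham_absent, spam_found, spam_absent, prior=1):
    """Log odds ratio and its standard deviation after adding `prior` to each cell."""
    a = ham_found + prior    # 11 in the example
    b = ham_absent + prior   # 21
    c = spam_found + prior   # 51
    d = spam_absent + prior  # 31
    log_odds = math.log((a * d) / (b * c))
    stdev = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_odds, stdev

log_odds, stdev = log_odds_and_stdev(10, 20, 50, 30)
print(f"LogOdds={log_odds:.3f}  stdev={stdev:.3f}")
```

For the example counts this gives a LogOdds of about -1.14 and a stdev of about 0.44. With this sign convention a negative value means the word is relatively more frequent in spam.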

The next step is to calculate the Bayes factor for the word being a spam indicator versus not being a spam indicator:

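help = pNorm(0, LogOdds, stdev)

where pNorm is, in the words of Gary, "the inverse normal function, used to derive a p-value from a normal-distributed random variable". Here it is evaluated at 0 for a normal distribution with mean LogOdds and standard deviation stdev.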

The Bayes factor is then given by

BF = help/(1-help)
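
The whole per-word calculation can be sketched as follows. This assumes pNorm is the normal cumulative distribution function evaluated at 0 with mean LogOdds and standard deviation stdev, and uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) for it. The function name `bayes_factor` is hypothetical.

```python
import math
from statistics import NormalDist

def bayes_factor(ham_found, ham_absent, spam_found, spam_absent, prior=1):
    """Bayes factor for one word: values above 1 point to spam, below 1 to ham."""
    a = ham_found + prior
    b = ham_absent + prior
    c = spam_found + prior
    d = spam_absent + prior
    log_odds = math.log((a * d) / (b * c))
    stdev = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # Probability that the log odds ratio lies below 0 (assumed meaning of pNorm).
    help_value = NormalDist(mu=log_odds, sigma=stdev).cdf(0)
    # Note: for very extreme counts help_value can round to 1.0, making this divide by zero.
    return help_value / (1 - help_value)

bf = bayes_factor(10, 20, 50, 30)
print(f"BF={bf:.1f}")
```

For the "viagra" counts the result is well above 1, so the word would count as a spam indicator under this scheme.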

The interpretation is simple: if the value is larger than 1, the word is more
likely to indicate spam. The number can be given a finer interpretation, but for
the moment the criterion is: larger than 1 = spam, smaller than 1 = ham.

For Bayes factors, the product rule applies: the total Bayes factor is the
product of the Bayes factors of all the individual words in the email to be
classified.

BF_total = BF(word_1) * BF(word_2) * ... * BF(word_n)
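
A minimal sketch of the combination step. Summing logarithms instead of multiplying directly is my own choice here, made to avoid overflow or underflow with many words; the function name and the sample factors are illustrative only.

```python
import math

def total_log_bayes_factor(word_factors):
    """Sum of log Bayes factors, equal to log(BF_total)."""
    return sum(math.log(bf) for bf in word_factors)

# Hypothetical per-word Bayes factors for one email.
word_factors = [4.3, 1.5, 1.0]

log_total = total_log_bayes_factor(word_factors)
bf_total = math.exp(log_total)

# BF_total > 1 is the same test as log(BF_total) > 0.
verdict = "spam" if log_total > 0 else "ham"
print(f"BF_total={bf_total:.2f} -> {verdict}")
```

With these three factors the product is 6.45, so the email would be classified as spam.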

Some values using 1 word (H: ham mails with/without the word, S: spam mails with/without the word):

H: 10/10     S:50/50    BF=1
H: 100/100 S:500/500 BF=1
-----------------------------------
H: 1/2     S:3/4    BF=1.5
H: 10/20 S:30/40 BF=4.3
-----------------------------------
H: 3/10 S:50/10  BF=very large
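
The rows above can be recomputed with a short script. It assumes that each pair reads found/not found and that pNorm is the normal cumulative distribution function; the `prior` argument is added only for comparison.

```python
import math
from statistics import NormalDist

def bayes_factor(ham_found, ham_absent, spam_found, spam_absent, prior):
    """Per-word Bayes factor from the four cell counts plus a prior per cell."""
    a, b, c, d = (n + prior for n in (ham_found, ham_absent, spam_found, spam_absent))
    log_odds = math.log((a * d) / (b * c))
    stdev = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    help_value = NormalDist(mu=log_odds, sigma=stdev).cdf(0)
    return help_value / (1 - help_value)

rows = [(10, 10, 50, 50), (100, 100, 500, 500),
        (1, 2, 3, 4), (10, 20, 30, 40), (3, 10, 50, 10)]

for row in rows:
    no_prior = bayes_factor(*row, prior=0)
    with_prior = bayes_factor(*row, prior=1)
    print(f"H: {row[0]}/{row[1]}  S: {row[2]}/{row[3]}  "
          f"BF(prior=0)={no_prior:.3g}  BF(prior=1)={with_prior:.3g}")
```

In this sketch a prior of 0 gives roughly 1.57 and 4.34 for the two middle rows, close to the listed 1.5 and 4.3, while a prior of 1 gives roughly 1.29 and 3.93. The last row comes out far above 1 under this sign convention.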

Any suggestions?

All the best,

Olav Laudy
