[spambayes-dev] Bayesian classifier that uses Bayes factors

Olav o.laudy at fss.uu.nl
Fri Feb 24 14:00:52 CET 2006

Hello all,

I have a simple idea for the implementation of a Bayesian classifier that uses Bayes factors.

Suppose we have the word "viagra" in the following situation:

The word was found in 10 ham mails and not found in 20 ham mails (= 30 ham mails in total).
The word was found in 50 spam mails and not found in 30 spam mails (= 80 spam mails in total).

The current procedure is to calculate

b(w) = 50/(50+30)

and then


I suggest the following calculation: first add a prior value of 1 to each cell (so that unobserved words cause no problem), then calculate the log odds:

LogOdds = log( (11*31) / (21*51) )

The standard deviation is given by stdev = sqrt( 1/11+1/21+1/51+1/31 )
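
As a rough illustration of the two formulas above, here is a minimal Python sketch. The function name, argument names and the `prior` parameter are made up for this example and are not part of SpamBayes.

```python
import math

def log_odds_and_stdev(ham_with, ham_without, spam_with, spam_without, prior=1.0):
    """Return (LogOdds, stdev) for one word.

    `prior` is added to each of the four cells, as suggested in the text,
    so that a count of zero does not break the logarithm or the division.
    """
    a = ham_with + prior       # ham mails containing the word
    b = ham_without + prior    # ham mails without the word
    c = spam_with + prior      # spam mails containing the word
    d = spam_without + prior   # spam mails without the word
    log_odds = math.log((a * d) / (b * c))
    stdev = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_odds, stdev

# The "viagra" example: 10/20 in ham, 50/30 in spam.
log_odds, stdev = log_odds_and_stdev(10, 20, 50, 30)
print(log_odds, stdev)
```

With this orientation of the ratio, a word seen mostly in spam gives a negative LogOdds.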

The next step is to calculate the Bayes factor for the hypothesis that a word is a spam indicator versus the hypothesis that it is not:

help = pNorm(0, LogOdds, stdev)

where pNorm is, in the words of Gary, "the inverse normal function, used to derive a p-value from a normal-distributed random variable".

The Bayes factor is given by

BF = help / (1 - help)

The interpretation is simple: if the value is larger than 1, the word is
more likely to indicate spam. The number can be given a finer interpretation,
but for the moment the criterion is: larger than 1 = spam, smaller than 1 = ham.
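
A minimal Python sketch of the per-word calculation follows, under two assumptions. First, pNorm(q, mean, sd) is read as the normal cumulative distribution function P(X <= q), as in R's pnorm. Second, the Bayes factor is taken to be help / (1 - help), chosen because it gives roughly 4.3 for the H: 10/20, S: 30/40 case when no prior is added. The function name and the `prior` parameter are hypothetical.

```python
import math
from statistics import NormalDist  # Python 3.8+

def word_bayes_factor(ham_with, ham_without, spam_with, spam_without, prior=1.0):
    """Bayes factor that a word indicates spam (assumed form: help / (1 - help))."""
    a, b = ham_with + prior, ham_without + prior
    c, d = spam_with + prior, spam_without + prior
    log_odds = math.log((a * d) / (b * c))
    stdev = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # help = pNorm(0, LogOdds, stdev), read as P(X <= 0) for X ~ N(LogOdds, stdev).
    p_below = NormalDist(mu=log_odds, sigma=stdev).cdf(0.0)
    # 1 - help, computed as P(-X <= 0) to avoid cancellation when help is near 1.
    p_above = NormalDist(mu=-log_odds, sigma=stdev).cdf(0.0)
    return p_below / p_above

# The "viagra" example: 10/20 in ham, 50/30 in spam, with the prior of 1 per cell.
bf = word_bayes_factor(10, 20, 50, 30)
print(bf)
```

Calling `word_bayes_factor(10, 20, 30, 40, prior=0.0)` gives approximately 4.3. Note that with `prior=0.0` any zero count raises an error, which is the problem the prior is meant to avoid.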

For Bayes factors, the product rule applies: the total Bayes factor is the
product of the Bayes factors of the individual words in the email to be
classified:

BF_total=BF(word_1) * BF(word_2) *...* BF(word_n) 
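
The product rule could be sketched in Python as below. Summing logarithms instead of multiplying is a choice made here, not something from the proposal: it keeps a long email from overflowing or underflowing a float. The function names are hypothetical, and treating a total of exactly 1 as ham is an arbitrary tie-break.

```python
import math

def log_total_bayes_factor(word_bfs):
    """Return log(BF_total) = sum of log(BF(word_i)) over the words in the email."""
    return math.fsum(math.log(bf) for bf in word_bfs)

def classify(word_bfs):
    """BF_total > 1 means spam, which is the same as log(BF_total) > 0."""
    return "spam" if log_total_bayes_factor(word_bfs) > 0 else "ham"

# Per-word Bayes factors for a three-word email (illustrative values only).
print(classify([4.3, 1.5, 1.0]))
```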

Some values using a single word (H = ham, S = spam; counts are found/not found):

H: 10/10    S: 50/50    BF=1
H: 100/100  S: 500/500  BF=1
H: 1/2      S: 3/4      BF=1.5
H: 10/20    S: 30/40    BF=4.3
H: 3/10     S: 50/10    BF=very small

Any suggestions?

All the best,

Olav Laudy
