[spambayes-dev] Mozilla SpamBayes "porting"

Tim Peters tim.one at comcast.net
Sun Feb 22 12:21:58 EST 2004


[Miguel]
> OK, I made all the suggested changes and re-tested.  The fn rate
> dropped by half, which is amazing considering that it
> was already about half of the original.  Unfortunately, the fp
> rate did not improve and might have even gone up a bit.
>
> To try to pinpoint my problem I've been trying to debug into
> classifier.py and feed it some numbers.  Unfortunately I
> don't know my way around the python debugger very well so I haven't
> been able to pull this off.
>
> Is there a kind Python soul in here that could help me with this?
> Feed these numbers into classifier.py and see if you get the same
> results
>
> ngood = 861, nbad = 759
>
> spam score = 0.809734
>
> token 1: hamcount = 13 spamcount = 103, prob=0.898333

...

> token 242: hamcount = 516 spamcount = 828, prob=0.645379

chi2.py has a showscore() function, which displays details about the
chi-squared combining calculation; e.g.,

>>> showscore([.1, .1, .1, .1])
P(chisq >=   0.842884 | v=  8) =   0.999059
P(chisq >=    18.4207 | v=  8) =  0.0182845
spam prob 0.000940523891325
 ham prob 0.981715504484
(S-H+1)/2 0.00961250970379
>>> showscore([.9, .9, .9, .9])
P(chisq >=    18.4207 | v=  8) =  0.0182845
P(chisq >=   0.842884 | v=  8) =   0.999059
spam prob 0.981715504484
 ham prob 0.000940523891325
(S-H+1)/2 0.990387490296
>>> showscore([.1, .1, .9, .9])
P(chisq >=    9.63178 | v=  8) =   0.291827
P(chisq >=    9.63178 | v=  8) =   0.291827
spam prob 0.708173451976
 ham prob 0.708173451976
(S-H+1)/2 0.5
>>>
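
If you don't have a spambayes checkout handy, the numbers above can be
reproduced with a minimal sketch along these lines.  This is not the chi2.py
source:  the names chi2q and chi_combine are made up for illustration, and the
formulas are inferred from the showscore() output above (S is 1 minus the tail
probability of -2*sum(ln(1-p)), H is 1 minus the tail probability of
-2*sum(ln(p)), both with v = 2*n degrees of freedom).

```python
import math

def chi2q(x2, v):
    """Return P(chisq >= x2) with v degrees of freedom; v must be even."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)          # the i = 0 term of the series
    for i in range(1, v // 2):           # terms i = 1 .. v/2 - 1
        term *= m / i
        total += term
    return min(total, 1.0)               # guard against roundoff above 1.0

def chi_combine(probs):
    """Hypothetical helper: return (S, H, (S-H+1)/2) for a list of probs."""
    n = len(probs)
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return s, h, (s - h + 1.0) / 2.0
```

Calling chi_combine([.1, .1, .1, .1]) should give a final score close to the
0.0096... shown by showscore() above, and a half-and-half list like
[.1, .1, .9, .9] lands at 0.5.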

Sticking your email msg into a string called 'data', then running this
Python snippet:

"""
import re
parse = re.compile(r'prob=([\d.]+)')

probs = [float(prob) for prob in parse.findall(data)]
print "found", len(probs), "probs"
print "first", probs[0], "last", probs[-1]

import sys
sys.path.insert(0, '/code/spambayes') # season to taste
from spambayes.chi2 import showscore
showscore(probs)
"""

printed this:

found 131 probs
first 0.898333 last 0.645379
P(chisq >=    271.809 | v=262) =   0.325528
P(chisq >=    220.223 | v=262) =   0.971459
spam prob 0.674472123467
 ham prob 0.0285410060406
(S-H+1)/2 0.822965558713

So the code in this project would have given a higher spamprob (0.822...)
than your code got (0.809...).  This could very well be due to the
off-by-one error in your chi2Q function.  Indeed, if I change chi2.py's
chi2Q's loop to

    for i in range(1, v//2 + 1):

then the output changes to

found 131 probs
first 0.898333 last 0.645379
P(chisq >=    271.809 | v=262) =   0.357377
P(chisq >=    220.223 | v=262) =   0.976845
spam prob 0.642623035989
 ham prob 0.0231547641273
(S-H+1)/2 0.809734135931

which is an excellent match to what you reported.  You can verify that the
chi-squared values we actually compute are correct by, e.g., using one of
the interactive chi-squared calculators on the web.  For example,

    http://www.fourmilab.ch/rpkp/experiments/analysis/chiCalc.html
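
The series can also be sanity-checked without a web calculator.  For even v,
P(chisq >= x2) is exp(-m) times the sum of m**i/i! for i = 0 .. v/2 - 1, where
m = x2/2.  Running the loop one step further adds the i = v/2 term, which is
the same as asking about v+2 degrees of freedom, and that is why the
off-by-one version reports larger P values.  A rough sketch, with a
hypothetical function name and an extra_terms knob that exists only for this
illustration:

```python
import math

def chi2q_series(x2, v, extra_terms=0):
    """P(chisq >= x2) for even v; extra_terms=1 mimics the off-by-one loop."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2 + extra_terms):
        term *= m / i
        total += term
    return min(total, 1.0)

# With v = 2 the correct answer is just exp(-x2/2).
correct = chi2q_series(3.0, 2)

# One extra term gives exp(-m) * (1 + m), i.e. the v = 4 answer.
off_by_one = chi2q_series(3.0, 2, extra_terms=1)
```

Here off_by_one equals chi2q_series(3.0, 4), so a chi2Q with the longer loop
is answering the question for the wrong number of degrees of freedom.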



