[spambayes-dev] Mozilla SpamBayes "porting"
Tim Peters
tim.one at comcast.net
Sun Feb 22 12:21:58 EST 2004
[Miguel]
> OK, I made all the suggested changes and re-tested. The fn rate
> dropped by half, which is amazing considering that it
> was already about half of the original. Unfortunately, the fp
> rate did not improve and might have even gone up a bit.
>
> To try to pinpoint my problem I've been trying to debug into
> classifier.py and feed it some numbers. Unfortunately I
> don't know my way around the Python debugger very well so I haven't
> been able to pull this off.
>
> Is there a kind Python soul in here that could help me with this?
> Feed these numbers into classifier.py and see if you get the same
> results.
>
> ngood = 861, nbad = 759
>
> spam score = 0.809734
>
> token 1: hamcount = 13 spamcount = 103, prob=0.898333
...
> token 242: hamcount = 516 spamcount = 828, prob=0.645379
chi2.py has a showscore() function, which displays details about the
chi-squared combining calculation; e.g.,
>>> showscore([.1, .1, .1, .1])
P(chisq >= 0.842884 | v= 8) = 0.999059
P(chisq >= 18.4207 | v= 8) = 0.0182845
spam prob 0.000940523891325
ham prob 0.981715504484
(S-H+1)/2 0.00961250970379
>>> showscore([.9, .9, .9, .9])
P(chisq >= 18.4207 | v= 8) = 0.0182845
P(chisq >= 0.842884 | v= 8) = 0.999059
spam prob 0.981715504484
ham prob 0.000940523891325
(S-H+1)/2 0.990387490296
>>> showscore([.1, .1, .9, .9])
P(chisq >= 9.63178 | v= 8) = 0.291827
P(chisq >= 9.63178 | v= 8) = 0.291827
spam prob 0.708173451976
ham prob 0.708173451976
(S-H+1)/2 0.5
>>>
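For anyone porting this, the arithmetic behind those numbers can be sketched
in a few lines. This is a simplified stand-in, not the project's chi2.py: the
name combine() is made up for illustration, and it takes logarithms directly
rather than guarding against underflow, so treat it as a sketch that assumes a
modest number of probabilities strictly between 0 and 1.

```python
import math

def chi2Q(x2, v):
    """Return P(chisq >= x2) for an even number v of degrees of freedom."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    # v//2 terms in all: the exp(-m) above plus v//2 - 1 more from the loop.
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Hypothetical helper: return (spam prob, ham prob, (S-H+1)/2)."""
    n = len(probs)
    # Each statistic is -2 * sum of logs, with 2*n degrees of freedom.
    spam_chisq = -2.0 * sum(math.log(1.0 - p) for p in probs)
    ham_chisq = -2.0 * sum(math.log(p) for p in probs)
    S = 1.0 - chi2Q(spam_chisq, 2 * n)
    H = 1.0 - chi2Q(ham_chisq, 2 * n)
    return S, H, (S - H + 1.0) / 2.0

S, H, score = combine([.9, .9, .9, .9])
print("spam prob", S)
print("ham prob", H)
print("(S-H+1)/2", score)
```

Run on the three small lists above, this should reproduce the spam prob, ham
prob, and final score lines to within floating-point rounding.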
Sticking your email msg into a string called 'data', then running this
Python snippet:
"""
import re
parse = re.compile(r'prob=([\d.]+)')
probs = [float(prob) for prob in parse.findall(data)]
print "found", len(probs), "probs"
print "first", probs[0], "last", probs[-1]
import sys
sys.path.insert(0, '/code/spambayes') # season to taste
from spambayes.chi2 import showscore
showscore(probs)
"""
printed this:
found 131 probs
first 0.898333 last 0.645379
P(chisq >= 271.809 | v=262) = 0.325528
P(chisq >= 220.223 | v=262) = 0.971459
spam prob 0.674472123467
ham prob 0.0285410060406
(S-H+1)/2 0.822965558713
So the code in this project would have given a higher spamprob (0.822...)
than your code got (0.809...). This could very well be due to the
off-by-one error in your chi2Q function. Indeed, if I change the loop in
chi2.py's chi2Q to
for i in range(1, v//2 + 1):
then the output changes to
found 131 probs
first 0.898333 last 0.645379
P(chisq >= 271.809 | v=262) = 0.357377
P(chisq >= 220.223 | v=262) = 0.976845
spam prob 0.642623035989
ham prob 0.0231547641273
(S-H+1)/2 0.809734135931
which is an excellent match to what you reported. You can verify that the
chi-squared values we actually compute are correct by, e.g., using one of
the interactive chi-squared calculators on the web. For example,
http://www.fourmilab.ch/rpkp/experiments/analysis/chiCalc.html
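If a web calculator isn't handy, the smallest case makes a quick sanity check,
because with v=2 the tail probability has the closed form exp(-x2/2). The
sketch below is not the project's code: the extra_term flag is a made-up
parameter that switches the loop bound to v//2 + 1 so both variants can be
compared side by side. Since the series for even v has exactly v//2 terms,
running one extra iteration should give the answer for v+2 degrees of freedom
instead of v.

```python
import math

def chi2Q(x2, v, extra_term=False):
    """P(chisq >= x2) for even v; extra_term=True mimics the off-by-one loop."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    # Hypothetical switch: the off-by-one variant runs one iteration too many.
    stop = v // 2 + 1 if extra_term else v // 2
    for i in range(1, stop):
        term *= m / i
        total += term
    return min(total, 1.0)

x2 = 3.0
exact = math.exp(-x2 / 2.0)            # closed form for v = 2
good = chi2Q(x2, 2)                    # loop body never runs for v = 2
bad = chi2Q(x2, 2, extra_term=True)    # picks up one extra term

print("closed form ", exact)
print("v//2 bound  ", good)
print("v//2+1 bound", bad)
print("v=4 value   ", chi2Q(x2, 4))
```

The extra term always pushes the tail probability up, which is the same
direction as the change in the P(chisq >= ...) lines between the two runs
above.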