[spambayes-dev] Understanding Classifier Code
danieleloff at hotmail.com
Wed Feb 18 11:55:04 EST 2004
I've been looking at ways of increasing the speed at which the classifier
runs. I think that increasing the speed of the classifier would be one of
the things required to make spam-bayes computationally worthwhile to run on
a gateway for several thousand users. (others include keeping all word
records in memory, and writing the really tight parts into C or assembler)
No arguments yet please, I just need some help to understand what the
classifier is doing and why. I'll paste the code and interject when i don't
understand why something is happening that way. (I don't like to modify code
without understanding it fully, you're asking for trouble if you do that.) I
really appreciate any help you can give me on this, the more the merrier.
And if my team manages to get spambayes working well on our server we'll be
sure to share the modifications with you.
H = S = 1.0
Hexp = Sexp = 0
clues = self._getclues(wordstream)
for prob, word, record in clues:
S *= 1.0 - prob
H *= prob
if S < 1e-200: # prevent underflow
S, e = frexp(S)
Sexp += e
if H < 1e-200: # prevent underflow
H, e = frexp(H)
Hexp += e
Tell me, why a seperate spam/ham score at this point?
# Compute the natural log of the product = sum of the logs:
# ln(x * 2**i) = ln(x) + i * ln(2).
S = ln(S) + Sexp * LN2
H = ln(H) + Hexp * LN2
Okay i can see from the comment that this is equiv to ln(x * 2**i)
But why take the logarithim of the final spam/ham prob?
n = len(clues)
S = 1.0 - chi2Q(-2.0 * S, 2*n)
H = 1.0 - chi2Q(-2.0 * H, 2*n)
Why multiply the score by -2? why double n?
def chi2Q(x2, v, exp=_math.exp, min=min):
"""Return prob(chisq >= x2, with v degrees of freedom).
v must be even.
assert v & 1 == 0
# XXX If x2 is very large, exp(-m) will underflow to 0.
m = x2 / 2.0
sum = term = exp(-m)
for i in range(1, v//2):
term *= m / i
sum += term
What's going on here? Why do we take the exp() of -x2/2?
Why did we multiply x2 by 2 if we divide it by 2 again anyway?
What is the loop doing and why?
# With small x2 and large v, accumulated roundoff error, plus error in
# the platform exp(), can cause this to spill a few ULP above 1.0. For
# example, chi2Q(100, 300) on my box has sum == 1.0 + 2.0**-52 at this
# point. Returning a value even a teensy bit over 1.0 is no good.
return min(sum, 1.0)
Dream of owning a home? Find out how in the First-time Home Buying Guide.
More information about the spambayes-dev