[Spambayes] Critique of Graham's math

Tim Peters tim.one@comcast.net
Tue, 17 Sep 2002 18:58:06 -0400


<http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html>

appears to be a reasonable critique of Paul Graham's math.  I griped about
some of these things on Python-Dev before the results got so good that I
couldn't care much anymore <0.5 wink>.  Note that we've already removed most
of the biases in Graham's original formulation, but HAMBIAS still does too
much good to get rid of, and I'm still wrestling with trying to give more
weight to indicators with more evidence to back them up.
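
To make the HAMBIAS point concrete, here is a minimal sketch of a per-word
estimate along the lines of Graham's "A Plan for Spam", with the ham count
doubled. The function name word_spamprob and its parameter names are
hypothetical, as is returning UNKNOWN_SPAMPROB for words seen fewer than five
times; it assumes nham and nspam are both positive and is not a copy of our
update_probabilities().

```python
HAMBIAS = 2.0            # Graham's scheme doubles the ham count
UNKNOWN_SPAMPROB = 0.5   # the value this post says we use for unknown words


def word_spamprob(hamcount, spamcount, nham, nspam, hambias=HAMBIAS):
    """Estimate the spam probability of one word (illustrative sketch).

    hamcount/spamcount: times the word was seen in ham/spam training data.
    nham/nspam: number of ham/spam messages trained on (assumed > 0).
    """
    g = hambias * hamcount
    b = spamcount
    if g + b < 5:
        # Too little evidence; treating this as "unknown" is an assumption.
        return UNKNOWN_SPAMPROB
    hamratio = min(1.0, g / nham)
    spamratio = min(1.0, b / nspam)
    prob = spamratio / (hamratio + spamratio)
    # Clamp so that no single word is ever treated as certain.
    return max(0.01, min(0.99, prob))


biased = word_spamprob(5, 5, 100, 100)                 # ham count doubled
unbiased = word_spamprob(5, 5, 100, 100, hambias=1.0)  # no bias
```

With equal evidence on both sides, the unbiased estimate sits at 0.5, while
the biased one is pulled toward ham (about 0.33).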

Gary Robinson has other ideas worth testing, although he's mistaken in saying
"It is also worth noting that this reduces to Paul's original approach if
you set a=0":  it's another peculiarity of Graham's scheme that
update_probabilities() counts the number of times a word appears in a message,
but that spamprob() only looks at "no times, or at least once".  Robinson's
claim would be true if update_probabilities() (like spamprob()) also limited
itself to a pure "it's there or it's not there" distinction.
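
The mismatch can be seen in a small sketch that tallies the same training
messages both ways. The helper names count_occurrences and count_presence and
the toy data are hypothetical, made up for illustration; they are not the
project's functions.

```python
from collections import Counter


def count_occurrences(messages):
    """Tally every appearance of a token; repeats within a message all count."""
    counts = Counter()
    for tokens in messages:
        counts.update(tokens)
    return counts


def count_presence(messages):
    """Tally each token at most once per message: it's there or it's not."""
    counts = Counter()
    for tokens in messages:
        counts.update(set(tokens))
    return counts


# Toy training data: each message is a list of tokens.
spam = [["free", "free", "free", "money"], ["free", "offer"]]

by_occurrence = count_occurrences(spam)
by_presence = count_presence(spam)
```

Here "free" is counted 4 times by occurrence but only 2 times by presence, so
per-word probabilities built from the two tallies generally differ even when
scoring looks only at presence.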

Heh:  I see that Graham's own writeup has quietly changed what we call
UNKNOWN_SPAMPROB from 0.2 to 0.4.  I'm leaving it at our 0.5 <wink>.