[Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.7,1.8

Sat, 24 Aug 2002 15:29:03 -0400

[Eric S. Raymond]
> (Copied to Paul Graham.  Paul, this is the mailing list of the Python
> maintainers.  I thought you'd find the bits about lexical analysis in
> bogofilter interesting.  Pythonistas, Paul is one of the smartest and
> sanest people in the LISP community, as evidenced partly by the fact
> that he hasn't been too proud to learn some lessons from Python :-).
> It would be a good thing for some bridges to be built here.)

Hi, Paul!  I believe Eric copied you on some concerns I had about the
underpinnings of the algorithm, specifically about the final "is it spam?"
computation.

Looking at your links, I bet you got the formula from here:

    http://www.mathpages.com/home/kmath267.htm

If so, the cause of the difficulty is that you inherited a subtle (because
unstated) assumption from that writeup:

    I would suggest that we assume symmetry between "y" and "n".  In
    other words, assume that probability of predicting correctly is the
    same regardless of whether the correct answer is "y" or "n".

That part's fine.

    This implies p0=p7, p1=p6, p2=p5, and p3=p4,

But that part doesn't follow from *just* the stated assumptions:  note that
those four equalities imply that

    p0+p2+p4+p6 = p7+p5+p3+p1

But the left-hand side of that is the probability that event X does not
occur (it's all the rows with 'n' in the 'R' column), and the right-hand
side is the probability that event X does occur (it's all the rows with 'y'
in the 'R' column).  In other words, this derivation also makes the
stronger-- and unstated --assumption that X occurs with probability 1/2.
The ultimate formula given on that page is correct if P(X)=0.5, but turns
out it's wrong if P(X) isn't 0.5.

Reality doesn't care how accurate Smith and Jones are, X occurs with its own
probability regardless of what they think.  Picture an extreme:  suppose
reality is such that X *always* occurs.  Then p0 must be 0, and so must p2,
p4 and p6 (the rows with 'n' in the R column can never happen if R is always
'y').  But then p0+p2+p4+p6 is 0 too, and the equality above implies
p7+p5+p3+p1 is also 0.  We reach the absurd conclusion that if X always
occurs, the probability that X occurs is 0.  As approximations to 1 go, 0
could stand some improvement <wink>.

The math is easy enough to repair, but it may percolate into other parts of
your algorithm.  Chiefly, I *suspect* you found you needed to boost the
"good count" by a factor of 2 because you actually have substantially more
non-spam than spam in your inbox, and the scoring step was favoring "spam"
more than it should have by virtue of neglecting to correct for that your
real-life P(spam) is significantly less than 0.5 (although your training
corpora had about the same number of spams as non-spams, so that P(spam)=0.5
was aprroximately true across your *training* data -- that's another
complication).

Makes sense?  Once our testing setup is trustworthy, I'll try it both ways
and report on results.  In the meantime, it's something worth pondering.