
[Eric S. Raymond]
> (Copied to Paul Graham.  Paul, this is the mailing list of the Python maintainers.  I thought you'd find the bits about lexical analysis in bogofilter interesting.  Pythonistas, Paul is one of the smartest and sanest people in the LISP community, as evidenced partly by the fact that he hasn't been too proud to learn some lessons from Python :-).  It would be a good thing for some bridges to be built here.)
Hi, Paul!  I believe Eric copied you on some concerns I had about the underpinnings of the algorithm, specifically about the final "is it spam?" computation.  Looking at your links, I bet you got the formula from here:

    http://www.mathpages.com/home/kmath267.htm

If so, the cause of the difficulty is that you inherited a subtle (because unstated) assumption from that writeup:

    I would suggest that we assume symmetry between "y" and "n".  In
    other words, assume that probability of predicting correctly is the
    same regardless of whether the correct answer is "y" or "n".

That part's fine.

    This implies p0=p7, p1=p6, p2=p5, and p3=p4,

But that part doesn't follow from *just* the stated assumptions:  note that those four equalities imply

    p0+p2+p4+p6 = p7+p5+p3+p1

But the left-hand side of that is the probability that event X does not occur (it's all the rows with 'n' in the 'R' column), and the right-hand side is the probability that event X does occur (it's all the rows with 'y' in the 'R' column).  In other words, this derivation also makes the stronger -- and unstated -- assumption that X occurs with probability 1/2.  The ultimate formula given on that page is correct if P(X)=0.5, but it turns out to be wrong if P(X) isn't 0.5.  Reality doesn't care how accurate Smith and Jones are:  X occurs with its own probability regardless of what they think.

Picture an extreme:  suppose reality is such that X *always* occurs.  Then p0 must be 0, and so must p2, p4 and p6 (the rows with 'n' in the R column can never happen if R is always 'y').  But then p0+p2+p4+p6 is 0 too, and the equality above implies p7+p5+p3+p1 is also 0.  We reach the absurd conclusion that if X always occurs, the probability that X occurs is 0.  As approximations to 1 go, 0 could stand some improvement <wink>.

The math is easy enough to repair, but the repair may percolate into other parts of your algorithm.  Chiefly, I *suspect* you found you needed to boost the "good count" by a factor of 2 because you actually have substantially more non-spam than spam in your inbox, and the scoring step was favoring "spam" more than it should have by neglecting to correct for the fact that your real-life P(spam) is significantly less than 0.5 (although your training corpora had about the same number of spams as non-spams, so P(spam)=0.5 was approximately true across your *training* data -- that's another complication).

Make sense?  Once our testing setup is trustworthy, I'll try it both ways and report on results.  In the meantime, it's something worth pondering.
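To make the repair concrete, here's a toy sketch in Python.  This is my own illustration, not code from bogofilter or from your article:  the function names and the numbers are made up.  It assumes each per-word probability approximates P(spam | word) as estimated from roughly balanced corpora (so each is effectively calibrated to an even prior), and that the words are independent -- the usual naive-Bayes leap:

    from math import prod  # Python 3.8+

    def combined_score_even_prior(probs):
        """The mathpages combination; correct only when P(spam) = 0.5."""
        s = prod(probs)
        h = prod(1.0 - p for p in probs)
        return s / (s + h)

    def combined_score(probs, p_spam):
        """The same evidence, with the real prior P(spam) folded in once.

        Under an even training prior, p/(1-p) is a word's likelihood
        ratio, so posterior odds = prior odds * product of those ratios.
        """
        s = p_spam * prod(probs)
        h = (1.0 - p_spam) * prod(1.0 - p for p in probs)
        return s / (s + h)

    probs = [0.9, 0.8, 0.6]  # made-up per-word spam probabilities
    print(combined_score_even_prior(probs))  # ~0.982
    print(combined_score(probs, 0.5))        # ~0.982 -- the priors agree
    print(combined_score(probs, 0.2))        # ~0.931 -- spam favored less

When P(spam)=0.5 the two formulas coincide, which is why the error is invisible on balanced training data; with a realistic P(spam) below 0.5, the corrected score comes out smaller, in just the direction the "boost the good count by 2" hack was groping toward.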