[Spambayes] Paul Graham's math

Tim Peters tim.one@comcast.net
Tue, 17 Sep 2002 19:51:50 -0400


[Tim]
> ...
> Running a 10-fold cross validation on a collection of 20,000 non-spam
> and 13,750 spam gives a mean false positive rate of 0.02%, and a mean
> false negative rate of 0.20%.  There were a grand total of 4 false
> positives and 28 false negatives across the 10 runs.

Sorry to follow up to my own post, but I think you'd find a histogram of the
score distributions revealing.  These are the final "probabilities"
multiplied by 100, and broken into bins 2.5 wide:

Ham distribution for all runs:
* = 334 items
  0.00 19992 ************************************************************
  2.50     2 *
  5.00     1 *
  7.50     0
 10.00     0
 12.50     0
 15.00     0
 17.50     0
 20.00     0
 22.50     0
 25.00     0
 27.50     0
 30.00     0
 32.50     0
 35.00     0
 37.50     0
 40.00     0
 42.50     0
 45.00     0
 47.50     0
 50.00     0
 52.50     0
 55.00     0
 57.50     0
 60.00     0
 62.50     0
 65.00     0
 67.50     0
 70.00     0
 72.50     0
 75.00     0
 77.50     0
 80.00     0
 82.50     0
 85.00     0
 87.50     1 *
 90.00     0
 92.50     0
 95.00     0
 97.50     4 *

Spam distribution for all runs:
* = 229 items
  0.00    25 *
  2.50     1 *
  5.00     1 *
  7.50     0
 10.00     0
 12.50     0
 15.00     0
 17.50     1 *
 20.00     0
 22.50     0
 25.00     0
 27.50     0
 30.00     0
 32.50     0
 35.00     0
 37.50     0
 40.00     0
 42.50     0
 45.00     0
 47.50     0
 50.00     0
 52.50     0
 55.00     0
 57.50     0
 60.00     0
 62.50     0
 65.00     0
 67.50     0
 70.00     0
 72.50     0
 75.00     0
 77.50     0
 80.00     0
 82.50     0
 85.00     0
 87.50     0
 90.00     2 *
 92.50     1 *
 95.00    11 *
 97.50 13708 ************************************************************
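
For anyone wanting to produce listings like the two above from their own
scores, here is a minimal sketch.  It is not the code that generated these
histograms; the function name histogram() and its parameters are made up for
illustration, and it assumes scores are floats in [0.0, 1.0], 40 bins, and a
60-star cap on the longest bar, with any nonzero bin getting at least one
star.

```python
import math

def histogram(scores, nbuckets=40, width=60):
    """Bucket scores in [0.0, 1.0] and render a text histogram.

    Returns (buckets, lines): the raw per-bin counts and the printable lines.
    """
    buckets = [0] * nbuckets
    for s in scores:
        # A score of exactly 1.0 is folded into the last bin.
        i = min(int(s * nbuckets), nbuckets - 1)
        buckets[i] += 1

    # Items per star, rounded up so the tallest bar fits in `width` stars.
    per_star = max(1, math.ceil(max(buckets) / width))

    lines = ["* = %d items" % per_star]
    for i, n in enumerate(buckets):
        # Rounding up means any nonzero bin shows at least one star.
        stars = "*" * math.ceil(n / per_star)
        label = 100.0 * i / nbuckets  # lower edge of the bin, times 100
        lines.append(("%6.2f %5d %s" % (label, n, stars)).rstrip())
    return buckets, lines


if __name__ == "__main__":
    _, lines = histogram([0.0, 0.01, 0.99, 1.0])
    print("\n".join(lines))
```

Rounding the bar length up rather than down is what keeps a bin holding a
single message visible next to one holding nearly 20,000.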

So, e.g., we could have reduced the "spam cutoff" from 0.90 to 0.075 (90 and
7.5 on the scale above) and gotten only 1 more false positive, namely the
lone ham in the 87.50 bin.  The separation is extremely sharp.  I should note
that we've already made several changes to Graham's scheme, of course (mostly
eliminating biases).
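
The cutoff arithmetic can be checked straight from the bin counts.  The
sketch below transcribes the nonzero bins from the two histograms above; the
names HAM, SPAM and errors_at() are made up for illustration, and it assumes
a message is called spam when its score is at or above the cutoff.  Because
only binned counts are available, the result is exact only for cutoffs that
fall on a bin edge.

```python
# Nonzero bins from the histograms: {bin lower edge (score * 100): count}.
HAM = {0.0: 19992, 2.5: 2, 5.0: 1, 87.5: 1, 97.5: 4}
SPAM = {0.0: 25, 2.5: 1, 5.0: 1, 17.5: 1,
        90.0: 2, 92.5: 1, 95.0: 11, 97.5: 13708}

def errors_at(cutoff):
    """Return (false positives, false negatives) for a cutoff on the 0-100 scale.

    Assumes "score >= cutoff" means spam; exact only when cutoff is a bin edge.
    """
    # Ham in bins at or above the cutoff is wrongly called spam.
    fp = sum(n for edge, n in HAM.items() if edge >= cutoff)
    # Spam in bins below the cutoff is wrongly called ham.
    fn = sum(n for edge, n in SPAM.items() if edge < cutoff)
    return fp, fn

if __name__ == "__main__":
    n_ham = sum(HAM.values())
    n_spam = sum(SPAM.values())
    for cutoff in (90.0, 7.5):
        fp, fn = errors_at(cutoff)
        print("cutoff %5.1f: %d fp (%.3f%%), %d fn (%.3f%%)"
              % (cutoff, fp, 100.0 * fp / n_ham, fn, 100.0 * fn / n_spam))
```

At a cutoff of 90 this gives 4 false positives and 28 false negatives,
matching the totals quoted at the top; at 7.5 it gives 5 and 27.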