[Spambayes] Paul Graham's math
Tim Peters
tim.one@comcast.net
Tue, 17 Sep 2002 19:51:50 -0400
[Tim]
> ...
> Running a 10-fold cross validation on a collection of 20,000 non-spam
> and 13,750 spam gives a mean false positive rate of 0.02%, and a mean
> false negative rate of 0.20%. There were a grand total of 4 false
> positives and 28 false negatives across the 10 runs.
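For reference, those percentages are just total errors over total messages
scored: a 10-fold cross validation predicts each message exactly once, and
with equal-sized folds the mean of the per-run rates equals the pooled rate.
Here's a minimal sketch of the arithmetic, with the counts from the quote
(illustrative only, not the actual test driver):

    n_ham, n_spam = 20000, 13750    # messages scored, once each
    false_pos, false_neg = 4, 28    # error totals across the 10 runs

    print("false positive rate: %.2f%%" % (100.0 * false_pos / n_ham))   # 0.02%
    print("false negative rate: %.2f%%" % (100.0 * false_neg / n_spam))  # 0.20%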
Sorry to follow up to my own post, but I think you'd find a histogram of the
score distributions revealing.  These are the final "probabilities"
multiplied by 100, broken into bins 2.5 wide (a sketch of the binning
follows the histograms):
Ham distribution for all runs:
* = 334 items
0.00 19992 ************************************************************
2.50 2 *
5.00 1 *
7.50 0
10.00 0
12.50 0
15.00 0
17.50 0
20.00 0
22.50 0
25.00 0
27.50 0
30.00 0
32.50 0
35.00 0
37.50 0
40.00 0
42.50 0
45.00 0
47.50 0
50.00 0
52.50 0
55.00 0
57.50 0
60.00 0
62.50 0
65.00 0
67.50 0
70.00 0
72.50 0
75.00 0
77.50 0
80.00 0
82.50 0
85.00 0
87.50 1 *
90.00 0
92.50 0
95.00 0
97.50 4 *
Spam distribution for all runs:
* = 229 items
0.00 25 *
2.50 1 *
5.00 1 *
7.50 0
10.00 0
12.50 0
15.00 0
17.50 1 *
20.00 0
22.50 0
25.00 0
27.50 0
30.00 0
32.50 0
35.00 0
37.50 0
40.00 0
42.50 0
45.00 0
47.50 0
50.00 0
52.50 0
55.00 0
57.50 0
60.00 0
62.50 0
65.00 0
67.50 0
70.00 0
72.50 0
75.00 0
77.50 0
80.00 0
82.50 0
85.00 0
87.50 0
90.00 2 *
92.50 1 *
95.00 11 *
97.50 13708 ************************************************************
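Here's a minimal sketch of the binning used above (illustrative, not the
actual tester code):  scores in [0.0, 1.0] are multiplied by 100 and dropped
into 40 bins 2.5 wide, and each bar is scaled so that one '*' stands for
ceil(biggest_bin / 60) items -- which reproduces the "* = 334 items" and
"* = 229 items" scales shown:

    def print_histogram(scores, nbins=40, width=60):
        # scores are classifier outputs in [0.0, 1.0]; bin i covers
        # [i * 2.5, (i + 1) * 2.5) on the 0-100 scale
        bins = [0] * nbins
        for score in scores:
            i = min(int(score * 100.0 / 2.5), nbins - 1)  # clamp 1.0 into the last bin
            bins[i] += 1
        per_star = max(1, -(-max(bins) // width))    # ceiling division
        print("* = %d items" % per_star)
        for i, count in enumerate(bins):
            stars = "*" * (-(-count // per_star))    # ceil, so nonzero bins get >= 1 star
            print("%5.2f %7d %s" % (i * 2.5, count, stars))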
So, e.g., we could have reduced the "spam cutoff" from 0.90 to 0.075 and
gotten only 1 more false positive. The separation is extremely sharp. I
should note that we've already made several changes to Graham's scheme, of
course (mostly eliminating biases).
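To make the tradeoff concrete, the error counts at any bin-edge cutoff can be
recomputed straight from the histograms.  A quick check (again illustrative;
the bin counts are transcribed from the histograms above):

    # Bin counts read off the two histograms (40 bins, 2.5 wide).
    ham_bins  = [19992, 2, 1] + [0] * 32 + [1, 0, 0, 0, 4]
    spam_bins = [25, 1, 1, 0, 0, 0, 0, 1] + [0] * 28 + [2, 1, 11, 13708]

    def errors_at_cutoff(cutoff):
        # cutoff is on the 0-100 scale and falls on a bin edge;
        # bin i covers [i * 2.5, (i + 1) * 2.5)
        first = int(cutoff / 2.5)
        false_pos = sum(ham_bins[first:])    # ham scoring at or above the cutoff
        false_neg = sum(spam_bins[:first])   # spam scoring below the cutoff
        return false_pos, false_neg

    print(errors_at_cutoff(90.0))   # (4, 28), matching the totals above
    print(errors_at_cutoff(7.5))    # (5, 27): 1 more fp, and 1 fewer fn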