[Spambayes] There Can Be Only One
Guido van Rossum
guido@python.org
Tue, 24 Sep 2002 22:50:44 -0400
> Wow! 0.625 is the largest reported "best value" to date (Neil and I
> both needed 0.6 once; the suggested 0.55 was optimal for my run --
> which is why I suggested it <wink>).
0.6 is the *lowest* that works for me -- any lower and Graham
wins on both fp's and fn's.
> > (after looking at the histograms of a trial run).
> >
> > Net result:
> >
> > Graham: 8 fp's, 13 fn's.
> > Gary: 7 fp's, 13 fn's.
> >
> > I ran the same two tests with a different random number:
> >
> > Graham: 8 fp's, 11 fn's.
> > Gary: 6 fp's, 16 fn's.
> >
> > According to the histogram, Gary would have given 8 fp's and 10 fn's
> > with a cutoff of 0.6, again beating Graham with the smallest margin.
>
> Have you used cmp.py? It generates side-by-side listings comparing
> two runs, and it's extremely important to know how often each scheme
> beat the other; the summary numbers reveal nothing about that. For
> this particular test, the tail end of the cmp.py output (with the
> changes in ham and spam score means and sdevs) is useless. When
> tweaking parameters for a single scheme, though, all the cmp.py
> output is dripping with valuable clues.
Ok. For the first pair:
false positive percentages
0.500 1.000 lost +100.00%
0.000 0.500 lost +(was 0)
0.500 0.500 tied
1.000 0.500 won -50.00%
1.000 0.500 won -50.00%
0.000 0.000 tied
0.500 0.000 won -100.00%
0.000 0.000 tied
0.500 0.500 tied
0.000 0.000 tied
won 3 times
tied 5 times
lost 2 times
total unique fp went from 8 to 7 won -12.50%
mean fp % went from 0.4 to 0.35 won -12.50%
false negative percentages
1.000 0.000 won -100.00%
1.500 1.500 tied
0.500 0.500 tied
0.500 0.000 won -100.00%
0.500 0.000 won -100.00%
0.500 0.500 tied
1.000 1.000 tied
0.500 1.000 lost +100.00%
0.500 2.000 lost +300.00%
0.000 0.000 tied
won 3 times
tied 5 times
lost 2 times
total unique fn went from 13 to 13 tied
mean fn % went from 0.65 to 0.65 tied
For the second pair:
false positive percentages
0.500 0.500 tied
0.000 0.000 tied
0.000 0.000 tied
0.500 0.000 won -100.00%
0.500 0.500 tied
0.500 0.500 tied
0.000 0.000 tied
0.500 0.500 tied
0.000 0.000 tied
1.500 1.000 won -33.33%
won 2 times
tied 8 times
lost 0 times
total unique fp went from 8 to 6 won -25.00%
mean fp % went from 0.4 to 0.3 won -25.00%
false negative percentages
0.500 0.500 tied
1.000 0.500 won -50.00%
0.000 0.500 lost +(was 0)
1.000 1.000 tied
0.000 1.000 lost +(was 0)
1.000 1.000 tied
0.500 1.000 lost +100.00%
0.000 0.000 tied
0.500 0.500 tied
1.000 2.000 lost +100.00%
won 1 times
tied 5 times
lost 4 times
total unique fn went from 11 to 16 lost +45.45%
mean fn % went from 0.55 to 0.8 lost +45.45%
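The won/tied/lost tallies and percentage changes in these listings can be reproduced by hand. What follows is a minimal sketch, not cmp.py itself: the function name `compare` and the output formatting are assumptions chosen to match the listings, and the input numbers are the false positive percentages of the first pair, copied from above (left column Graham, right column Gary).

```python
def compare(before, after):
    """Classify one before/after pair of error rates (lower is better)."""
    if after == before:
        return "tied", ""
    tag = "won" if after < before else "lost"
    if before == 0:
        # A relative change from zero is undefined; the listings show "+(was 0)".
        return tag, "+(was 0)"
    return tag, f"{(after - before) / before * 100:+.2f}%"


# False positive percentages per run, first pair (from the listing above).
graham = [0.5, 0.0, 0.5, 1.0, 1.0, 0.0, 0.5, 0.0, 0.5, 0.0]
gary = [1.0, 0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0]

rows = [compare(b, a) for b, a in zip(graham, gary)]
tally = {t: sum(1 for tag, _ in rows if tag == t) for t in ("won", "tied", "lost")}

mean_before = sum(graham) / len(graham)
mean_after = sum(gary) / len(gary)

for (b, a), (tag, change) in zip(zip(graham, gary), rows):
    print(f"{b:.3f} {a:.3f} {tag} {change}".rstrip())
print(tally)
print("mean fp %", mean_before, "->", mean_after, *compare(mean_before, mean_after))
```

The per-run tally is the point of the exercise: two schemes with nearly identical totals can still differ in how consistently one beats the other across runs.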
> > I found one more spam in my ham (but didn't remove it between these
> > runs). I also found 10 empty messages in Bruce Guenter's spam
> > archives! (1 in 2002/01, 2 in 2002/05, 7 in 2002/06.) Tim, I presume
> > you cleaned these out long ago?
>
> No, I've left empty messages in both my corpora. Although I'm unclear on
> what you mean by "empty". I mean they have no body, but do have message
> headers. Sometimes a c.l.py ham consists solely of a question in the
> Subject line!
No, I have 10 files that have length 0 (i.e. no headers and no body)
in BruceG's original bz2 files. I checked, and the tar listing has
these too.
Here are the results for my first Gary variation: changing
max_discriminators to 1500, while keeping the cutoff at 0.60:
false positive percentages
0.500 0.500 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.500 0.500 tied
0.500 0.500 tied
0.000 0.500 lost +(was 0)
0.500 0.500 tied
0.000 0.000 tied
1.000 1.500 lost +50.00%
won 0 times
tied 8 times
lost 2 times
total unique fp went from 6 to 8 lost +33.33%
mean fp % went from 0.3 to 0.4 lost +33.33%
false negative percentages
0.500 0.000 won -100.00%
0.500 0.500 tied
0.500 0.500 tied
1.000 0.500 won -50.00%
1.000 0.000 won -100.00%
1.000 1.000 tied
1.000 0.500 won -50.00%
0.000 0.000 tied
0.500 0.500 tied
2.000 1.500 won -25.00%
won 5 times
tied 5 times
lost 0 times
total unique fn went from 16 to 10 won -37.50%
mean fn % went from 0.8 to 0.5 won -37.50%
The 60.0 bins in the histogram have 2 hams and 7 spams, so moving the
cutoff to 0.625 would have made it a tie for fps and a loss by 1 for
fns.
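The arithmetic behind that estimate, as a small sketch. It assumes that a message scoring at or above the cutoff is called spam, and that the 60.0 bin covers scores from 0.60 up to 0.625, so raising the cutoff to 0.625 moves exactly that bin to the ham side. The helper name is hypothetical; the counts are the ones quoted above.

```python
def raise_cutoff_past_bin(fp, fn, ham_in_bin, spam_in_bin):
    """Return the new (fp, fn) after the cutoff moves up past one histogram bin.

    Hams in the bin stop being flagged (fewer false positives);
    spams in the bin are now missed (more false negatives).
    """
    return fp - ham_in_bin, fn + spam_in_bin


# max_discriminators=1500 at cutoff 0.60: 8 fp, 10 fn; the 60.0 bins hold 2 hams, 7 spams.
fp, fn = raise_cutoff_past_bin(fp=8, fn=10, ham_in_bin=2, spam_in_bin=7)

# Baseline (left column of the listing above): 6 fp, 16 fn.
baseline_fp, baseline_fn = 6, 16
print("fp:", fp, "vs", baseline_fp)  # tie
print("fn:", fn, "vs", baseline_fn)  # loss by 1
```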
Ah, and here are the results for max_discriminators=15 (left == md=150, right == md=15):
false positive percentages
0.500 0.000 won -100.00%
0.000 0.000 tied
0.000 0.500 lost +(was 0)
0.000 1.000 lost +(was 0)
0.500 1.000 lost +100.00%
0.500 0.500 tied
0.000 0.500 lost +(was 0)
0.500 1.500 lost +200.00%
0.000 0.000 tied
1.000 1.000 tied
won 1 times
tied 4 times
lost 5 times
total unique fp went from 6 to 12 lost +100.00%
mean fp % went from 0.3 to 0.6 lost +100.00%
false negative percentages
0.500 0.000 won -100.00%
0.500 0.500 tied
0.500 0.500 tied
1.000 0.500 won -50.00%
1.000 0.000 won -100.00%
1.000 1.000 tied
1.000 0.500 won -50.00%
0.000 0.000 tied
0.500 0.500 tied
2.000 1.500 won -25.00%
won 5 times
tied 5 times
lost 0 times
total unique fn went from 16 to 10 won -37.50%
mean fn % went from 0.8 to 0.5 won -37.50%
The histograms look totally different here though!
-> <stat> Ham scores for all runs: 2000 items; mean 11.01; sample sdev 15.30
* = 21 items
0.00 1201 **********************************************************
2.50 53 ***
5.00 9 *
7.50 7 *
10.00 2 *
12.50 29 **
15.00 87 *****
17.50 100 *****
20.00 67 ****
22.50 48 ***
25.00 54 ***
27.50 54 ***
30.00 41 **
32.50 49 ***
35.00 46 ***
37.50 29 **
40.00 23 **
42.50 30 **
45.00 19 *
47.50 16 *
50.00 9 *
52.50 7 *
55.00 2 *
57.50 6 *
60.00 5 *
62.50 1 *
65.00 1 *
67.50 1 *
70.00 0
72.50 0
75.00 0
77.50 2 *
80.00 0
82.50 0
85.00 1 *
87.50 0
90.00 0
92.50 0
95.00 1 *
97.50 0
-> <stat> Spam scores for all runs: 2000 items; mean 95.00; sample sdev 7.86
* = 21 items
[...]
47.50 1 *
50.00 1 *
52.50 1 *
55.00 3 *
57.50 4 *
60.00 8 *
62.50 9 *
65.00 11 *
67.50 15 *
70.00 13 *
72.50 15 *
75.00 23 **
77.50 51 ***
80.00 65 ****
82.50 41 **
85.00 2 *
87.50 9 *
90.00 12 *
92.50 66 ****
95.00 425 *********************
97.50 1225 ***********************************************************
Note the hams scoring all the way up in the 90s. None of these are
spams! I think the highest scoring ham was Kim's message with diet
tips (also implicated in the Graham run) -- typical diet terms don't
occur much in my ham, but I guess spams for magic diets are common.
Kim sent this with almost no happy-talk; there were only 3 low-scoring
words amongst the 15.
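To illustrate why a small max_discriminators can hurt a message like this, here is a sketch of clue selection. It assumes the classifier keeps only the max_discriminators words whose spam probabilities lie farthest from 0.5; that selection rule is not spelled out in this message, and the function name, words, and probabilities below are all hypothetical.

```python
def strongest_clues(word_probs, max_discriminators):
    """Keep the words whose spam probability is farthest from the neutral 0.5."""
    ranked = sorted(word_probs.items(),
                    key=lambda item: abs(item[1] - 0.5),
                    reverse=True)
    return ranked[:max_discriminators]


# Hypothetical per-word spam probabilities for a ham about dieting.
word_probs = {
    "diet": 0.99,      # common in diet spam, rare in this ham corpus
    "calories": 0.97,
    "weight": 0.95,
    "python": 0.02,    # strong ham clue
    "lunch": 0.40,     # mild ham clue
    "the": 0.51,       # nearly neutral
}

few = strongest_clues(word_probs, 3)
many = strongest_clues(word_probs, 6)
print([word for word, _ in few])
print([word for word, _ in many])
```

With a low limit, the mild ham clues never get a vote, so a handful of extreme spam words can dominate the score of a legitimate message.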
Given this, I'm going to keep this parameter at 150 and vary other
things.
--Guido van Rossum (home page: http://www.python.org/~guido/)