# [Spambayes] Moving closer to Gary's ideal

Guido van Rossum guido@python.org
Sun, 22 Sep 2002 02:25:16 -0400

> You can tell for sure just by looking at the score histograms and counting
> the dots <wink>; there's no need to change spam_cutoff and then rerun the
> test (spam_cutoff has no effect on the scores computed); I've walked through
> that process in slow motion several times on the list now.

One thing isn't clear to me: does a dot at, say, 50.00 mean that there
are X items whose scores fall between 48.75 and 51.25 (the bucket
centered on its label), or between 47.50 and 50.00 (the label as the
bucket's upper edge)?
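To make the two readings (plus a third common one, label as lower edge) concrete, here is a small sketch; it is illustrative only and says nothing about what spambayes actually does:

```python
def bucket_label(score, width=2.5, convention="lower"):
    """Which histogram row a score lands in, under three possible
    labeling conventions (which one spambayes uses is exactly the
    question above -- this sketch just makes the choices concrete)."""
    if convention == "lower":    # label <= score < label + width
        return (score // width) * width
    if convention == "center":   # label - width/2 <= score < label + width/2
        return ((score + width / 2) // width) * width
    if convention == "upper":    # label - width < score <= label
        return -((-score) // width) * width
    raise ValueError(convention)
```

Under "center" the 50.00 row covers [48.75, 51.25); under "upper" it covers (47.50, 50.00].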

OK, here are my histograms (truncated):

```
Ham distribution for all in this training set:
* = 27 items
 5.00    1 *
 7.50    0
10.00    0
12.50    8 *
15.00   35 **
17.50   68 ***
20.00  172 *******
22.50  342 *************
25.00  729 ***************************
27.50 1305 *************************************************
30.00 1570 ***********************************************************
32.50 1466 *******************************************************
35.00 1104 *****************************************
37.50  634 ************************
40.00  347 *************
42.50  213 ********
45.00  126 *****
47.50   74 ***
50.00   43 **
52.50   14 *
55.00   10 *
57.50    8 *
60.00    3 *
62.50    1 *
65.00    1 *
67.50    0
70.00    0
72.50    0
75.00    2 *
```

```
Spam distribution for all in this training set:
* = 12 items
45.00   1 *
47.50   1 *
50.00   2 *
52.50   2 *
55.00   4 *
57.50  16 **
60.00  40 ****
62.50  62 ******
65.00 107 *********
67.50 241 *********************
70.00 452 **************************************
72.50 719 ************************************************************
75.00 717 ************************************************************
77.50 501 ******************************************
80.00 261 **********************
82.50  45 ****
85.00  13 **
87.50   7 *
```

What's the ideal cutoff here to compete with Graham?  The last 4
output lines from result.py for that set are:

```
total unique false pos 40
total unique false neg 204
average fp % 0.0480748377883
average fn % 0.634757552464
```

For my Robinson run with cutoff = 0.575, they are:

```
total unique false pos 101
total unique false neg 129
average fp % 0.121612480864
average fn % 0.401042537411
```
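The cutoff question can be answered mechanically from the bucket counts alone: for each candidate cutoff, the false positives are the ham buckets at or above it and the false negatives are the spam buckets below it. A sketch (counts transcribed from the histograms above; treating each bucket label as the lower edge of its 2.5-wide range is an assumption):

```python
# Bucket counts transcribed from the two histograms above
# (bucket label -> number of items); empty rows omitted.
ham = {5.0: 1, 12.5: 8, 15.0: 35, 17.5: 68, 20.0: 172, 22.5: 342,
       25.0: 729, 27.5: 1305, 30.0: 1570, 32.5: 1466, 35.0: 1104,
       37.5: 634, 40.0: 347, 42.5: 213, 45.0: 126, 47.5: 74,
       50.0: 43, 52.5: 14, 55.0: 10, 57.5: 8, 60.0: 3, 62.5: 1,
       65.0: 1, 75.0: 2}
spam = {45.0: 1, 47.5: 1, 50.0: 2, 52.5: 2, 55.0: 4, 57.5: 16,
        60.0: 40, 62.5: 62, 65.0: 107, 67.5: 241, 70.0: 452,
        72.5: 719, 75.0: 717, 77.5: 501, 80.0: 261, 82.5: 45,
        85.0: 13, 87.5: 7}

def errors_at(cutoff):
    """False positives and false negatives if everything scoring at
    or above `cutoff` is called spam (lower-edge bucket convention)."""
    fp = sum(c for b, c in ham.items() if b >= cutoff)
    fn = sum(c for b, c in spam.items() if b < cutoff)
    return fp, fn

# Sweep candidate cutoffs to see the fp/fn trade-off.
for cutoff in (50.0, 55.0, 57.5, 60.0, 65.0):
    fp, fn = errors_at(cutoff)
    print(f"cutoff {cutoff:5.2f}: {fp:4d} fp, {fn:4d} fn")
```

This only counts unique errors for the training set shown, so it is a rough guide rather than a substitute for rerunning the test.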

> An observed effect of setting robinson_minimum_prob_strength is to
> increase the separation of the ham and spam means: the ham mean gets
> lower and the spam mean gets higher.  This is what I expected,
> since, unlike as in Graham's scheme, scoring words with neutral
> probability in Gary's scheme drags a score closer to 0.5.  Now
> "drags" sounds pejorative, because that's the way I feel about it --
> I see no value in scoring neutral words at all in this task.  Gary
> disagrees, but allows that it's more of a "purist" issue than a
> pragmatic one.  However, something we agree 100% on is that
> measuring the effects of *principled* changes gets much harder if
> pragmatic hacks muddy the mathematical basis of a scheme.  If Gary's
> scheme proves to be as good as, but no better than, our current
> scheme, I'd still switch to it for this reason: it has far fewer
> "mystery knobs" to confuse the underlying issues.
>
> > (Hm, have you computed mean and standard deviation?)
>
> Nope.  What would you do with them if I did (they're easy enough to
> compute and display if there's a point to it)?  You can get an
> excellent feel for them by looking at the histograms (which reveal
> far more than a pair of (mean, sdev) numbers anyway).
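The minimum-strength filter Tim describes amounts to dropping words whose spam probability sits near the neutral 0.5 before combining. A minimal sketch of the idea behind robinson_minimum_prob_strength (illustrative only, not the spambayes implementation; `spamprobs` is a hypothetical word-to-probability map):

```python
def scoring_words(spamprobs, min_strength=0.0):
    """Keep only words whose spam probability is at least
    `min_strength` away from the neutral 0.5; the rest are
    excluded from scoring entirely."""
    return {w: p for w, p in spamprobs.items()
            if abs(p - 0.5) >= min_strength}

# Hypothetical example: with no minimum, near-neutral words like
# 'the' and 'meeting' drag the combined score toward 0.5; with a
# minimum of 0.1 they are simply ignored.
probs = {"viagra": 0.99, "the": 0.51, "python": 0.02, "meeting": 0.48}
```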

Well, I, for one, couldn't decide by staring at the two histograms
above which one to call "fatter".
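For what it's worth, a (mean, sdev) pair can be approximated straight from the bucket counts by treating every item as sitting at its bucket's midpoint, which would settle the "fatter" question numerically. An illustrative sketch (not spambayes code):

```python
import math

def hist_stats(buckets, width=2.5):
    """Approximate (mean, sdev) from histogram bucket counts.
    `buckets` maps each bucket's lower edge to its item count;
    every item is treated as sitting at the bucket midpoint."""
    n = sum(buckets.values())
    mids = {lo + width / 2: c for lo, c in buckets.items()}
    mean = sum(m * c for m, c in mids.items()) / n
    var = sum(c * (m - mean) ** 2 for m, c in mids.items()) / n
    return mean, math.sqrt(var)

# Feeding in the ham and spam bucket counts from the histograms
# above, the distribution with the larger sdev is the "fatter" one.
```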