# [Spambayes] Moving closer to Gary's ideal

Guido van Rossum guido@python.org
Sun, 22 Sep 2002 02:25:16 -0400

> You can tell for sure just by looking at the score histograms and counting
> the dots <wink>; there's no need to change spam_cutoff and then rerun the
> test (spam_cutoff has no effect on the scores computed); I've walked through
> that process in slow motion several times on the list now.

One thing isn't clear to me: does a dot at, say, 50.00 mean that there
are X items whose scores fall between 48.75 and 51.25 (the bucket
centered on its label), or between 47.50 and 50.00 (the label as the
bucket's upper edge)?
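To make the two readings (plus a third common one, label as lower edge) concrete, here is a small sketch; it is illustrative only and says nothing about what spambayes actually does:

```python
def bucket_label(score, width=2.5, convention="lower"):
    """Which histogram row a score lands in, under three possible
    labeling conventions (which one spambayes uses is exactly the
    question above -- this sketch just makes the choices concrete)."""
    if convention == "lower":    # label <= score < label + width
        return (score // width) * width
    if convention == "center":   # label - width/2 <= score < label + width/2
        return ((score + width / 2) // width) * width
    if convention == "upper":    # label - width < score <= label
        return -((-score) // width) * width
    raise ValueError(convention)
```

Under "center" the 50.00 row covers [48.75, 51.25); under "upper" it covers (47.50, 50.00].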

OK, here are my histograms (truncated):

```
Ham distribution for all in this training set:
* = 27 items
 5.00    1 *
 7.50    0
10.00    0
12.50    8 *
15.00   35 **
17.50   68 ***
20.00  172 *******
22.50  342 *************
25.00  729 ***************************
27.50 1305 *************************************************
30.00 1570 ***********************************************************
32.50 1466 *******************************************************
35.00 1104 *****************************************
37.50  634 ************************
40.00  347 *************
42.50  213 ********
45.00  126 *****
47.50   74 ***
50.00   43 **
52.50   14 *
55.00   10 *
57.50    8 *
60.00    3 *
62.50    1 *
65.00    1 *
67.50    0
70.00    0
72.50    0
75.00    2 *
```

```
Spam distribution for all in this training set:
* = 12 items
45.00   1 *
47.50   1 *
50.00   2 *
52.50   2 *
55.00   4 *
57.50  16 **
60.00  40 ****
62.50  62 ******
65.00 107 *********
67.50 241 *********************
70.00 452 **************************************
72.50 719 ************************************************************
75.00 717 ************************************************************
77.50 501 ******************************************
80.00 261 **********************
82.50  45 ****
85.00  13 **
87.50   7 *
```

What's the ideal cutoff here to compete with Graham?  The last 4
output lines from result.py for that set are:

```
total unique false pos 40
total unique false neg 204
average fp % 0.0480748377883
average fn % 0.634757552464
```

For my Robinson run with cutoff = 0.575, they are:

```
total unique false pos 101
total unique false neg 129
average fp % 0.121612480864
average fn % 0.401042537411
```
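The cutoff question can be answered mechanically from the bucket counts alone: for each candidate cutoff, the false positives are the ham buckets at or above it and the false negatives are the spam buckets below it. A sketch (counts transcribed from the histograms above; treating each bucket label as the lower edge of its 2.5-wide range is an assumption):

```python
# Bucket counts transcribed from the two histograms above
# (bucket label -> number of items); empty rows omitted.
ham = {5.0: 1, 12.5: 8, 15.0: 35, 17.5: 68, 20.0: 172, 22.5: 342,
       25.0: 729, 27.5: 1305, 30.0: 1570, 32.5: 1466, 35.0: 1104,
       37.5: 634, 40.0: 347, 42.5: 213, 45.0: 126, 47.5: 74,
       50.0: 43, 52.5: 14, 55.0: 10, 57.5: 8, 60.0: 3, 62.5: 1,
       65.0: 1, 75.0: 2}
spam = {45.0: 1, 47.5: 1, 50.0: 2, 52.5: 2, 55.0: 4, 57.5: 16,
        60.0: 40, 62.5: 62, 65.0: 107, 67.5: 241, 70.0: 452,
        72.5: 719, 75.0: 717, 77.5: 501, 80.0: 261, 82.5: 45,
        85.0: 13, 87.5: 7}

def errors_at(cutoff):
    """False positives and false negatives if everything scoring at
    or above `cutoff` is called spam (lower-edge bucket convention)."""
    fp = sum(c for b, c in ham.items() if b >= cutoff)
    fn = sum(c for b, c in spam.items() if b < cutoff)
    return fp, fn

# Sweep candidate cutoffs to see the fp/fn trade-off.
for cutoff in (50.0, 55.0, 57.5, 60.0, 65.0):
    fp, fn = errors_at(cutoff)
    print(f"cutoff {cutoff:5.2f}: {fp:4d} fp, {fn:4d} fn")
```

This only counts unique errors for the training set shown, so it is a rough guide rather than a substitute for rerunning the test.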

> An observed effect of setting robinson_minimum_prob_strength is to
> increase the separation of the ham and spam means: the ham mean gets
> lower and the spam mean gets higher.  This is what I expected,
> since, unlike as in Graham's scheme, scoring words with neutral
> probability in Gary's scheme drags a score closer to 0.5.  Now
> "drags" sounds pejorative, because that's the way I feel about it --
> I see no value in scoring neutral words at all in this task.  Gary
> disagrees, but allows that it's more of a "purist" issue than a
> pragmatic one.  However, something we agree 100% on is that
> measuring the effects of *principled* changes gets much harder if
> pragmatic hacks muddy the mathematical basis of a scheme.  If Gary's
> scheme proves to be as good as, but no better than, our current
> scheme, I'd still switch to it for this reason: it has far fewer
> "mystery knobs" to confuse the underlying issues.
>
> > (Hm, have you computed mean and standard deviation?)
>
> Nope.  What would you do with them if I did (they're easy enough to
> compute and display if there's a point to it)?  You can get an
> excellent feel for them by looking at the histograms (which reveal
> far more than a pair of (mean, sdev) numbers anyway).
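The minimum-strength filter Tim describes amounts to dropping words whose spam probability sits near the neutral 0.5 before combining. A minimal sketch of the idea behind robinson_minimum_prob_strength (illustrative only, not the spambayes implementation; `spamprobs` is a hypothetical word-to-probability map):

```python
def scoring_words(spamprobs, min_strength=0.0):
    """Keep only words whose spam probability is at least
    `min_strength` away from the neutral 0.5; the rest are
    excluded from scoring entirely."""
    return {w: p for w, p in spamprobs.items()
            if abs(p - 0.5) >= min_strength}

# Hypothetical example: with no minimum, near-neutral words like
# 'the' and 'meeting' drag the combined score toward 0.5; with a
# minimum of 0.1 they are simply ignored.
probs = {"viagra": 0.99, "the": 0.51, "python": 0.02, "meeting": 0.48}
```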

Well, I, for one, couldn't decide by staring at the two histograms
above which one to call "fatter".
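For what it's worth, a (mean, sdev) pair can be approximated straight from the bucket counts by treating every item as sitting at its bucket's midpoint, which would settle the "fatter" question numerically. An illustrative sketch (not spambayes code):

```python
import math

def hist_stats(buckets, width=2.5):
    """Approximate (mean, sdev) from histogram bucket counts.
    `buckets` maps each bucket's lower edge to its item count;
    every item is treated as sitting at the bucket midpoint."""
    n = sum(buckets.values())
    mids = {lo + width / 2: c for lo, c in buckets.items()}
    mean = sum(m * c for m, c in mids.items()) / n
    var = sum(c * (m - mean) ** 2 for m, c in mids.items()) / n
    return mean, math.sqrt(var)

# Feeding in the ham and spam bucket counts from the histograms
# above, the distribution with the larger sdev is the "fatter" one.
```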