[Spambayes] Moving closer to Gary's ideal

Guido van Rossum guido@python.org
Sun, 22 Sep 2002 02:25:16 -0400


> You can tell for sure just by looking at the score histograms and counting
> the dots <wink>; there's no need to change spam_cutoff and then rerun the
> test (spam_cutoff has no effect on the scores computed); I've walked through
> that process in slow motion several times on the list now.

One thing isn't clear to me.  Does a dot at, say, 50.00 mean that
there are X items whose scores fall between 48.75 and 51.25, or that
they fall between 47.50 and 50.00?
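
In code, the two readings I have in mind (illustrative only; I
haven't checked what the test driver's histogram class actually does):

    import math

    def bucket_centered(score, width=2.5):
        # reading 1: the row labeled 50.00 covers [48.75, 51.25)
        return math.floor(score / width + 0.5) * width

    def bucket_upper_edge(score, width=2.5):
        # reading 2: the row labeled 50.00 covers (47.50, 50.00]
        return math.ceil(score / width) * width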

OK, here are my histograms (truncated):

Ham distribution for all in this training set:
* = 27 items
  5.00    1 *
  7.50    0 
 10.00    0 
 12.50    8 *
 15.00   35 **
 17.50   68 ***
 20.00  172 *******
 22.50  342 *************
 25.00  729 ***************************
 27.50 1305 *************************************************
 30.00 1570 ***********************************************************
 32.50 1466 *******************************************************
 35.00 1104 *****************************************
 37.50  634 ************************
 40.00  347 *************
 42.50  213 ********
 45.00  126 *****
 47.50   74 ***
 50.00   43 **
 52.50   14 *
 55.00   10 *
 57.50    8 *
 60.00    3 *
 62.50    1 *
 65.00    1 *
 67.50    0 
 70.00    0 
 72.50    0 
 75.00    2 *

Spam distribution for all in this training set:
* = 12 items
 45.00   1 *
 47.50   1 *
 50.00   2 *
 52.50   2 *
 55.00   4 *
 57.50  16 **
 60.00  40 ****
 62.50  62 ******
 65.00 107 *********
 67.50 241 *********************
 70.00 452 **************************************
 72.50 719 ************************************************************
 75.00 717 ************************************************************
 77.50 501 ******************************************
 80.00 261 **********************
 82.50  45 ****
 85.00  13 **
 87.50   7 *

What's the ideal cutoff here to compete with Graham?  The last 4
output lines from result.py for that set are:

total unique false pos 40
total unique false neg 204
average fp % 0.0480748377883
average fn % 0.634757552464

For my Robinson run with cutoff = 0.575, they are:

total unique false pos 101
total unique false neg 129
average fp % 0.121612480864
average fn % 0.401042537411
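
Back to the ideal-cutoff question: here's how I'd count the dots from
the (truncated) histograms above.  A rough sketch; it takes each row's
label as the score for everything in that row, so the boundary
question above still applies, and the totals won't match result.py's
cross-run "unique" counts:

    ham = {
        5.0: 1, 7.5: 0, 10.0: 0, 12.5: 8, 15.0: 35, 17.5: 68, 20.0: 172,
        22.5: 342, 25.0: 729, 27.5: 1305, 30.0: 1570, 32.5: 1466,
        35.0: 1104, 37.5: 634, 40.0: 347, 42.5: 213, 45.0: 126, 47.5: 74,
        50.0: 43, 52.5: 14, 55.0: 10, 57.5: 8, 60.0: 3, 62.5: 1, 65.0: 1,
        67.5: 0, 70.0: 0, 72.5: 0, 75.0: 2,
    }
    spam = {
        45.0: 1, 47.5: 1, 50.0: 2, 52.5: 2, 55.0: 4, 57.5: 16, 60.0: 40,
        62.5: 62, 65.0: 107, 67.5: 241, 70.0: 452, 72.5: 719, 75.0: 717,
        77.5: 501, 80.0: 261, 82.5: 45, 85.0: 13, 87.5: 7,
    }

    def errors(cutoff):
        # classify label >= cutoff as spam: ham rows there are false
        # positives, spam rows below it are false negatives
        fp = sum(n for label, n in ham.items() if label >= cutoff)
        fn = sum(n for label, n in spam.items() if label < cutoff)
        return fp, fn

    for c in (50.0, 55.0, 60.0, 65.0, 70.0):
        fp, fn = errors(c)
        print(f"cutoff {c:4.1f}: {fp:3d} fp, {fn:3d} fn")

On these counts a cutoff around 60 looks best if false positives are
weighted heavily (7 fp, 26 fn), for whatever a pooled, truncated
histogram is worth.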

> An observed effect of setting robinson_minimum_prob_strength is to
> increase the separation of the ham and spam means: the ham mean gets
> lower and the spam mean gets higher.  This is what I expected,
> since, unlike in Graham's scheme, scoring words with neutral
> probability in Gary's scheme drags a score closer to 0.5.  Now
> "drags" sounds pejorative, because that's the way I feel about it --
> I see no value in scoring neutral words at all in this task.  Gary
> disagrees, but allows that it's more of a "purist" issue than a
> pragmatic one.  However, something we agree 100% on is that
> measuring the effects of *principled* changes gets much harder if
> pragmatic hacks muddy the mathematical basis of a scheme.  If Gary's
> scheme proves to be as good as, but no better than, our current
> scheme, I'd still switch to it for this reason: it has far fewer
> "mystery knobs" to confuse the underlying issues.
> 
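
To make "drags" concrete, here's a toy sketch of Gary-combining as I
understand it from this list (P and Q as geometric means, then
S = (P-Q)/(P+Q) rescaled; my reading, not necessarily the classifier's
actual code):

    from math import prod  # math.prod needs a modern Python

    def gary_score(probs):
        n = len(probs)
        # P: evidence for spam; Q: evidence for ham (geometric means)
        P = 1.0 - prod(1.0 - p for p in probs) ** (1.0 / n)
        Q = 1.0 - prod(probs) ** (1.0 / n)
        S = (P - Q) / (P + Q)
        return (1.0 + S) / 2.0  # rescale S from [-1, 1] to [0, 1]

    print(gary_score([0.99, 0.99, 0.99]))                 # ~0.99
    print(gary_score([0.99, 0.99, 0.99, 0.5, 0.5, 0.5]))  # ~0.76

Three strong spam clues score ~0.99 on their own; add three perfectly
neutral words and the same message drops to ~0.76.  Graham's scheme,
which looks only at the most extreme words, would typically never
score the neutral ones.
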
> > (Hm, have you computed mean and standard deviation?)
> 
> Nope.  What would you do with them if I did (they're easy enough to
> compute and display if there's a point to it)?  You can get an
> excellent feel for them by looking at the histograms (which reveal
> far more than a pair of (mean, sdev) numbers anyway).

Well, I for one couldn't decide which one to call "fatter" just by
staring at the two histograms above.
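
Concretely, mean and sdev are trivial to pull out of the buckets
(reusing the ham/spam dicts from my cutoff sketch above, with the same
caveat that row labels stand in for exact scores):

    def mean_sdev(hist):
        # population mean and sdev, each row's label taken as its score
        total = sum(hist.values())
        mean = sum(label * n for label, n in hist.items()) / total
        var = sum(n * (label - mean) ** 2
                  for label, n in hist.items()) / total
        return mean, var ** 0.5

    for name, hist in (("ham", ham), ("spam", spam)):
        m, s = mean_sdev(hist)
        print(f"{name:4s}: mean {m:.2f}, sdev {s:.2f}")

That would at least put a number on "fatter".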

--Guido van Rossum (home page: http://www.python.org/~guido/)