[Spambayes] Moving closer to Gary's ideal

Sun, 22 Sep 2002 03:45:50 -0400

I asked:
> > What's the ideal cutoff here to compete with Graham?
> >
> > The last 4 output lines from result.py for that set are:
> >
> > total unique false pos 40
> > total unique false neg 204
> > average fp % 0.0480748377883
> > average fn % 0.634757552464
> >
> > For my Robinson run with cutoff = 0.575, they are:
> >
> > total unique false pos 101
> > total unique false neg 129
> > average fp % 0.121612480864
> > average fn % 0.401042537411

but I forgot to include the "full set" histograms.  Here's the ham:

Ham distribution for all runs:
* = 268 items
  5.00     2 *
  7.50     0 
 10.00     4 *
 12.50    64 *
 15.00   262 *
 17.50   748 ***
 20.00  1734 *******
 22.50  3665 **************
 25.00  7662 *****************************
 27.50 12597 ************************************************
 30.00 16039 ************************************************************
 32.50 14821 ********************************************************
 35.00 10542 ****************************************
 37.50  6441 *************************
 40.00  3593 **************
 42.50  2073 ********
 45.00  1177 *****
 47.50   717 ***
 50.00   421 **
 52.50   197 *
 55.00    96 *
 57.50    60 *
 60.00    16 *
 62.50    10 *
 65.00     3 *
 67.50     1 *
 70.00     1 *
 72.50     2 *
 75.00     8 *

So to match the 40 fp's from Graham's scheme, I'd need to set the
cutoff to 0.60; that would give me 41 fp's here (16+10+3+1+1+2+8).
(If a message scores *exactly* the cutoff, is it spam or ham?)
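
The bucket arithmetic above can be sketched in Python.  This is only an
illustration under two assumptions that the message does not confirm:
each histogram label is taken to be the lower edge of its bucket, and a
score exactly equal to the cutoff is counted as spam (the open question
just asked).  The names HAM, false_positives and closest_cutoff are
made up for this example and are not part of the Spambayes code.

```python
# Ham histogram copied from the message: (bucket label in percent, count).
# Assumption: the label is the lower edge of its bucket.
HAM = [
    (5.0, 2), (7.5, 0), (10.0, 4), (12.5, 64), (15.0, 262), (17.5, 748),
    (20.0, 1734), (22.5, 3665), (25.0, 7662), (27.5, 12597), (30.0, 16039),
    (32.5, 14821), (35.0, 10542), (37.5, 6441), (40.0, 3593), (42.5, 2073),
    (45.0, 1177), (47.5, 717), (50.0, 421), (52.5, 197), (55.0, 96),
    (57.5, 60), (60.0, 16), (62.5, 10), (65.0, 3), (67.5, 1), (70.0, 1),
    (72.5, 2), (75.0, 8),
]


def false_positives(buckets, cutoff):
    """Ham that would be called spam: all buckets at or above the cutoff.

    Assumption: a score exactly equal to the cutoff is treated as spam.
    """
    return sum(count for edge, count in buckets if edge >= cutoff)


def closest_cutoff(buckets, target_fp):
    """Bucket edge whose false-positive count is nearest to target_fp."""
    return min(
        (edge for edge, _ in buckets),
        key=lambda edge: abs(false_positives(buckets, edge) - target_fp),
    )


if __name__ == "__main__":
    cutoff = closest_cutoff(HAM, 40)
    print(cutoff, false_positives(HAM, cutoff))
```

With a target of 40 fp's this picks the 60.00 bucket edge and reports 41
fp's, the same sum as above.  The histogram only resolves cutoffs at
bucket edges, so nothing finer than 2.5 points can be read from it.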

Spam distribution for all runs:
* = 124 items
 37.50    1 *
 40.00    0 
 42.50    0 
 45.00    4 *
 47.50   12 *
 50.00   22 *
 52.50   20 *
 55.00   70 *
 57.50  153 **
 60.00  346 ***
 62.50  664 ******
 65.00 1334 ***********
 67.50 2503 *********************
 70.00 4303 ***********************************
 72.50 7136 **********************************************************
 75.00 7385 ************************************************************
 77.50 4918 ****************************************
 80.00 2481 *********************
 82.50  513 *****
 85.00  116 *
 87.50   73 *
 90.00   14 *

So with a cutoff of 0.60, this would give me 1+4+12+22+20+70+153 = 282
fn's.  That's still considerably worse than Graham's 204.
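
The same tally for the spam side, again as a hedged sketch: bucket
labels are assumed to be lower edges, and a spam scoring exactly the
cutoff is assumed to be caught rather than missed.  SPAM and
false_negatives are names invented for the example.

```python
# Spam histogram copied from the message: (bucket label in percent, count).
# Assumption: the label is the lower edge of its bucket.
SPAM = [
    (37.5, 1), (40.0, 0), (42.5, 0), (45.0, 4), (47.5, 12), (50.0, 22),
    (52.5, 20), (55.0, 70), (57.5, 153), (60.0, 346), (62.5, 664),
    (65.0, 1334), (67.5, 2503), (70.0, 4303), (72.5, 7136), (75.0, 7385),
    (77.5, 4918), (80.0, 2481), (82.5, 513), (85.0, 116), (87.5, 73),
    (90.0, 14),
]


def false_negatives(buckets, cutoff):
    """Spam that would be called ham: all buckets strictly below the cutoff."""
    return sum(count for edge, count in buckets if edge < cutoff)


if __name__ == "__main__":
    # Compare the cutoff used in the Robinson run with the proposed 0.60.
    for cutoff in (57.5, 60.0):
        print(cutoff, false_negatives(SPAM, cutoff))
```

At 57.5 this gives 129, which agrees with the fn count quoted for the
cutoff = 0.575 run; at 60.0 it gives the 282 worked out above.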

I'm going to have to look at the fp's and fn's to see if there are
real spams hiding in the ham, and vice versa.  I did notice that many
fp's were very spammish automated postings that I have specifically
signed up for, like our building's announcements, product newsletters,
and so on.  I haven't looked at the fn's.

> > Well, I for one, couldn't decide by staring at the two histograms
> > above which one to call "fatter".
> 
> It was your ham, but again I ask what use you would have for the
> mean and sdev if I bothered to compute and display them?

I was interested in which of the two bell curves was "fatter".  The
sdev tells me this.  But I'm not sure that the relative fatness of the
curves is a good measure -- what matters is how much the tails
overlap.  I suppose there's a statistical measure for how "normal" a
tail is, but I'm not sure that's relevant given that we can easily see
the overlap in the histograms.
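
For the "fatter" question, a mean and sdev can be approximated from the
binned counts alone.  The sketch below is not how the test driver would
compute them (it has the raw scores); it assumes each label is the
lower edge of a 2.5-point bucket and places every message at its
bucket's midpoint, so the results are only approximate.  The name
binned_mean_sdev is hypothetical.

```python
import math

WIDTH = 2.5  # bucket width in percentage points, read off the labels

# (bucket label in percent, count), copied from the two histograms.
HAM = [
    (5.0, 2), (7.5, 0), (10.0, 4), (12.5, 64), (15.0, 262), (17.5, 748),
    (20.0, 1734), (22.5, 3665), (25.0, 7662), (27.5, 12597), (30.0, 16039),
    (32.5, 14821), (35.0, 10542), (37.5, 6441), (40.0, 3593), (42.5, 2073),
    (45.0, 1177), (47.5, 717), (50.0, 421), (52.5, 197), (55.0, 96),
    (57.5, 60), (60.0, 16), (62.5, 10), (65.0, 3), (67.5, 1), (70.0, 1),
    (72.5, 2), (75.0, 8),
]
SPAM = [
    (37.5, 1), (40.0, 0), (42.5, 0), (45.0, 4), (47.5, 12), (50.0, 22),
    (52.5, 20), (55.0, 70), (57.5, 153), (60.0, 346), (62.5, 664),
    (65.0, 1334), (67.5, 2503), (70.0, 4303), (72.5, 7136), (75.0, 7385),
    (77.5, 4918), (80.0, 2481), (82.5, 513), (85.0, 116), (87.5, 73),
    (90.0, 14),
]


def binned_mean_sdev(buckets, width=WIDTH):
    """Approximate (mean, population sdev) from binned counts.

    Assumption: every item sits at the midpoint of its bucket.
    """
    total = sum(count for _, count in buckets)
    mids = [(edge + width / 2.0, count) for edge, count in buckets]
    mean = sum(mid * count for mid, count in mids) / total
    var = sum(count * (mid - mean) ** 2 for mid, count in mids) / total
    return mean, math.sqrt(var)


if __name__ == "__main__":
    for name, buckets in (("ham", HAM), ("spam", SPAM)):
        mean, sdev = binned_mean_sdev(buckets)
        print("%-4s mean %.2f  sdev %.2f" % (name, mean, sdev))
```

The larger sdev marks the fatter curve, but as noted above it says
nothing directly about how far the two tails overlap.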

--Guido van Rossum (home page: http://www.python.org/~guido/)