[Spambayes] There Can Be Only One

Guido van Rossum guido@python.org
Tue, 24 Sep 2002 22:50:44 -0400


> Wow!  0.625 is the largest reported "best value" to date (Neil and I
> both needed 0.6 once; the suggested 0.55 was optimal for my run --
> which is why I suggested it <wink>).

0.6 is the *lowest* that works for me -- any lower and Graham
wins on both fp's and fn's.

> > (after looking at the histograms of a trial run).
> >
> > Net result:
> >
> > Graham: 8 fp's, 13 fn's.
> > Gary:   7 fp's, 13 fn's.
> >
> > I ran the same two tests with a different random number:
> >
> > Graham: 8 fp's, 11 fn's.
> > Gary:   6 fp's, 16 fn's.
> >
> > According to the histogram, Gary would have given 8 fp's and 10 fn's
> > with a cutoff of 0.6, again beating Graham with the smallest margin.
> 
> Have you used cmp.py?  It generates side-by-side listings comparing
> two runs, and it's extremely important to know how often each scheme
> beat the other; the summary numbers reveal nothing about that.  For
> this particular test, the tail end of the cmp.py output (with the
> changes in ham and spam score means and sdevs) is useless.  When
> tweaking parameters for a single scheme, though, all the cmp.py
> output is dripping with valuable clues.

Ok.  For the first pair:

false positive percentages
    0.500  1.000  lost  +100.00%
    0.000  0.500  lost  +(was 0)
    0.500  0.500  tied          
    1.000  0.500  won    -50.00%
    1.000  0.500  won    -50.00%
    0.000  0.000  tied          
    0.500  0.000  won   -100.00%
    0.000  0.000  tied          
    0.500  0.500  tied          
    0.000  0.000  tied          

won   3 times
tied  5 times
lost  2 times

total unique fp went from 8 to 7 won    -12.50%
mean fp % went from 0.4 to 0.35 won    -12.50%

false negative percentages
    1.000  0.000  won   -100.00%
    1.500  1.500  tied          
    0.500  0.500  tied          
    0.500  0.000  won   -100.00%
    0.500  0.000  won   -100.00%
    0.500  0.500  tied          
    1.000  1.000  tied          
    0.500  1.000  lost  +100.00%
    0.500  2.000  lost  +300.00%
    0.000  0.000  tied          

won   3 times
tied  5 times
lost  2 times

total unique fn went from 13 to 13 tied          
mean fn % went from 0.65 to 0.65 tied          

For the second pair:

false positive percentages
    0.500  0.500  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.500  0.000  won   -100.00%
    0.500  0.500  tied          
    0.500  0.500  tied          
    0.000  0.000  tied          
    0.500  0.500  tied          
    0.000  0.000  tied          
    1.500  1.000  won    -33.33%

won   2 times
tied  8 times
lost  0 times

total unique fp went from 8 to 6 won    -25.00%
mean fp % went from 0.4 to 0.3 won    -25.00%

false negative percentages
    0.500  0.500  tied          
    1.000  0.500  won    -50.00%
    0.000  0.500  lost  +(was 0)
    1.000  1.000  tied          
    0.000  1.000  lost  +(was 0)
    1.000  1.000  tied          
    0.500  1.000  lost  +100.00%
    0.000  0.000  tied          
    0.500  0.500  tied          
    1.000  2.000  lost  +100.00%

won   1 times
tied  5 times
lost  4 times

total unique fn went from 11 to 16 lost   +45.45%
mean fn % went from 0.55 to 0.8 lost   +45.45%
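
For anyone reading these listings for the first time, here is a minimal
sketch of how the won/tied/lost tallies and the percentage changes can be
reproduced.  This is not cmp.py's actual code; the function names are made
up for illustration, and it assumes the left column is Graham, the right
column is Gary, and lower error rates are better.

```python
def compare(before, after):
    """Tally per-run outcomes; a lower error rate in `after` counts as a win."""
    tally = {"won": 0, "tied": 0, "lost": 0}
    for b, a in zip(before, after):
        if a < b:
            tally["won"] += 1
        elif a > b:
            tally["lost"] += 1
        else:
            tally["tied"] += 1
    return tally


def percent_change(before, after):
    """Relative change in percent; None where the listing prints '(was 0)'."""
    if before == 0:
        return None
    return (after - before) * 100.0 / before


# False positive percentages of the first pair, copied from the listing above.
graham_fp = [0.5, 0.0, 0.5, 1.0, 1.0, 0.0, 0.5, 0.0, 0.5, 0.0]
gary_fp = [1.0, 0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0]

result = compare(graham_fp, gary_fp)
total_change = percent_change(8, 7)  # total unique fp went from 8 to 7
```

With the first pair's fp columns this gives 3 wins, 5 ties and 2 losses,
and a change of -12.5% in total unique fp, matching the summary lines.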

> > I found one more spam in my ham (but didn't remove it between these
> > runs).  I also found 10 empty messages in Bruce Guenter's spam
> > archives!  (1 in 2002/01, 2 in 2002/05, 7 in 2002/06.)  Tim, I presume
> > you cleaned these out long ago?
> 
> No, I've left empty messages in both my corpora.  Although I'm unclear on
> what you mean by "empty".  I mean they have no body, but do have message
> headers.  Sometimes a c.l.py ham consists solely of a question in the
> Subject line!

No, I have 10 files that have length 0 (i.e. no headers and no body)
in BruceG's original bz2 files.  I checked, and the tar listing has
these too.

Here are the results for my first Gary variation: changing
max_discriminators to 1500, while keeping the cutoff at 0.60:

false positive percentages
    0.500  0.500  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.500  0.500  tied          
    0.500  0.500  tied          
    0.000  0.500  lost  +(was 0)
    0.500  0.500  tied          
    0.000  0.000  tied          
    1.000  1.500  lost   +50.00%

won   0 times
tied  8 times
lost  2 times

total unique fp went from 6 to 8 lost   +33.33%
mean fp % went from 0.3 to 0.4 lost   +33.33%

false negative percentages
    0.500  0.000  won   -100.00%
    0.500  0.500  tied          
    0.500  0.500  tied          
    1.000  0.500  won    -50.00%
    1.000  0.000  won   -100.00%
    1.000  1.000  tied          
    1.000  0.500  won    -50.00%
    0.000  0.000  tied          
    0.500  0.500  tied          
    2.000  1.500  won    -25.00%

won   5 times
tied  5 times
lost  0 times

total unique fn went from 16 to 10 won    -37.50%
mean fn % went from 0.8 to 0.5 won    -37.50%

The 60.0 bins in the histogram have 2 hams and 7 spams, so moving the
cutoff to 0.625 would have made it a tie for fps and a loss by 1 for
fns.
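
The arithmetic behind that estimate, as a small sketch.  It assumes the
histogram bins are 2.5 wide, so that raising the cutoff from 0.60 to 0.625
reclassifies exactly the items in the 60.0 bins; shift_cutoff is a
hypothetical helper, not part of the test tools.

```python
def shift_cutoff(fp, fn, hams_in_bin, spams_in_bin):
    """Raise the cutoff past one histogram bin.

    Hams in that bin stop being false positives; spams in that bin
    become false negatives.
    """
    return fp - hams_in_bin, fn + spams_in_bin


# max_discriminators=1500 at cutoff 0.60 gave 8 fp and 10 fn;
# the 60.0 bins hold 2 hams and 7 spams.
new_fp, new_fn = shift_cutoff(fp=8, fn=10, hams_in_bin=2, spams_in_bin=7)

# Baseline run being compared against: 6 fp and 16 fn.
fp_margin = new_fp - 6
fn_margin = new_fn - 16
```

Here fp_margin comes out as 0 (a tie) and fn_margin as 1 (a loss by one).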

Ah, and here are the results for max_discriminators (md) set to 15
(left == md=150, right == md=15):

false positive percentages
    0.500  0.000  won   -100.00%
    0.000  0.000  tied          
    0.000  0.500  lost  +(was 0)
    0.000  1.000  lost  +(was 0)
    0.500  1.000  lost  +100.00%
    0.500  0.500  tied          
    0.000  0.500  lost  +(was 0)
    0.500  1.500  lost  +200.00%
    0.000  0.000  tied          
    1.000  1.000  tied          

won   1 times
tied  4 times
lost  5 times

total unique fp went from 6 to 12 lost  +100.00%
mean fp % went from 0.3 to 0.6 lost  +100.00%

false negative percentages
    0.500  0.000  won   -100.00%
    0.500  0.500  tied          
    0.500  0.500  tied          
    1.000  0.500  won    -50.00%
    1.000  0.000  won   -100.00%
    1.000  1.000  tied          
    1.000  0.500  won    -50.00%
    0.000  0.000  tied          
    0.500  0.500  tied          
    2.000  1.500  won    -25.00%

won   5 times
tied  5 times
lost  0 times

total unique fn went from 16 to 10 won    -37.50%
mean fn % went from 0.8 to 0.5 won    -37.50%

The histograms look totally different here though!

-> <stat> Ham scores for all runs: 2000 items; mean 11.01; sample sdev 15.30
* = 21 items
  0.00 1201 **********************************************************
  2.50   53 ***
  5.00    9 *
  7.50    7 *
 10.00    2 *
 12.50   29 **
 15.00   87 *****
 17.50  100 *****
 20.00   67 ****
 22.50   48 ***
 25.00   54 ***
 27.50   54 ***
 30.00   41 **
 32.50   49 ***
 35.00   46 ***
 37.50   29 **
 40.00   23 **
 42.50   30 **
 45.00   19 *
 47.50   16 *
 50.00    9 *
 52.50    7 *
 55.00    2 *
 57.50    6 *
 60.00    5 *
 62.50    1 *
 65.00    1 *
 67.50    1 *
 70.00    0 
 72.50    0 
 75.00    0 
 77.50    2 *
 80.00    0 
 82.50    0 
 85.00    1 *
 87.50    0 
 90.00    0 
 92.50    0 
 95.00    1 *
 97.50    0 

-> <stat> Spam scores for all runs: 2000 items; mean 95.00; sample sdev 7.86
* = 21 items
[...]
 47.50    1 *
 50.00    1 *
 52.50    1 *
 55.00    3 *
 57.50    4 *
 60.00    8 *
 62.50    9 *
 65.00   11 *
 67.50   15 *
 70.00   13 *
 72.50   15 *
 75.00   23 **
 77.50   51 ***
 80.00   65 ****
 82.50   41 **
 85.00    2 *
 87.50    9 *
 90.00   12 *
 92.50   66 ****
 95.00  425 *********************
 97.50 1225 ***********************************************************

Note the hams scoring all the way up in the 90s.  None of these are
spams hiding in my ham!  I think the highest-scoring ham was Kim's message with diet
tips (also implicated in the Graham run) -- typical diet terms don't
occur much in my ham, but I guess spams for magic diets are common.
Kim sent this with almost no happy-talk; there were only 3 low-scoring
words amongst the 15.
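
As a cross-check, the fp count can be read straight off the tail of the ham
histogram.  A minimal sketch, assuming the cutoff is still 0.60 (60.0 on the
histogram's scale) and that each bin is labelled by its lower edge; the
helper name is made up for illustration.

```python
# Ham histogram bins from 60.00 upward, copied from the listing above
# (lower bin edge -> number of hams).
ham_bins = {
    60.0: 5, 62.5: 1, 65.0: 1, 67.5: 1,
    70.0: 0, 72.5: 0, 75.0: 0, 77.5: 2,
    80.0: 0, 82.5: 0, 85.0: 1, 87.5: 0,
    90.0: 0, 92.5: 0, 95.0: 1, 97.5: 0,
}


def count_at_or_above(bins, cutoff):
    """Number of items whose bin starts at or above the cutoff."""
    return sum(n for lower_edge, n in bins.items() if lower_edge >= cutoff)


false_positives = count_at_or_above(ham_bins, 60.0)
hams_in_the_90s = count_at_or_above(ham_bins, 90.0)
```

The tail from 60.0 up sums to 12 hams, which agrees with the "total unique
fp went from 6 to 12" line for md=15.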

Given this, I'm going to keep this parameter at 150 and vary other
things.

--Guido van Rossum (home page: http://www.python.org/~guido/)