[Spambayes] There Can Be Only One

Tim Peters tim.one@comcast.net
Wed, 25 Sep 2002 22:01:10 -0400


[Guido, tries Gary's scheme with max_discriminators at 150, and then at 15]
> ...
> Ah, and here are the results for md=15 (left == md=150, right == md=15):
>
> false positive percentages
>     0.500  0.000  won   -100.00%
>     0.000  0.000  tied
>     0.000  0.500  lost  +(was 0)
>     0.000  1.000  lost  +(was 0)
>     0.500  1.000  lost  +100.00%
>     0.500  0.500  tied
>     0.000  0.500  lost  +(was 0)
>     0.500  1.500  lost  +200.00%
>     0.000  0.000  tied
>     1.000  1.000  tied
>
> won   1 times
> tied  4 times
> lost  5 times
>
> total unique fp went from 6 to 12 lost  +100.00%
> mean fp % went from 0.3 to 0.6 lost  +100.00%
>
> false negative percentages
>     0.500  0.000  won   -100.00%
>     0.500  0.500  tied
>     0.500  0.500  tied
>     1.000  0.500  won    -50.00%
>     1.000  0.000  won   -100.00%
>     1.000  1.000  tied
>     1.000  0.500  won    -50.00%
>     0.000  0.000  tied
>     0.500  0.500  tied
>     2.000  1.500  won    -25.00%
>
> won   5 times
> tied  5 times
> lost  0 times
>
> total unique fn went from 16 to 10 won    -37.50%
> mean fn % went from 0.8 to 0.5 won    -37.50%
>
> The histograms look totally different here though!

That part isn't surprising -- this is making it look at only two handfuls of
*the* most extreme words in a msg, just like the Graham scheme does.  "Only
extremes in, only extremes out" applies here too, although not as viciously
under Gary's combining scheme as under Graham's.
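
To make "only extremes in, only extremes out" concrete, here is a minimal
sketch, not the actual Spambayes code. It keeps the N word probabilities
farthest from 0.5 and then combines them with the product rule from Graham's
"A Plan for Spam". The function names and the sample probabilities are
made up for illustration; Gary's combining scheme is not shown.

```python
def most_extreme(probs, max_discriminators):
    """Keep the max_discriminators probabilities farthest from 0.5."""
    ranked = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)
    return ranked[:max_discriminators]


def graham_combine(probs):
    """Graham-style combining: prod(p) / (prod(p) + prod(1 - p))."""
    prod_p = 1.0
    prod_q = 1.0
    for p in probs:
        prod_p *= p
        prod_q *= 1.0 - p
    return prod_p / (prod_p + prod_q)


# Hypothetical per-word spam probabilities for one message.
word_probs = [0.99, 0.02, 0.97, 0.60, 0.55, 0.45, 0.50, 0.40, 0.52, 0.95]

extremes = most_extreme(word_probs, 3)   # the bland words are dropped
score = graham_combine(extremes)         # lands close to 0 or 1
print(extremes, score)
```

With a small cutoff the bland words near 0.5 never get a vote, so the
combined score is pushed toward one end or the other.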

> -> <stat> Ham scores for all runs: 2000 items; mean 11.01; sample sdev 15.30
> * = 21 items
>   0.00 1201 **********************************************************
>   2.50   53 ***
>   5.00    9 *
>   7.50    7 *
>  10.00    2 *
>  12.50   29 **
>  15.00   87 *****
>  17.50  100 *****
>  20.00   67 ****
>  22.50   48 ***
>  25.00   54 ***
>  27.50   54 ***
>  30.00   41 **
>  32.50   49 ***
>  35.00   46 ***
>  37.50   29 **
>  40.00   23 **
>  42.50   30 **
>  45.00   19 *
>  47.50   16 *
>  50.00    9 *
>  52.50    7 *
>  55.00    2 *
>  57.50    6 *
>  60.00    5 *
>  62.50    1 *
>  65.00    1 *
>  67.50    1 *
>  70.00    0
>  72.50    0
>  75.00    0
>  77.50    2 *
>  80.00    0
>  82.50    0
>  85.00    1 *
>  87.50    0
>  90.00    0
>  92.50    0
>  95.00    1 *
>  97.50    0
>
> -> <stat> Spam scores for all runs: 2000 items; mean 95.00; sample sdev 7.86
> * = 21 items
> [...]
>  47.50    1 *
>  50.00    1 *
>  52.50    1 *
>  55.00    3 *
>  57.50    4 *
>  60.00    8 *
>  62.50    9 *
>  65.00   11 *
>  67.50   15 *
>  70.00   13 *
>  72.50   15 *
>  75.00   23 **
>  77.50   51 ***
>  80.00   65 ****
>  82.50   41 **
>  85.00    2 *
>  87.50    9 *
>  90.00   12 *
>  92.50   66 ****
>  95.00  425 *********************
>  97.50 1225 ***********************************************************
>
> Note the hams scoring all the way in the 90s.

We see that under Graham's scheme too (else it would never produce false
positives -- we use spam_cutoff 0.90 with it); extremes in, extremes out.

> There are no spams here!

I didn't catch the meaning.  You had 1728 (12+66+425+1225) spam "scoring all
the way in the 90s", so "here" must refer to something else?
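
For anyone who wants to check that count, here is a small sketch that
tallies the quoted spam histogram at or above a given score. The bucket
values are copied from the histogram above; the helper name is made up
for illustration.

```python
# (bucket lower bound, count) pairs copied from the top of the quoted
# spam histogram.
spam_buckets = [
    (85.00, 2),
    (87.50, 9),
    (90.00, 12),
    (92.50, 66),
    (95.00, 425),
    (97.50, 1225),
]


def count_at_or_above(buckets, cutoff):
    """Sum the counts of every bucket whose lower bound is >= cutoff."""
    return sum(count for lower, count in buckets if lower >= cutoff)


spam_in_90s = count_at_or_above(spam_buckets, 90.0)
print(spam_in_90s)
```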