[Spambayes] First result from Gary Robinson's ideas

Tim Peters tim.one@comcast.net
Wed, 18 Sep 2002 22:44:34 -0400


[Neale Pickett, quoting Neale Pickett]
>> total unique fn went from 12 to 1000 lost  +8233.33%
>> mean fn % went from 1.2 to 100.0 lost  +8233.33%

[and then Neale gets on Neale's case]
> Please disregard those results.  This says that every single message in
> my spam corpus got tagged as ham with this change.  Investigating, I
> found that I had neglected to remove a line calculating prob after
> inserting Tim's new code, so everything was getting a probability of
> 0.5.  On the positive side, my FP rate went to 0!  ;)

You can forgive yourself, if you have to <wink>.  I should have given you a
patch instead of an English description of what to do!  There was another
problem here too, though, as the summary statistics quoted above obviously
didn't come from the same run from which your score histograms came:  the
histograms showed a low false positive rate.

[and after all that was repaired]
> ...
> Additionally, I changed the "spam cutoff" from 0.9 to 0.5.  Comparing
> the results before (run1) and after (run2), I get:
>
> """
> run1s -> run2s
> -> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
> -> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
> -> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
> -> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
> -> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
> -> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
> -> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
> -> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
> -> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
> -> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
>
> false positive percentages
>     2.000  2.000  tied
>     1.500  1.500  tied
>     2.000  2.000  tied
>     1.000  1.500  lost   +50.00%
>     0.500  0.500  tied
>
> won   0 times
> tied  4 times
> lost  1 times
>
> total unique fp went from 14 to 15 lost    +7.14%
> mean fp % went from 1.4 to 1.5 lost    +7.14%
>
> false negative percentages
>     1.500  1.000  won    -33.33%
>     1.000  1.000  tied
>     1.500  1.500  tied
>     1.500  1.500  tied
>     1.000  1.000  tied
>
> won   1 times
> tied  4 times
> lost  0 times
>
> total unique fn went from 13 to 12 won     -7.69%
> mean fn % went from 1.3 to 1.2 won     -7.69%
> """
>
> So false positives basically stayed the same--the one case where the
> false positives got worse, it was only by one message, which I would
> imagine is within the margin of error, but I Am Not A Statistician :).

Intuition is good enough at extremes:  one lousy message is indeed one lousy
message, no matter what the percentage change.
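FWIW, here's a quick back-of-the-envelope sketch (mine alone, nothing from
the test driver) of why a swing of one message means nothing.  The run1 ham
histogram below accounts for 1000 unique hams, so treat each ham as an
independent coin flip at the observed false-positive rate and ask how much
the fp count should wobble just by chance:

    import math

    n_hams = 1000       # the ham histogram counts below sum to 1000 messages
    fp_after = 15       # total unique fp in run2, from the table above

    p = fp_after / float(n_hams)                # ~1.5% fp rate
    sigma = math.sqrt(n_hams * p * (1 - p))     # binomial std dev of the count
    print("expected wobble: +/- %.1f messages" % sigma)

That prints about +/- 3.8 messages, so going from 14 to 15 is deep inside
the noise.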

> But, as Tim said earlier, what's really interesting is the distribution
> of scores across all runs.  The first run, without Gary's modification,
> gives me the following distributions:
>
> """
> Ham distribution for all runs:
> * = 17 items
>   0.00 984 **********************************************************
>   2.50   1 *
>   5.00   0
>   7.50   0
>  10.00   0
>  12.50   0
>  15.00   0
>  17.50   0
>  20.00   0
>  22.50   0
>  25.00   0
>  27.50   0
>  30.00   0
>  32.50   0
>  35.00   0
>  37.50   0
>  40.00   0
>  42.50   0
>  45.00   0
>  47.50   0
>  50.00   1 *
>  52.50   0
>  55.00   0
>  57.50   0
>  60.00   0
>  62.50   0
>  65.00   0
>  67.50   0
>  70.00   0
>  72.50   0
>  75.00   0
>  77.50   0
>  80.00   0
>  82.50   0
>  85.00   0
>  87.50   0
>  90.00   0
>  92.50   0
>  95.00   0
>  97.50  14 *
>
> Spam distribution for all runs:
> * = 17 items
>   0.00  11 *
>   2.50   0
>   5.00   0
>   7.50   0
>  10.00   0
>  12.50   0
>  15.00   0
>  17.50   0
>  20.00   0
>  22.50   0
>  25.00   0
>  27.50   0
>  30.00   0
>  32.50   0
>  35.00   0
>  37.50   0
>  40.00   0
>  42.50   0
>  45.00   0
>  47.50   0
>  50.00   1 *
>  52.50   0
>  55.00   0
>  57.50   0
>  60.00   0
>  62.50   0
>  65.00   0
>  67.50   0
>  70.00   0
>  72.50   0
>  75.00   0
>  77.50   0
>  80.00   0
>  82.50   1 *
>  85.00   0
>  87.50   0
>  90.00   0
>  92.50   2 *
>  95.00   3 *
>  97.50 982 **********************************************************
> """
>
> Your typical Grahamian black-or-white picture, with little middle
> ground.  With Gary's idea, however, comes many more shades of gray:
>
> """
> Ham distribution for all runs:
> * = 12 items
>   0.00 681 *********************************************************
>   2.50  62 ******
>   5.00  18 **
>   7.50  10 *
>  10.00  14 **
>  12.50  33 ***
>  15.00  40 ****
>  17.50  28 ***
>  20.00  14 **
>  22.50  22 **
>  25.00   5 *
>  27.50  11 *
>  30.00  13 **
>  32.50  10 *
>  35.00   6 *
>  37.50   5 *
>  40.00   3 *
>  42.50   5 *
>  45.00   5 *
>  47.50   0

In particular:

>  50.00   1 *
>  52.50   4 *
>  55.00   1 *
>  57.50   1 *

7 (about half) of your false positives scored in the [50, 60) range, and

>  60.00   2 *
>  62.50   1 *
>  65.00   0
>  67.50   1 *

4 scored in [60, 70).  Now I've tweaked more things than I can keep track of
to improve the results using Paul's combining formula, and all of those are
open to re-tweaking to improve results with Gary's.  For the next step,
though, I want to try more of Gary's still-untested ideas.
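For anyone following along without the code handy, here's roughly what the
two combining schemes look like, as I remember them from Paul's essay and
Gary's writeup.  The names and numbers are illustration only; the checked-in
classifier has more tweaks layered on top of both:

    def product(xs):
        result = 1.0
        for x in xs:
            result *= x
        return result

    def paul_combine(probs):
        # Multiply the extreme word probabilities together.  The products
        # saturate quickly, which is why the first pair of histograms is
        # almost all 0.00 and 97.50 with no middle ground.
        p = product(probs)
        q = product([1.0 - x for x in probs])
        return p / (p + q)

    def gary_combine(probs):
        # Take geometric means of the spam and ham evidence.  P and Q are
        # averages (a geometric mean is just a mean in log space), so
        # conflicting clues pull the score toward 0.5 instead of slamming
        # it to an endpoint.  That's where the shades of gray come from,
        # and why a spam cutoff of 0.5 makes more sense here than 0.9.
        n = len(probs)
        P = 1.0 - product([1.0 - x for x in probs]) ** (1.0 / n)   # "spamminess"
        Q = 1.0 - product(probs) ** (1.0 / n)                      # "hamminess"
        return (1.0 + (P - Q) / (P + Q)) / 2.0

    clues = [0.99, 0.99, 0.99, 0.2, 0.2]    # three spammy clues, two hammy
    print("Paul %.5f  Gary %.3f" % (paul_combine(clues), gary_combine(clues)))
    # Paul ~0.99998 (dead certain), Gary ~0.66 (probably spam, not sure)

The same mixed evidence that Paul's formula calls dead-certain spam lands in
the 60s under Gary's, which is the long middling tail you can see in the
second spam histogram.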

>  70.00   2 *
>  72.50   0
>  75.00   0
>  77.50   0
>  80.00   0
>  82.50   0
>  85.00   0
>  87.50   0
>  90.00   0
>  92.50   0
>  95.00   0
>  97.50   2 *
>
> Spam distribution for all runs:
> * = 14 items
>   0.00   1 *
>   2.50   0
>   5.00   0
>   7.50   0
>  10.00   0
>  12.50   0
>  15.00   0
>  17.50   0
>  20.00   0
>  22.50   0
>  25.00   0
>  27.50   0

And here 5 of your false negatives lived in [30, 40):

>  30.00   0
>  32.50   4 *
>  35.00   0
>  37.50   1 *

and 5 more in [40, 50) (which leaves only the 1 at 0.00 outside of [30, 50)):

>  40.00   1 *
>  42.50   1 *
>  45.00   3 *
>  47.50   0

>  50.00   4 *
>  52.50   4 *
>  55.00   2 *
>  57.50   9 *
>  60.00  13 *
>  62.50   9 *
>  65.00  12 *
>  67.50  14 *
>  70.00   9 *
>  72.50  15 **
>  75.00   8 *
>  77.50   7 *
>  80.00  11 *
>  82.50  12 *
>  85.00   9 *
>  87.50   0
>  90.00   1 *
>  92.50   2 *
>  95.00  13 *
>  97.50 835 ************************************************************
> """
>
> So from my perspective (and again, IANAS) it looks like the algorithm
> has gained some humility and is admitting when it's not sure about
> stuff.  I can't say this change is a clear win for my minuscule data
> set, but it *does* appear to make the probability more meaningful.
> Almost like the difference between linear space and log space.

It's encouraging that this pretty much exactly mirrors my results on my
monster-large-- and very different --test corpus.  I still conjecture that
my error rates are lower almost entirely because my training data is so much
larger, but we still haven't got a solid experimental handle on that, and
probably won't unless more people can be shamed <wink> into playing the
testing game.  I need to check in code to make this particular comparison
easier to do, though.
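
In the meantime, if anyone wants to roll their own pictures, a minimal
stand-in for the driver's histogram printer looks something like this (my
sketch, not the project's code):  40 buckets of 2.5 points each, with the
star column scaled to the fattest bucket.

    def print_histogram(scores, nbuckets=40, width=60):
        # scores are percentages in [0.0, 100.0)
        bucket_size = 100.0 / nbuckets
        buckets = [0] * nbuckets
        for s in scores:
            buckets[min(int(s / bucket_size), nbuckets - 1)] += 1
        per_star = max(1, -(-max(buckets) // width))    # ceiling division
        print("* = %d items" % per_star)
        for i, count in enumerate(buckets):
            stars = -(-count // per_star)               # 0 stays 0, else round up
            print("%6.2f %4d %s" % (i * bucket_size, count, "*" * stars))

Scaling by the fattest bucket reproduces the "* = 17 items", "* = 12 items"
and "* = 14 items" headers above (984, 681 and 835 divided by 60, rounded
up), so it's at least in the right ballpark.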