[Spambayes] Moving closer to Gary's ideal
Tim Peters
tim.one@comcast.net
Sun, 22 Sep 2002 03:26:14 -0400
[Guido]
> One thing isn't clear to me. Does a dot at, say, 50.00 mean that
> there are X items whose score is between 48.75 and 51.25, or does it
> mean those items are between 47.50 and 50.00?
That the first bucket is labelled 0 and the last (if you're using 40
buckets) 97.5 is A Clue: in your example, it means an item in the 50 bucket
has a score S satisfying
50.00 <= S < 52.50
(again assuming you're using 40 buckets).
An exception is made for the final bucket, which includes all scores of 100
(I've never seen one of those under Gary's scheme, though).
Since you're a Python guy <wink>, except for that endcase the rest is easy
to remember via picturing
bucketcount[int(score * nbuckets)] += 1
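Fleshed out a bit (illustrative only -- the names here aren't necessarily
what the test driver's histogram class uses):

    nbuckets = 40
    bucketcount = [0] * nbuckets

    def add_score(score):
        # score is a float in [0.0, 1.0]
        i = int(score * nbuckets)
        if i == nbuckets:        # the endcase: a score of exactly 1.0
            i = nbuckets - 1     # folds into the last bucket
        bucketcount[i] += 1

so bucket i holds scores S with i/nbuckets <= S < (i+1)/nbuckets, except
that the last bucket also grabs 1.0.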
> OK, here are my histograms (truncated):
>
> Ham distribution for all in this training set:
> * = 27 items
> 5.00 1 *
> 7.50 0
> 10.00 0
> 12.50 8 *
> 15.00 35 **
> 17.50 68 ***
> 20.00 172 *******
> 22.50 342 *************
> 25.00 729 ***************************
> 27.50 1305 *************************************************
> 30.00 1570 ***********************************************************
> 32.50 1466 *******************************************************
> 35.00 1104 *****************************************
> 37.50 634 ************************
> 40.00 347 *************
> 42.50 213 ********
> 45.00 126 *****
> 47.50 74 ***
> 50.00 43 **
> 52.50 14 *
> 55.00 10 *
> 57.50 8 *
> 60.00 3 *
> 62.50 1 *
> 65.00 1 *
> 67.50 0
> 70.00 0
> 72.50 0
> 75.00 2 *
So if you set your spam_cutoff to 0.75, you would have 2 false positives.
Ditto for smaller settings until you get to 0.65, at which point you'd have
1+2 = 3 f-p; drop it to 0.625 and you'd pick up one more; drop to 0.6 and
you'd get 3 more; etc.
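Counting false positives for a given cutoff is just summing the ham buckets
at or above it; a sketch, assuming the cutoff lands on a bucket boundary
(bucketcount as in the snippet above):

    def false_positives(ham_buckets, spam_cutoff, nbuckets=40):
        # every ham scoring >= spam_cutoff is a false positive;
        # round to dodge float fuzz at the bucket boundary
        first = int(round(spam_cutoff * nbuckets))
        return sum(ham_buckets[first:])

    false_positives(bucketcount, 0.75)   # -> 2 for the histogram above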
> Spam distribution for all in this training set:
> * = 12 items
> 45.00 1 *
> 47.50 1 *
> 50.00 2 *
> 52.50 2 *
> 55.00 4 *
> 57.50 16 **
> 60.00 40 ****
> 62.50 62 ******
> 65.00 107 *********
> 67.50 241 *********************
> 70.00 452 **************************************
> 72.50 719 ************************************************************
> 75.00 717 ************************************************************
> 77.50 501 ******************************************
> 80.00 261 **********************
> 82.50 45 ****
> 85.00 13 **
> 87.50 7 *
>
> What's the ideal cutoff here to compete with Graham?
>
> The last 4 output lines from result.py for that set are:
>
> total unique false pos 40
> total unique false neg 204
> average fp % 0.0480748377883
> average fn % 0.634757552464
>
> For my Robinson run with cutoff = 0.575, they are:
>
> total unique false pos 101
> total unique false neg 129
> average fp % 0.121612480864
> average fn % 0.401042537411
I can't answer the question: you showed a histogram for a single training
set above ("Spam distribution for all in this training set").  To match the
figures you gave just above, you have to look at the aggregate "all runs"
histograms at the end of the output.
Here's the relevant slice of your ham histo:
> 57.50 8 *
> 60.00 3 *
> 62.50 1 *
> 65.00 1 *
> 67.50 0
> 70.00 0
> 72.50 0
> 75.00 2 *
You had 15 (8+3+1+1+0+0+0+2) ham total that scored at or above 0.575 *in
this particular sub-run*. If you're running a 10-fold cross validation,
there are 9 other per-sub-run histogram pairs I haven't seen, plus 10 other
pairs that only make good sense when running timtest.py, and the final
histogram pair at the end that adds them all up.
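That final pair is just the element-wise sum of the per-run pairs; roughly
(illustrative -- the driver's histogram objects take care of this for you):

    nbuckets = 40
    all_ham  = [0] * nbuckets
    all_spam = [0] * nbuckets
    # per_run_histograms: one (ham_buckets, spam_buckets) pair per sub-run
    for ham_run, spam_run in per_run_histograms:
        for i in range(nbuckets):
            all_ham[i]  += ham_run[i]
            all_spam[i] += spam_run[i]

and it's the all_ham/all_spam pair you need to stare at to pick a cutoff
that matches the aggregate fp/fn counts.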
> ...
> Well, I for one, couldn't decide by staring at the two histograms
> above which one to call "fatter".
It was your ham distribution that was fatter, but again I ask what use
you'd have for the mean and sdev if I bothered to compute and display them?
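Not that they're hard to get -- a sketch, assuming a plain list of
per-message scores:

    import math

    def mean_sdev(scores):
        n = len(scores)
        mean = sum(scores) / n
        # population standard deviation of the scores
        var = sum((s - mean) ** 2 for s in scores) / n
        return mean, math.sqrt(var)

"Fatter" in that sense would just mean one distribution's sdev is larger
than the other's.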