[Spambayes] There Can Be Only One

Tim Peters tim.one@comcast.net
Thu, 26 Sep 2002 13:29:37 -0400


[Greg Ward]
> ...
> Here are the histograms for run5 (Graham):

Please don't chop off the header line on the histogram.  There's useful info
there.  Also, the individual run histograms aren't *usually* interesting.
The final histogram (which says "all runs" in its header line) is often very
interesting (I think you're actually showing that, but *calling* it "run5").

> ...
> and for run7 (Robinson f(w)):
>
> -> <stat> Ham scores for all runs: 2000 items; mean 20.38; sdev 9.18
> * = 5 items
>   0.00   0
>   2.50  41 *********
>   5.00  96 ********************
>   7.50 144 *****************************
>  10.00 148 ******************************
>  12.50 146 ******************************
>  15.00 206 ******************************************
>  17.50 202 *****************************************
>  20.00 253 ***************************************************
>  22.50 212 *******************************************
>  25.00 173 ***********************************
>  27.50 124 *************************
>  30.00  77 ****************
>  32.50  51 ***********
>  35.00  41 *********
>  37.50  26 ******
>  40.00  13 ***
>  42.50  13 ***
>  45.00  11 ***
>  47.50  11 ***
>  50.00   4 *
>  52.50   4 *
>  55.00   1 *
>  57.50   3 *
>  60.00   0
>  62.50   0
>  65.00   0
>  67.50   0
>  70.00   0
>  72.50   0
>  75.00   0
>  77.50   0
>  80.00   0
>  82.50   0
>  85.00   0
>  87.50   0
>  90.00   0
>  92.50   0
>  95.00   0
>  97.50   0
>
> -> <stat> Spam scores for all runs: 2000 items; mean 79.56; sdev 11.40
> * = 3 items
>   0.00   0
>   2.50   0
>   5.00   0
>   7.50   0
>  10.00   0
>  12.50   0
>  15.00   0
>  17.50   0
>  20.00   0
>  22.50   0
>  25.00   0
>  27.50   0
>  30.00   0
>  32.50   0
>  35.00   0
>  37.50   2 *
>  40.00   1 *
>  42.50   5 **
>  45.00   4 **
>  47.50   6 **
>  50.00   6 **
>  52.50   7 ***
>  55.00  18 ******
>  57.50  41 **************
>  60.00  47 ****************
>  62.50  68 ***********************
>  65.00 105 ***********************************
>  67.50 115 ***************************************
>  70.00 160 ******************************************************
>  72.50 141 ***********************************************
>  75.00 149 **************************************************
>  77.50 128 *******************************************
>  80.00 118 ****************************************
>  82.50 154 ****************************************************
>  85.00 159 *****************************************************
>  87.50 171 *********************************************************
>  90.00 119 ****************************************
>  92.50  85 *****************************
>  95.00  75 *************************
>  97.50 116 ***************************************
> -> best cutoff for all runs: 0.5
> ->     with 12 fp + 18 fn = 30 mistakes
>
> Oops, just noticed the "best cutoff for all runs" thing.  I must have
> misinterpreted the run6 output -- picking 0.475 was an eyeball average.
> D'ohh.

Note that you can also set best_cutoff_fp_weight to tell this histogram
analysis that, e.g., you hate a false positive 10x as much as a false
negative (but you're running older code there; do cvs up first).  Setting
nbuckets higher than the default 40 is helpful here to give the histogram a
finer-grained view of the world.  For example, you have 11 ham and 6 spam in
the .475-.500 bucket, so the finest-grained decision we can make here is
that boosting spam_cutoff from 0.475 to 0.5 would eliminate 11 fp and gain
6 fn.
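
To make the bucket arithmetic concrete, here is a rough sketch of that kind of
cutoff search, run over the "all runs" bucket counts quoted above.  It is not
the code the test driver uses: the function name best_cutoff and its fp_weight
argument are made up for illustration (fp_weight only mimics what
best_cutoff_fp_weight is described as doing), and it assumes a message counts
as spam when its bucket starts at or above the cutoff.

```python
# Bucket counts copied from the quoted "all runs" histograms (40 buckets,
# each 0.025 wide; index 0 is the 0.00 bucket, index 39 is the 97.50 bucket).
HAM = [0, 41, 96, 144, 148, 146, 206, 202, 253, 212, 173, 124, 77, 51, 41,
       26, 13, 13, 11, 11, 4, 4, 1, 3] + [0] * 16
SPAM = [0] * 15 + [2, 1, 5, 4, 6, 6, 7, 18, 41, 47, 68, 105, 115, 160, 141,
                   149, 128, 118, 154, 159, 171, 119, 85, 75, 116]


def best_cutoff(ham, spam, fp_weight=1.0):
    """Return (cutoff, fp, fn) minimising fp_weight*fp + fn over bucket edges.

    Hypothetical helper: only bucket edges can be tried, because the
    histogram does not say where scores fall inside a bucket.
    On ties, the lowest cutoff wins.
    """
    assert len(ham) == len(spam)
    nbuckets = len(ham)
    fp = sum(ham)  # cutoff 0.0: every ham is called spam
    fn = 0         # ... and no spam is missed
    best = (0.0, fp, fn)
    best_cost = fp_weight * fp + fn
    for i in range(nbuckets):
        # Slide the cutoff from the lower edge of bucket i to its upper edge:
        # ham in bucket i stops being fp, spam in bucket i becomes fn.
        fp -= ham[i]
        fn += spam[i]
        cost = fp_weight * fp + fn
        if cost < best_cost:
            best_cost = cost
            best = ((i + 1) / nbuckets, fp, fn)
    return best


print(best_cutoff(HAM, SPAM))                # (0.5, 12, 18)
print(best_cutoff(HAM, SPAM, fp_weight=10))  # (0.55, 4, 31)
```

With equal weights this reproduces the quoted "12 fp + 18 fn = 30 mistakes" at
0.5.  The weight-10 line is only what this sketch computes from the same
counts, not output from the real analysis.  More buckets would give the loop
more edges to try, which is the point of raising nbuckets.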