[Spambayes] There Can Be Only One
Tim Peters
tim.one@comcast.net
Thu, 26 Sep 2002 13:29:37 -0400
[Greg Ward]
> ...
> Here are the histograms for run5 (Graham):
Please don't chop off the header line on the histogram. There's useful info
there. Also, the individual run histograms aren't *usually* interesting.
The final histogram (which says "all runs" in its header line) is often very
interesting (I think you're actually showing that, but *calling* it "run5").
> ...
> and for run7 (Robinson f(w)):
>
> -> <stat> Ham scores for all runs: 2000 items; mean 20.38; sdev 9.18
> * = 5 items
> 0.00 0
> 2.50 41 *********
> 5.00 96 ********************
> 7.50 144 *****************************
> 10.00 148 ******************************
> 12.50 146 ******************************
> 15.00 206 ******************************************
> 17.50 202 *****************************************
> 20.00 253 ***************************************************
> 22.50 212 *******************************************
> 25.00 173 ***********************************
> 27.50 124 *************************
> 30.00 77 ****************
> 32.50 51 ***********
> 35.00 41 *********
> 37.50 26 ******
> 40.00 13 ***
> 42.50 13 ***
> 45.00 11 ***
> 47.50 11 ***
> 50.00 4 *
> 52.50 4 *
> 55.00 1 *
> 57.50 3 *
> 60.00 0
> 62.50 0
> 65.00 0
> 67.50 0
> 70.00 0
> 72.50 0
> 75.00 0
> 77.50 0
> 80.00 0
> 82.50 0
> 85.00 0
> 87.50 0
> 90.00 0
> 92.50 0
> 95.00 0
> 97.50 0
>
> -> <stat> Spam scores for all runs: 2000 items; mean 79.56; sdev 11.40
> * = 3 items
> 0.00 0
> 2.50 0
> 5.00 0
> 7.50 0
> 10.00 0
> 12.50 0
> 15.00 0
> 17.50 0
> 20.00 0
> 22.50 0
> 25.00 0
> 27.50 0
> 30.00 0
> 32.50 0
> 35.00 0
> 37.50 2 *
> 40.00 1 *
> 42.50 5 **
> 45.00 4 **
> 47.50 6 **
> 50.00 6 **
> 52.50 7 ***
> 55.00 18 ******
> 57.50 41 **************
> 60.00 47 ****************
> 62.50 68 ***********************
> 65.00 105 ***********************************
> 67.50 115 ***************************************
> 70.00 160 ******************************************************
> 72.50 141 ***********************************************
> 75.00 149 **************************************************
> 77.50 128 *******************************************
> 80.00 118 ****************************************
> 82.50 154 ****************************************************
> 85.00 159 *****************************************************
> 87.50 171 *********************************************************
> 90.00 119 ****************************************
> 92.50 85 *****************************
> 95.00 75 *************************
> 97.50 116 ***************************************
> -> best cutoff for all runs: 0.5
> -> with 12 fp + 18 fn = 30 mistakes
>
> Oops, just noticed the "best cutoff for all runs" thing. I must have
> misinterpreted the run6 output -- picking 0.475 was an eyeball average.
> D'ohh.
Note that you can also set best_cutoff_fp_weight to tell this histogram
analysis that, e.g., you hate a false positive 10x as much as a false
negative (but you're running older code there; do cvs up first). Setting
nbuckets higher than the default 40 is helpful here to give the histogram a
finer-grained view of the world. For example, you have 11 ham and 6 spam in
the .475-.500 bucket, so the finest-grained decision we can make here is
that boosting spam_cutoff from 0.475 to 0.5 would eliminate 11 fp and gain
6 fn.
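
To make the bucket arithmetic concrete, here is a minimal Python sketch of that kind of cutoff search, run on the "all runs" bucket counts quoted above. It is not the actual histogram-analysis code in the test driver: the function name best_cutoff, its fp_weight argument, and the tie-breaking rule (lowest cutoff wins) are assumptions made for illustration. It only tries cutoffs that fall on bucket boundaries, which is why more buckets give a finer-grained answer.

```python
# Bucket counts copied from the "all runs" histograms quoted above.
# With 40 buckets, bucket i covers scores in [i/40, (i+1)/40).
ham = [0, 41, 96, 144, 148, 146, 206, 202, 253, 212, 173, 124,
       77, 51, 41, 26, 13, 13, 11, 11, 4, 4, 1, 3] + [0] * 16
spam = [0] * 15 + [2, 1, 5, 4, 6, 6, 7, 18, 41, 47, 68, 105, 115,
                   160, 141, 149, 128, 118, 154, 159, 171, 119, 85,
                   75, 116]


def best_cutoff(ham, spam, fp_weight=1.0):
    """Return (cutoff, fp, fn) minimizing fp_weight*fp + fn.

    Hypothetical helper: only bucket boundaries are tried as cutoffs,
    and the lowest cutoff wins a tie.
    """
    n = len(ham)
    best = None
    for i in range(n + 1):
        fp = sum(ham[i:])    # ham scoring at or above the cutoff
        fn = sum(spam[:i])   # spam scoring below the cutoff
        cost = fp_weight * fp + fn
        if best is None or cost < best[0]:
            best = (cost, i / n, fp, fn)
    return best[1:]


print(best_cutoff(ham, spam))                # (0.5, 12, 18)
print(best_cutoff(ham, spam, fp_weight=10))  # (0.55, 4, 31)
```

With equal weights this reproduces the "12 fp + 18 fn = 30 mistakes" line at cutoff 0.5. Weighting a false positive 10x pushes the boundary up to 0.55 for this data, trading 8 fewer fp for 13 more fn.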