[Spambayes] RE: Further Improvement 2

Sjoerd Mullender sjoerd@acm.org
Mon, 23 Sep 2002 16:04:58 +0200


On Sat, Sep 21 2002 Tim Peters wrote:

> Since I got a big win without any effort <wink> by introducing a brand new
> "ignore probs that aren't at least this far from neutral" knob, that's the
> one I'm most inclined to play with right now.  There isn't a knob in
> existence that won't be played with, but especially large tests take
> significant wall-clock time to complete, and there's only so much testing
> one can do in a day.
> 
> Testers, "a" is already exposed via:
> 
> [Classifier]
> robinson_probability_a: 1.0
> 
> I think values nearer to 0 are most likely to be most interesting.

Here are my results.  The before run has options
"""
[Classifier]
use_robinson_probability: True
use_robinson_combining: True
max_discriminators: 1500
[TestDriver]
spam_cutoff: 0.50
"""
and the after run adds
"""
robinson_probability_a: 0.1
"""
to the [Classifier] section.

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.405  0.540  lost   +33.33%
    0.270  0.135  won    -50.00%

won   1 times
tied  3 times
lost  1 times

total unique fp went from 5 to 5 tied
mean fp % went from 0.134952766532 to 0.134952766532 tied

false negative percentages
    1.571  1.571  tied
    2.618  2.094  won    -20.02%
    0.524  1.047  lost   +99.81%
    1.571  0.000  won   -100.00%
    2.094  1.571  won    -24.98%

won   3 times
tied  1 times
lost  1 times

total unique fn went from 16 to 12 won    -25.00%
mean fn % went from 1.67539267016 to 1.25654450262 won    -25.00%
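
For readers checking the won/lost columns: the percentages appear to be the relative change from the before rate to the after rate, with a lower error rate counting as a win. The following is only a rough sketch of that arithmetic using the false negative figures above. The function name `describe_change` is made up for illustration and is not taken from the Spambayes comparison script.

```python
def describe_change(before, after):
    """Label a before/after error-rate pair and return the relative change in percent."""
    if after == before:
        return "tied", 0.0
    # A lower error rate after the change counts as a win.
    label = "won" if after < before else "lost"
    if before == 0:
        # Relative change is undefined when the starting rate is zero.
        return label, None
    return label, (after - before) / before * 100.0


# False negative percentages for the five runs, copied from the table above.
fn_before = [1.571, 2.618, 0.524, 1.571, 2.094]
fn_after = [1.571, 2.094, 1.047, 0.000, 1.571]

results = [describe_change(b, a) for b, a in zip(fn_before, fn_after)]

for b, a, (label, pct) in zip(fn_before, fn_after, results):
    change = "" if pct is None or label == "tied" else f"{pct:+.2f}%"
    print(f"{b:9.3f}  {a:5.3f}  {label:5s} {change}")
```

Because the inputs are the rounded three-decimal percentages, the output matches the table only to the precision shown there.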

And the before histograms:

Ham distribution for all runs:
3705 items; mean 23.12; sample sdev 7.44
* = 12 items
  0.00   0 
  2.50   0 
  5.00 105 *********
  7.50  69 ******
 10.00  98 *********
 12.50 306 **************************
 15.00 177 ***************
 17.50 352 ******************************
 20.00 497 ******************************************
 22.50 701 ***********************************************************
 25.00 506 *******************************************
 27.50 354 ******************************
 30.00 220 *******************
 32.50  99 *********
 35.00  93 ********
 37.50  49 *****
 40.00  25 ***
 42.50  26 ***
 45.00  16 **
 47.50   7 *
 50.00   1 *
 52.50   4 *
 55.00   0 

Spam distribution for all runs:
955 items; mean 68.33; sample sdev 8.74
* = 2 items
 32.50   0 
 35.00   1 *
 37.50   2 *
 40.00   0 
 42.50   0 
 45.00   4 **
 47.50   9 *****
 50.00   9 *****
 52.50  18 *********
 55.00  42 *********************
 57.50  73 *************************************
 60.00  88 ********************************************
 62.50 104 ****************************************************
 65.00 111 ********************************************************
 67.50 105 *****************************************************
 70.00 109 *******************************************************
 72.50  53 ***************************
 75.00  76 **************************************
 77.50  62 *******************************
 80.00  37 *******************
 82.50  23 ************
 85.00   8 ****
 87.50   6 ***
 90.00  15 ********
 92.50   0 
 95.00   0 
 97.50   0 

And finally the after histograms:

Ham distribution for all runs:
3705 items; mean 21.33; sample sdev 6.94
* = 12 items
  0.00   0 
  2.50   0 
  5.00 152 *************
  7.50  24 **
 10.00 197 *****************
 12.50 328 ****************************
 15.00 261 **********************
 17.50 498 ******************************************
 20.00 689 **********************************************************
 22.50 610 ***************************************************
 25.00 393 *********************************
 27.50 218 *******************
 30.00 117 **********
 32.50  79 *******
 35.00  71 ******
 37.50  25 ***
 40.00  11 *
 42.50  19 **
 45.00   6 *
 47.50   2 *
 50.00   4 *
 52.50   1 *
 55.00   0 

Spam distribution for all runs:
955 items; mean 72.04; sample sdev 10.31
* = 3 items
 35.00   0 
 37.50   1 *
 40.00   1 *
 42.50   1 *
 45.00   3 *
 47.50   6 **
 50.00  14 *****
 52.50  15 *****
 55.00  30 **********
 57.50  42 **************
 60.00  63 *********************
 62.50  61 *********************
 65.00  74 *************************
 67.50  91 *******************************
 70.00 102 **********************************
 72.50 123 *****************************************
 75.00  69 ***********************
 77.50  49 *****************
 80.00  45 ***************
 82.50  38 *************
 85.00  44 ***************
 87.50  42 **************
 90.00  17 ******
 92.50  12 ****
 95.00  12 ****
 97.50   0 
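
The histograms can be cross-checked against the error totals reported earlier. A minimal sketch, assuming each histogram label is the lower edge of its bucket, that scores are on a 0-100 scale so spam_cutoff 0.50 corresponds to the 50.00 bucket, and that a message scoring at or above the cutoff is called spam. Only the buckets near the cutoff are copied in, and the helper names are hypothetical.

```python
CUTOFF = 50.00  # spam_cutoff: 0.50 on the histograms' 0-100 scale (assumed)


def count_at_or_above(buckets, cutoff):
    """Sum the counts of buckets whose lower edge is at or above the cutoff."""
    return sum(n for edge, n in buckets.items() if edge >= cutoff)


def count_below(buckets, cutoff):
    """Sum the counts of buckets whose lower edge is below the cutoff."""
    return sum(n for edge, n in buckets.items() if edge < cutoff)


# Ham buckets from 47.50 upward (a false positive is ham at or above the cutoff).
ham_before = {47.50: 7, 50.00: 1, 52.50: 4, 55.00: 0}
ham_after = {47.50: 2, 50.00: 4, 52.50: 1, 55.00: 0}

# Spam buckets up to 50.00 (a false negative is spam below the cutoff).
spam_before = {32.50: 0, 35.00: 1, 37.50: 2, 40.00: 0, 42.50: 0,
               45.00: 4, 47.50: 9, 50.00: 9}
spam_after = {35.00: 0, 37.50: 1, 40.00: 1, 42.50: 1,
              45.00: 3, 47.50: 6, 50.00: 14}

fp_before = count_at_or_above(ham_before, CUTOFF)
fp_after = count_at_or_above(ham_after, CUTOFF)
fn_before = count_below(spam_before, CUTOFF)
fn_after = count_below(spam_after, CUTOFF)

print("fp:", fp_before, "->", fp_after)
print("fn:", fn_before, "->", fn_after)
```

Under those assumptions the tail counts agree with the totals above: 5 false positives both before and after, and false negatives dropping from 16 to 12.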


-- Sjoerd Mullender <sjoerd@acm.org>