[Spambayes] RE: spam detection via probability - actual results!

Sjoerd Mullender sjoerd@acm.org
Fri, 20 Sep 2002 11:29:19 +0200


On Fri, Sep 20 2002 Tim Peters wrote:

> [Classifier]
> use_robinson_probability: True
> max_discriminators: 150
> hambias: 1.0
> [TestDriver]
> spam_cutoff: 0.50

Here are my results.  In addition to the settings above, both runs
also had
[Tokenizer]
count_all_header_lines: True
mine_received_headers: True
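
For anyone wanting to reproduce the setup, here is a minimal sketch of the combined options as one ini-style file, read back with Python's standard configparser module. It only illustrates how the three sections fit together; the project's own option handling may work differently, and the names OPTIONS and load_options are made up for this example.

```python
import configparser

# Options quoted in this message, written as one ini-style file.
# Assumption: the stdlib configparser is used purely for illustration.
OPTIONS = """
[Classifier]
use_robinson_probability: True
max_discriminators: 150
hambias: 1.0

[TestDriver]
spam_cutoff: 0.50

[Tokenizer]
count_all_header_lines: True
mine_received_headers: True
"""

def load_options(text):
    """Parse the ini-style text and return the parser (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser

opts = load_options(OPTIONS)
print(opts.getfloat("Classifier", "hambias"))
print(opts.getfloat("TestDriver", "spam_cutoff"))
print(opts.getboolean("Tokenizer", "mine_received_headers"))
```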

run1s -> run2s
-> <stat> tested 100 hams & 100 spams against 700 hams & 700 spams
-> <stat> tested 100 hams & 100 spams against 700 hams & 700 spams
-> <stat> tested 100 hams & 100 spams against 700 hams & 700 spams
-> <stat> tested 100 hams & 100 spams against 700 hams & 700 spams
-> <stat> tested 100 hams & 100 spams against 700 hams & 700 spams
-> <stat> tested 100 hams & 100 spams against 700 hams & 700 spams
-> <stat> tested 100 hams & 100 spams against 700 hams & 700 spams
-> <stat> tested 100 hams & 100 spams against 700 hams & 700 spams
-> <stat> tested 100 hams & 100 spams against 700 hams & 700 spams
-> <stat> tested 100 hams & 100 spams against 700 hams & 700 spams
-> <stat> tested 100 hams & 100 spams against 700 hams & 700 spams
-> <stat> tested 100 hams & 100 spams against 700 hams & 700 spams
-> <stat> tested 100 hams & 100 spams against 700 hams & 700 spams
-> <stat> tested 100 hams & 100 spams against 700 hams & 700 spams
-> <stat> tested 100 hams & 100 spams against 700 hams & 700 spams
-> <stat> tested 100 hams & 100 spams against 700 hams & 700 spams

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  1.000  lost  +(was 0)
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  7 times
lost  1 times

total unique fp went from 0 to 1 lost  +(was 0)
mean fp % went from 0.0 to 0.125 lost  +(was 0)

false negative percentages
    2.000  1.000  won    -50.00%
    3.000  2.000  won    -33.33%
    0.000  0.000  tied
    1.000  1.000  tied
    4.000  2.000  won    -50.00%
    0.000  0.000  tied
    1.000  0.000  won   -100.00%
    1.000  1.000  tied

won   4 times
tied  4 times
lost  0 times

total unique fn went from 12 to 7 won    -41.67%
mean fn % went from 1.5 to 0.875 won    -41.67%
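
The summary lines can be checked by hand from the per-run percentages. A minimal sketch, assuming 100 spams per run (as in the stat lines) and that no message is counted in more than one run, so the total is just the sum over runs. The function name summarize is hypothetical and not part of the test driver.

```python
def summarize(before, after, n_per_run=100):
    """Recompute totals, means and percentage change from per-run error rates.

    before/after are lists of per-run error percentages.
    Assumption: each run tested n_per_run messages of the relevant kind.
    """
    total_before = round(sum(p * n_per_run / 100 for p in before))
    total_after = round(sum(p * n_per_run / 100 for p in after))
    mean_before = sum(before) / len(before)
    mean_after = sum(after) / len(after)
    # A percentage change is undefined when the starting count is 0,
    # which is why the fp lines above show "+(was 0)" instead.
    if total_before == 0:
        change = None
    else:
        change = 100.0 * (total_after - total_before) / total_before
    return total_before, total_after, mean_before, mean_after, change

# False negative percentages from the table above (hambias 1.0 run).
fn_before = [2.0, 3.0, 0.0, 1.0, 4.0, 0.0, 1.0, 1.0]
fn_after = [1.0, 2.0, 0.0, 1.0, 2.0, 0.0, 0.0, 1.0]

result = summarize(fn_before, fn_after)
print(result)
```

This gives totals of 12 and 7, means of 1.5 and 0.875, and a change of about -41.67%, matching the two summary lines.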

The histograms before the change:

Ham distribution for all runs:
* = 14 items
  0.00 800 **********************************************************
  2.50   0 
[ deleted because all 0 ]

Spam distribution for all runs:
* = 14 items
  0.00  11 *
  2.50   0 
[ deleted because all 0 ]
 82.50   0 
 85.00   1 *
 87.50   0 
 90.00   1 *
 92.50   0 
 95.00   1 *
 97.50 786 *********************************************************

and after:

Ham distribution for all runs:
* = 3 items
  0.00  68 ***********************
  2.50  16 ******
  5.00  13 *****
  7.50  20 *******
 10.00  32 ***********
 12.50  99 *********************************
 15.00  85 *****************************
 17.50 121 *****************************************
 20.00 107 ************************************
 22.50  81 ***************************
 25.00  47 ****************
 27.50  40 **************
 30.00  18 ******
 32.50  20 *******
 35.00  15 *****
 37.50   6 **
 40.00   4 **
 42.50   1 *
 45.00   1 *
 47.50   5 **
 50.00   0 
 52.50   1 *
 55.00   0 
 57.50   0 
 60.00   0 
 62.50   0 
 65.00   0 
 67.50   0 
 70.00   0 
 72.50   0 
 75.00   0 
 77.50   0 
 80.00   0 
 82.50   0 
 85.00   0 
 87.50   0 
 90.00   0 
 92.50   0 
 95.00   0 
 97.50   0 

Spam distribution for all runs:
* = 2 items
  0.00   0 
  2.50   0 
  5.00   0 
  7.50   0 
 10.00   0 
 12.50   0 
 15.00   0 
 17.50   0 
 20.00   0 
 22.50   0 
 25.00   0 
 27.50   0 
 30.00   0 
 32.50   0 
 35.00   0 
 37.50   1 *
 40.00   1 *
 42.50   0 
 45.00   2 *
 47.50   3 **
 50.00   5 ***
 52.50   4 **
 55.00   7 ****
 57.50   9 *****
 60.00  25 *************
 62.50  44 **********************
 65.00  57 *****************************
 67.50  74 *************************************
 70.00  69 ***********************************
 72.50  78 ***************************************
 75.00  52 **************************
 77.50  59 ******************************
 80.00  50 *************************
 82.50  40 ********************
 85.00  40 ********************
 87.50  30 ***************
 90.00  18 *********
 92.50  17 *********
 95.00  10 *****
 97.50 105 *****************************************************
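
The error counts can also be read straight off these histograms, given spam_cutoff 0.50. A rough sketch, assuming each bucket is labelled by its lower bound in percent and that a message counts as spam when its bucket starts at or above the cutoff. That only matches the classifier when the cutoff falls on a bucket boundary, as 50.0 does. The function and variable names are hypothetical.

```python
def errors_from_histograms(ham, spam, cutoff=50.0):
    """Count errors from {bucket lower bound in percent: count} maps.

    Assumption: buckets at or above the cutoff are treated as "spam".
    """
    # Hams scored at or above the cutoff are false positives.
    fp = sum(n for lo, n in ham.items() if lo >= cutoff)
    # Spams scored below the cutoff are false negatives.
    fn = sum(n for lo, n in spam.items() if lo < cutoff)
    return fp, fn

# Only the buckets near the cutoff, copied from the "after" histograms;
# all other buckets on the wrong side of the cutoff are 0.
ham_after = {45.0: 1, 47.5: 5, 50.0: 0, 52.5: 1}
spam_after = {37.5: 1, 40.0: 1, 42.5: 0, 45.0: 2, 47.5: 3, 50.0: 5}

print(errors_from_histograms(ham_after, spam_after))  # (1, 7)
```

The single ham in the 52.50 bucket is the one false positive, and the seven spams below 50.00 are the seven false negatives reported above.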

Here are the results if I keep hambias at 2.0 instead of setting it to 1.0:

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  8 times
lost  0 times

total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied

false negative percentages
    2.000  3.000  lost   +50.00%
    3.000  4.000  lost   +33.33%
    0.000  0.000  tied
    1.000  1.000  tied
    4.000  5.000  lost   +25.00%
    0.000  0.000  tied
    1.000  3.000  lost  +200.00%
    1.000  2.000  lost  +100.00%

won   0 times
tied  3 times
lost  5 times

total unique fn went from 12 to 18 lost   +50.00%
mean fn % went from 1.5 to 2.25 lost   +50.00%

Only the after histograms (the before ones are the same as above):

Ham distribution for all runs:
* = 2 items
  0.00  92 **********************************************
  2.50  47 ************************
  5.00  38 *******************
  7.50  59 ******************************
 10.00  93 ***********************************************
 12.50  90 *********************************************
 15.00 119 ************************************************************
 17.50 108 ******************************************************
 20.00  57 *****************************
 22.50  32 ****************
 25.00  24 ************
 27.50  15 ********
 30.00  11 ******
 32.50   6 ***
 35.00   3 **
 37.50   2 *
 40.00   2 *
 42.50   2 *
 45.00   0 
 47.50   0 
 50.00   0 
 52.50   0 
 55.00   0 
 57.50   0 
 60.00   0 
 62.50   0 
 65.00   0 
 67.50   0 
 70.00   0 
 72.50   0 
 75.00   0 
 77.50   0 
 80.00   0 
 82.50   0 
 85.00   0 
 87.50   0 
 90.00   0 
 92.50   0 
 95.00   0 
 97.50   0 

Spam distribution for all runs:
* = 2 items
  0.00   0 
  2.50   0 
  5.00   0 
  7.50   0 
 10.00   0 
 12.50   0 
 15.00   0 
 17.50   0 
 20.00   0 
 22.50   0 
 25.00   0 
 27.50   0 
 30.00   3 **
 32.50   0 
 35.00   0 
 37.50   1 *
 40.00   2 *
 42.50   2 *
 45.00   6 ***
 47.50   4 **
 50.00  13 *******
 52.50  17 *********
 55.00  29 ***************
 57.50  45 ***********************
 60.00  56 ****************************
 62.50  76 **************************************
 65.00  60 ******************************
 67.50  62 *******************************
 70.00  54 ***************************
 72.50  59 ******************************
 75.00  53 ***************************
 77.50  37 *******************
 80.00  44 **********************
 82.50  20 **********
 85.00  20 **********
 87.50  13 *******
 90.00  10 *****
 92.50   8 ****
 95.00   4 **
 97.50 102 ***************************************************

-- Sjoerd Mullender <sjoerd@acm.org>