[spambayes-dev] A URL experiment

Kenny Pitt kennypitt at hotmail.com
Mon Jan 5 16:42:14 EST 2004

Here are my test results against 2021 hams and 1942 spams spread evenly
across 10 sets.  The test set comes from a complete capture of my e-mail
stream from a couple of months ago, plus a few more recent mails that
were still lying around in my mail folders and recent training data.
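
As a rough check on what "spread evenly" means for these totals, the sketch below divides each count across 10 sets. The helper name split_evenly is made up for illustration, and handing the remainder to the first sets is an assumption, not necessarily how the sets were actually built.

```python
def split_evenly(total, n_sets):
    """Return per-set counts that differ by at most one message."""
    base, extra = divmod(total, n_sets)
    # Assumption: the first `extra` sets each take one additional message.
    return [base + 1 if i < extra else base for i in range(n_sets)]

ham_sizes = split_evenly(2021, 10)
spam_sizes = split_evenly(1942, 10)
print(ham_sizes)
print(spam_sizes)
```

With about 194 or 195 spams per set, a single missed spam moves that set's false negative rate by roughly half a percentage point, which is the granularity visible in the per-set figures.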

Comparison of pick_apart_urls off (left column) vs. on (right column),
with mine_received_headers set to False:

false positive percentages
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          

won   0 times
tied 10 times
lost  0 times

total unique fp went from 0 to 0 tied          
mean fp % went from 0.0 to 0.0 tied          

false negative percentages
    1.026  1.026  tied          
    2.051  1.538  won    -25.01%
    2.577  1.546  won    -40.01%
    5.155  4.124  won    -20.00%
    2.062  1.546  won    -25.02%
    4.639  4.124  won    -11.10%
    3.608  3.093  won    -14.27%
    6.186  4.124  won    -33.33%
    3.093  3.093  tied          
    3.608  2.577  won    -28.58%

won   8 times
tied  2 times
lost  0 times

total unique fn went from 66 to 52 won    -21.21%
mean fn % went from 3.40047581285 to 2.67909066878 won    -21.21%
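
The percentage printed after "won" appears to be the relative change from the left column to the right one. That reading is inferred from the numbers, not taken from the comparison script's source, and pct_change is a name chosen here for illustration.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent (negative means a reduction)."""
    return 100.0 * (after - before) / before

# Total unique false negatives: 66 -> 52
print("%.2f%%" % pct_change(66, 52))
# Second per-set false negative rate: 2.051 -> 1.538
print("%.2f%%" % pct_change(2.051, 1.538))
```

Both values round to the figures reported above (-21.21% and -25.01%).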

ham mean                     ham sdev
   0.34    0.34   +0.00%        4.72    4.78   +1.27%
   0.03    0.03   +0.00%        0.38    0.38   +0.00%
   0.17    0.19  +11.76%        1.79    1.82   +1.68%
   0.08    0.08   +0.00%        0.73    0.75   +2.74%
   0.06    0.06   +0.00%        0.64    0.65   +1.56%
   0.10    0.10   +0.00%        1.45    1.47   +1.38%
   0.02    0.02   +0.00%        0.32    0.32   +0.00%
   0.28    0.28   +0.00%        3.93    3.93   +0.00%
   0.05    0.05   +0.00%        0.75    0.75   +0.00%
   0.00    0.00 +(was 0)        0.00    0.00 +(was 0)

ham mean and sdev for all runs
   0.11    0.12   +9.09%        2.12    2.14   +0.94%

spam mean                    spam sdev
  93.87   94.76   +0.95%       16.36   15.16   -7.33%
  95.16   95.67   +0.54%       16.65   15.28   -8.23%
  93.93   94.92   +1.05%       18.64   16.68  -10.52%
  90.62   91.60   +1.08%       24.57   22.95   -6.59%
  93.95   94.55   +0.64%       18.31   17.23   -5.90%
  91.06   92.13   +1.18%       22.59   21.43   -5.14%
  91.77   92.38   +0.66%       21.80   21.14   -3.03%
  91.32   92.28   +1.05%       24.35   22.21   -8.79%
  92.67   93.66   +1.07%       20.41   19.35   -5.19%
  92.45   93.44   +1.07%       21.54   20.09   -6.73%

spam mean and sdev for all runs
  92.68   93.54   +0.93%       20.76   19.39   -6.60%

ham/spam mean difference: 92.57 93.42 +0.85

Comparison of pick_apart_urls off (left column) vs. on (right column),
with mine_received_headers set to True:

false positive percentages
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          

won   0 times
tied 10 times
lost  0 times

total unique fp went from 0 to 0 tied          
mean fp % went from 0.0 to 0.0 tied          

false negative percentages
    1.026  0.513  won    -50.00%
    1.026  0.000  won   -100.00%
    0.515  0.000  won   -100.00%
    3.608  2.577  won    -28.58%
    1.546  1.546  tied          
    2.577  2.577  tied          
    3.093  3.093  tied          
    3.608  2.062  won    -42.85%
    1.546  1.031  won    -33.31%
    2.062  1.546  won    -25.02%

won   7 times
tied  3 times
lost  0 times

total unique fn went from 40 to 29 won    -27.50%
mean fn % went from 2.06079830822 to 1.49458102035 won    -27.48%

ham mean                     ham sdev
   0.33    0.34   +3.03%        4.72    4.78   +1.27%
   0.00    0.00 +(was 0)        0.03    0.03   +0.00%
   0.11    0.12   +9.09%        1.42    1.43   +0.70%
   0.00    0.00 +(was 0)        0.03    0.03   +0.00%
   0.00    0.00 +(was 0)        0.04    0.04   +0.00%
   0.02    0.02   +0.00%        0.21    0.22   +4.76%
   0.00    0.00 +(was 0)        0.00    0.00 +(was 0)
   0.37    0.37   +0.00%        5.20    5.20   +0.00%
   0.00    0.00 +(was 0)        0.00    0.00 +(was 0)
   0.00    0.00 +(was 0)        0.00    0.00 +(was 0)

ham mean and sdev for all runs
   0.08    0.08   +0.00%        2.27    2.28   +0.44%

spam mean                    spam sdev
  95.88   96.44   +0.58%       13.46   12.43   -7.65%
  96.85   97.24   +0.40%       12.37   10.69  -13.58%
  96.07   96.71   +0.67%       13.65   12.16  -10.92%
  93.32   94.08   +0.81%       20.36   18.68   -8.25%
  95.54   95.80   +0.27%       15.56   14.91   -4.18%
  94.20   94.72   +0.55%       18.30   17.73   -3.11%
  93.52   93.83   +0.33%       19.72   19.28   -2.23%
  93.51   94.31   +0.86%       19.99   18.23   -8.80%
  94.99   95.46   +0.49%       17.11   16.41   -4.09%
  94.95   95.42   +0.49%       17.05   16.01   -6.10%

spam mean and sdev for all runs
  94.88   95.40   +0.55%       17.02   15.95   -6.29%

ham/spam mean difference: 94.80 95.32 +0.52

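And finally, here is the table.py comparison of all four option
combinations.  The columns are, in order: neither option,
pick_apart_urls only, mine_received_headers only, and both options: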

filename:     base pick_apart_urls received+urls
ham:spam:  2021:1942       2021:1942      
                   2021:1942       2021:1942
fp total:        0       0       0       0
fp %:         0.00    0.00    0.00    0.00
fn total:       66      52      40      29
fn %:         3.40    2.68    2.06    1.49
unsure t:      200     187     159     155
unsure %:     5.05    4.72    4.01    3.91
real cost: $106.00  $89.40  $71.80  $60.00
best cost:  $53.60  $50.00  $41.60  $39.60
h mean:       0.11    0.12    0.08    0.08
h sdev:       2.12    2.14    2.27    2.28
s mean:      92.68   93.54   94.88   95.40
s sdev:      20.76   19.39   17.02   15.95
mean diff:   92.57   93.42   94.80   95.32
k:            4.05    4.34    4.91    5.23
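
The derived rows of this table can be reproduced from the totals with a little arithmetic. The sketch below is a reconstruction, not code from table.py: the cost weights ($10 per fp, $1 per fn, $0.20 per unsure) and the formula for k are assumptions, chosen because they reproduce the "real cost" and "k" rows. Dividing totals by the corpus size is likewise a simplification that happens to round to the same "fn %" and "unsure %" values.

```python
# Totals copied from the table, left to right: neither option,
# pick_apart_urls only, mine_received_headers only, both options.
N_HAM, N_SPAM = 2021, 1942
FP = [0, 0, 0, 0]
FN = [66, 52, 40, 29]
UNSURE = [200, 187, 159, 155]
H_MEAN = [0.11, 0.12, 0.08, 0.08]
H_SDEV = [2.12, 2.14, 2.27, 2.28]
S_MEAN = [92.68, 93.54, 94.88, 95.40]
S_SDEV = [20.76, 19.39, 17.02, 15.95]

# Assumed weights (not stated in this message).
FP_COST, FN_COST, UNSURE_COST = 10.0, 1.0, 0.2

real_cost = [fp * FP_COST + fn * FN_COST + u * UNSURE_COST
             for fp, fn, u in zip(FP, FN, UNSURE)]
fn_pct = [100.0 * fn / N_SPAM for fn in FN]
unsure_pct = [100.0 * u / (N_HAM + N_SPAM) for u in UNSURE]

# Assumed definition of k: distance between the spam and ham means,
# measured in units of the summed standard deviations.
k = [(sm - hm) / (hs + ss)
     for hm, hs, sm, ss in zip(H_MEAN, H_SDEV, S_MEAN, S_SDEV)]

for i in range(4):
    print("cost $%.2f  fn %.2f%%  unsure %.2f%%  k %.2f"
          % (real_cost[i], fn_pct[i], unsure_pct[i], k[i]))
```

Under these assumptions the unsure count contributes a large share of the cost in every column (for example $40.00 of the $106.00 in the first column), so the drop from 200 to 155 unsures matters nearly as much as the drop in false negatives.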

Kenny Pitt
