[spambayes-dev] A URL experiment

Tony Meyer tameyer at ihug.co.nz
Sun Jan 4 21:10:51 EST 2004


> Happy New Year everyone...

Ditto.

> As Tim predicted, mixing his url cracking ideas with mine 
> leads to better performance than either of our ideas in 
> isolation.  Using the attached patch, I get this summary 
> output for a 10x10 timcv run:

Here are mine, along with a 4-way comparison.  As predicted, my results also
have this combined version as the winner (fewest unsures and lowest cost; the
fp and fn counts are identical in all four runs), although the ham mean &
sdev go up.

bases.txt -> pickv2s.txt
-> <stat> tested 357 hams & 395 spams against 3311 hams & 3704 spams
[19 very similar lines snipped]

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.246  0.246  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.557  0.557  tied
    0.559  0.559  tied
    0.287  0.287  tied
    0.000  0.000  tied

won   0 times
tied 10 times
lost  0 times

total unique fp went from 6 to 6 tied
mean fp % went from 0.164881884948 to 0.164881884948 tied

false negative percentages
    0.253  0.253  tied
    0.781  0.781  tied
    0.462  0.462  tied
    0.756  0.756  tied
    0.243  0.243  tied
    0.247  0.247  tied
    0.240  0.240  tied
    0.494  0.494  tied
    0.973  0.973  tied
    0.454  0.454  tied

won   0 times
tied 10 times
lost  0 times

total unique fn went from 20 to 20 tied
mean fn % went from 0.490257037938 to 0.490257037938 tied

ham mean                     ham sdev
   1.18    1.17   -0.85%        7.76    7.67   -1.16%
   0.99    0.99   +0.00%        6.64    6.64   +0.00%
   0.84    0.85   +1.19%        6.14    6.14   +0.00%
   1.99    2.10   +5.53%        9.46    9.73   +2.85%
   0.49    0.49   +0.00%        3.59    3.58   -0.28%
   0.85    0.89   +4.71%        5.45    5.58   +2.39%
   1.16    1.16   +0.00%        9.30    9.29   -0.11%
   1.20    1.31   +9.17%        8.13    8.68   +6.77%
   1.55    1.55   +0.00%        8.05    8.05   +0.00%
   0.47    0.47   +0.00%        3.22    3.15   -2.17%

ham mean and sdev for all runs
   1.08    1.11   +2.78%        7.13    7.23   +1.40%

spam mean                    spam sdev
  98.75   98.78   +0.03%        8.72    8.56   -1.83%
  97.67   97.71   +0.04%       11.26   11.23   -0.27%
  98.08   98.15   +0.07%       10.12    9.96   -1.58%
  98.16   98.17   +0.01%       10.19   10.17   -0.20%
  98.35   98.42   +0.07%        8.77    8.69   -0.91%
  98.45   98.47   +0.02%        8.97    8.86   -1.23%
  98.35   98.43   +0.08%        9.73    9.65   -0.82%
  98.25   98.36   +0.11%        9.16    8.96   -2.18%
  97.93   97.98   +0.05%       11.99   11.98   -0.08%
  98.92   98.93   +0.01%        7.62    7.63   +0.13%

spam mean and sdev for all runs
  98.30   98.35   +0.05%        9.72    9.64   -0.82%

ham/spam mean difference: 97.22 97.24 +0.02

-> <stat> tested 357 hams & 395 spams against 3311 hams & 3704 spams
[39 very similar lines snipped]

filename:      bases    nntims pickskips   pickv2s
ham:spam:  3668:4099 3668:4099 3668:4099 3668:4099
fp total:          6         6         6         6
fp %:           0.16      0.16      0.16      0.16
fn total:         20        20        20        20
fn %:           0.49      0.49      0.49      0.49
unsure t:        178       173       175       172
unsure %:       2.29      2.23      2.25      2.21
real cost:   $115.60   $114.60   $115.00   $114.40
best cost:    $93.00    $91.20    $92.40    $91.00
h mean:         1.08      1.10      1.08      1.11
h sdev:         7.13      7.21      7.14      7.23
s mean:        98.30     98.34     98.32     98.35
s sdev:         9.72      9.66      9.68      9.64
mean diff:     97.22     97.24     97.24     97.24
k:              5.77      5.76      5.78      5.76
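
The "real cost" and "k" rows in the table above can be reproduced from the
other rows.  The sketch below is only a reading aid: it assumes weights of
$10 per fp, $1 per fn and $0.20 per unsure, and assumes that k is the
ham/spam mean difference divided by the sum of the two sdevs.  Both
assumptions match the figures in the table, but the function and variable
names are illustrative and are not taken from the spambayes scripts.

```python
# Assumed cost weights (consistent with the "real cost" row above).
FP_COST = 10.00      # dollars per false positive
FN_COST = 1.00       # dollars per false negative
UNSURE_COST = 0.20   # dollars per unsure message


def real_cost(fp, fn, unsure):
    """Total cost of a run under the assumed weights."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST


def separation_k(h_mean, h_sdev, s_mean, s_sdev):
    """Assumed definition of k: mean difference over the summed sdevs."""
    return (s_mean - h_mean) / (h_sdev + s_sdev)


# Values copied from the 4-way table above.
runs = {
    "bases":     dict(fp=6, fn=20, unsure=178,
                      h_mean=1.08, h_sdev=7.13, s_mean=98.30, s_sdev=9.72),
    "nntims":    dict(fp=6, fn=20, unsure=173,
                      h_mean=1.10, h_sdev=7.21, s_mean=98.34, s_sdev=9.66),
    "pickskips": dict(fp=6, fn=20, unsure=175,
                      h_mean=1.08, h_sdev=7.14, s_mean=98.32, s_sdev=9.68),
    "pickv2s":   dict(fp=6, fn=20, unsure=172,
                      h_mean=1.11, h_sdev=7.23, s_mean=98.35, s_sdev=9.64),
}

for name, r in runs.items():
    cost = real_cost(r["fp"], r["fn"], r["unsure"])
    k = separation_k(r["h_mean"], r["h_sdev"], r["s_mean"], r["s_sdev"])
    print(f"{name:10s} real cost ${cost:7.2f}   k {k:.2f}")
```

Since fp and fn are the same in all four runs, the whole cost difference
comes from the unsure count: 6 fewer unsures at $0.20 each is the $1.20
between bases and pickv2s.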

And with x-use_bigrams:

basebis.txt -> pickv2bis.txt
-> <stat> tested 357 hams & 395 spams against 3311 hams & 3704 spams
[19 very similar lines snipped]

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.279  0.279  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied 10 times
lost  0 times

total unique fp went from 1 to 1 tied
mean fp % went from 0.0278551532033 to 0.0278551532033 tied

false negative percentages
    0.253  0.253  tied
    1.042  1.042  tied
    0.693  0.462  won    -33.33%
    0.252  0.252  tied
    0.728  0.728  tied
    0.000  0.000  tied
    0.481  0.481  tied
    0.494  0.494  tied
    0.730  0.730  tied
    0.227  0.227  tied

won   1 times
tied  9 times
lost  0 times

total unique fn went from 20 to 19 won     -5.00%
mean fn % went from 0.489899714703 to 0.466805026481 won     -4.71%

ham mean                     ham sdev
   0.95    0.94   -1.05%        6.64    6.60   -0.60%
   0.83    0.83   +0.00%        5.53    5.53   +0.00%
   0.49    0.49   +0.00%        4.08    4.08   +0.00%
   1.53    1.59   +3.92%        8.16    8.42   +3.19%
   0.30    0.29   -3.33%        3.25    3.15   -3.08%
   0.70    0.70   +0.00%        5.27    5.26   -0.19%
   0.85    0.86   +1.18%        7.11    7.12   +0.14%
   0.93    0.96   +3.23%        7.23    7.53   +4.15%
   0.90    0.90   +0.00%        6.47    6.43   -0.62%
   0.41    0.41   +0.00%        4.07    4.06   -0.25%

ham mean and sdev for all runs
   0.80    0.81   +1.25%        6.01    6.07   +1.00%

spam mean                    spam sdev
  98.71   98.74   +0.03%        7.83    7.73   -1.28%
  97.38   97.39   +0.01%       12.55   12.54   -0.08%
  97.78   97.83   +0.05%       11.09   10.74   -3.16%
  97.89   97.91   +0.02%       10.49   10.47   -0.19%
  97.90   97.94   +0.04%       10.03   10.03   +0.00%
  98.32   98.32   +0.00%        8.63    8.60   -0.35%
  98.19   98.23   +0.04%       10.21   10.19   -0.20%
  97.68   97.78   +0.10%       10.99   10.71   -2.55%
  97.86   97.93   +0.07%       11.56   11.54   -0.17%
  98.73   98.74   +0.01%        7.57    7.57   +0.00%

spam mean and sdev for all runs
  98.05   98.09   +0.04%       10.20   10.11   -0.88%

ham/spam mean difference: 97.25 97.28 +0.03
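
The won/tied/lost labels and percentages in the fp and fn sections above
appear to be plain relative changes from the first column to the second,
with a lower error rate counting as a win.  A minimal sketch of that
reading follows; the function name is hypothetical and this is not the
comparison script's actual code.

```python
def compare_rates(before, after):
    """Classify a change in an error rate; lower is better.

    Returns (label, percent_change).  percent_change is None when the
    rates are tied or when 'before' is zero (relative change undefined).
    """
    if after == before:
        return "tied", None
    label = "won" if after < before else "lost"
    if before == 0:
        return label, None
    return label, (after - before) / before * 100.0


# Figures taken from the bigram comparison above.
for before, after in [(0.693, 0.462),
                      (20, 19),
                      (0.489899714703, 0.466805026481),
                      (0.279, 0.279)]:
    label, pct = compare_rates(before, after)
    shown = "" if pct is None else f"{pct:+.2f}%"
    print(f"{before} -> {after}: {label} {shown}")
```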

=Tony Meyer



