[spambayes-dev] A URL experiment
Kenny Pitt
kennypitt at hotmail.com
Mon Jan 5 16:42:14 EST 2004
Here are my test results against 2021 hams and 1942 spams, spread evenly
across 10 sets. The test data is a complete capture of my e-mail stream
from a couple of months ago, plus some more recent mail that was still
lying around in my mail folders and recent training data.
============================================================
Comparison of pick_apart_urls off (left column) vs. on (right column), with mine_received_headers set to False:
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
won 0 times
tied 10 times
lost 0 times
total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied
false negative percentages
1.026 1.026 tied
2.051 1.538 won -25.01%
2.577 1.546 won -40.01%
5.155 4.124 won -20.00%
2.062 1.546 won -25.02%
4.639 4.124 won -11.10%
3.608 3.093 won -14.27%
6.186 4.124 won -33.33%
3.093 3.093 tied
3.608 2.577 won -28.58%
won 8 times
tied 2 times
lost 0 times
total unique fn went from 66 to 52 won -21.21%
mean fn % went from 3.40047581285 to 2.67909066878 won -21.21%
ham mean                    ham sdev
   0.34    0.34   +0.00%        4.72    4.78   +1.27%
   0.03    0.03   +0.00%        0.38    0.38   +0.00%
   0.17    0.19  +11.76%        1.79    1.82   +1.68%
   0.08    0.08   +0.00%        0.73    0.75   +2.74%
   0.06    0.06   +0.00%        0.64    0.65   +1.56%
   0.10    0.10   +0.00%        1.45    1.47   +1.38%
   0.02    0.02   +0.00%        0.32    0.32   +0.00%
   0.28    0.28   +0.00%        3.93    3.93   +0.00%
   0.05    0.05   +0.00%        0.75    0.75   +0.00%
   0.00    0.00 +(was 0)        0.00    0.00 +(was 0)

ham mean and sdev for all runs
   0.11    0.12   +9.09%        2.12    2.14   +0.94%

spam mean                   spam sdev
  93.87   94.76   +0.95%       16.36   15.16   -7.33%
  95.16   95.67   +0.54%       16.65   15.28   -8.23%
  93.93   94.92   +1.05%       18.64   16.68  -10.52%
  90.62   91.60   +1.08%       24.57   22.95   -6.59%
  93.95   94.55   +0.64%       18.31   17.23   -5.90%
  91.06   92.13   +1.18%       22.59   21.43   -5.14%
  91.77   92.38   +0.66%       21.80   21.14   -3.03%
  91.32   92.28   +1.05%       24.35   22.21   -8.79%
  92.67   93.66   +1.07%       20.41   19.35   -5.19%
  92.45   93.44   +1.07%       21.54   20.09   -6.73%

spam mean and sdev for all runs
  92.68   93.54   +0.93%       20.76   19.39   -6.60%

ham/spam mean difference: 92.57 93.42 +0.85
============================================================
Comparison of pick_apart_urls off (left column) vs. on (right column), with mine_received_headers set to True:
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
won 0 times
tied 10 times
lost 0 times
total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied
false negative percentages
1.026 0.513 won -50.00%
1.026 0.000 won -100.00%
0.515 0.000 won -100.00%
3.608 2.577 won -28.58%
1.546 1.546 tied
2.577 2.577 tied
3.093 3.093 tied
3.608 2.062 won -42.85%
1.546 1.031 won -33.31%
2.062 1.546 won -25.02%
won 7 times
tied 3 times
lost 0 times
total unique fn went from 40 to 29 won -27.50%
mean fn % went from 2.06079830822 to 1.49458102035 won -27.48%
ham mean                    ham sdev
   0.33    0.34   +3.03%        4.72    4.78   +1.27%
   0.00    0.00 +(was 0)        0.03    0.03   +0.00%
   0.11    0.12   +9.09%        1.42    1.43   +0.70%
   0.00    0.00 +(was 0)        0.03    0.03   +0.00%
   0.00    0.00 +(was 0)        0.04    0.04   +0.00%
   0.02    0.02   +0.00%        0.21    0.22   +4.76%
   0.00    0.00 +(was 0)        0.00    0.00 +(was 0)
   0.37    0.37   +0.00%        5.20    5.20   +0.00%
   0.00    0.00 +(was 0)        0.00    0.00 +(was 0)
   0.00    0.00 +(was 0)        0.00    0.00 +(was 0)

ham mean and sdev for all runs
   0.08    0.08   +0.00%        2.27    2.28   +0.44%

spam mean                   spam sdev
  95.88   96.44   +0.58%       13.46   12.43   -7.65%
  96.85   97.24   +0.40%       12.37   10.69  -13.58%
  96.07   96.71   +0.67%       13.65   12.16  -10.92%
  93.32   94.08   +0.81%       20.36   18.68   -8.25%
  95.54   95.80   +0.27%       15.56   14.91   -4.18%
  94.20   94.72   +0.55%       18.30   17.73   -3.11%
  93.52   93.83   +0.33%       19.72   19.28   -2.23%
  93.51   94.31   +0.86%       19.99   18.23   -8.80%
  94.99   95.46   +0.49%       17.11   16.41   -4.09%
  94.95   95.42   +0.49%       17.05   16.01   -6.10%

spam mean and sdev for all runs
  94.88   95.40   +0.55%       17.02   15.95   -6.29%

ham/spam mean difference: 94.80 95.32 +0.52
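
For anyone who wants to check the arithmetic: the "won -NN.NN%" figures in the two comparisons look like plain relative changes from the left column to the right one. Here is a minimal Python sketch that reproduces them under that assumption. The helper names percent_change and describe are mine, for illustration only; this is not the code from cmp.py.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent.

    Returns None when before is zero, since the ratio is undefined
    (the listings above show "+(was 0)" in that case).
    """
    if before == 0:
        return None
    return (after - before) / before * 100.0


def describe(before, after):
    """Summarise an error-rate change; lower is better, so a drop is a win."""
    if after == before:
        return "tied"
    verdict = "won" if after < before else "lost"
    change = percent_change(before, after)
    if change is None:
        return f"{verdict} +(was 0)"
    return f"{verdict} {change:+.2f}%"


if __name__ == "__main__":
    # Total unique fn counts quoted in the two comparisons.
    print("mine_received_headers=False:", describe(66, 52))
    print("mine_received_headers=True: ", describe(40, 29))
```

The same function applied to a single fold, such as 2.051 and 1.538, gives the per-set percentage shown next to that row.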
============================================================
And finally, here is the table.py comparison of all four option
combinations:
filename:       base  pick_apart_urls  mine_received  received+urls
ham:spam:  2021:1942        2021:1942      2021:1942      2021:1942
fp total:          0                0              0              0
fp %:           0.00             0.00           0.00           0.00
fn total:         66               52             40             29
fn %:           3.40             2.68           2.06           1.49
unsure t:        200              187            159            155
unsure %:       5.05             4.72           4.01           3.91
real cost:   $106.00           $89.40         $71.80         $60.00
best cost:    $53.60           $50.00         $41.60         $39.60
h mean:         0.11             0.12           0.08           0.08
h sdev:         2.12             2.14           2.27           2.28
s mean:        92.68            93.54          94.88          95.40
s sdev:        20.76            19.39          17.02          15.95
mean diff:     92.57            93.42          94.80          95.32
k:              4.05             4.34           4.91           5.23
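
The "real cost" and "k" rows can be cross-checked from the other rows in the table. The sketch below assumes cost weights of $10 per false positive, $1 per false negative and $0.20 per unsure, and assumes k = (s mean - h mean) / (h sdev + s sdev). Neither is stated anywhere above; I picked them because they reproduce the figures in the table. The function names are hypothetical.

```python
# Assumed cost weights (not stated in the post; chosen because they
# reproduce the "real cost" row).
FP_WEIGHT = 10.00
FN_WEIGHT = 1.00
UNSURE_WEIGHT = 0.20


def real_cost(fp, fn, unsure):
    """Weighted cost of one run from its fp, fn and unsure counts."""
    return fp * FP_WEIGHT + fn * FN_WEIGHT + unsure * UNSURE_WEIGHT


def separation_k(h_mean, h_sdev, s_mean, s_sdev):
    """Assumed definition of k: gap between the means over summed sdevs."""
    return (s_mean - h_mean) / (h_sdev + s_sdev)


# Figures copied from the table: (fp, fn, unsure) and
# (h mean, h sdev, s mean, s sdev) for each option combination.
RUNS = {
    "base":            ((0, 66, 200), (0.11, 2.12, 92.68, 20.76)),
    "pick_apart_urls": ((0, 52, 187), (0.12, 2.14, 93.54, 19.39)),
    "mine_received":   ((0, 40, 159), (0.08, 2.27, 94.88, 17.02)),
    "received+urls":   ((0, 29, 155), (0.08, 2.28, 95.40, 15.95)),
}

if __name__ == "__main__":
    for name, (counts, stats) in RUNS.items():
        cost = real_cost(*counts)
        k = separation_k(*stats)
        print(f"{name:16s} real cost ${cost:7.2f}   k {k:.2f}")
```

Since k is computed here from values already rounded to two decimals, it could in principle differ from the table in the last digit, although for these four runs it does not.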
--
Kenny Pitt