[spambayes-dev] spammy subject lines
T. Alexander Popiel
popiel at wolfskeep.com
Wed Oct 15 13:48:36 EDT 2003
In message: <1ED4ECF91CDED24C8D012BCF2B034F13026F29A6 at its-xchg4.massey.ac.nz>
"Tony Meyer" <tameyer at ihug.co.nz> writes:
>
>Of course, my data doesn't really tell us anything until we can compare
>it to someone else's...hopefully the OP, at least, will give this a go.
Well, here are my results (sorry about being slow; to get any mails with
the new obfuscated subject lines, I needed to regrab (and reclean) my
corpora into the testing framework):
output/newnormal -> output/subjstrip
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2151 hams & 5576 spams against 19367 hams & 50183 spams
-> <stat> tested 2151 hams & 5575 spams against 19367 hams & 50184 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2151 hams & 5576 spams against 19367 hams & 50183 spams
-> <stat> tested 2151 hams & 5575 spams against 19367 hams & 50184 spams
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.186 0.186 tied
0.093 0.093 tied
0.000 0.000 tied
0.000 0.000 tied
0.093 0.093 tied
0.000 0.000 tied
0.046 0.046 tied
0.046 0.046 tied
won 0 times
tied 10 times
lost 0 times
total unique fp went from 10 to 10 tied
mean fp % went from 0.0464727221194 to 0.0464727221194 tied
false negative percentages
0.287 0.287 tied
0.412 0.412 tied
0.287 0.269 won -6.27%
0.305 0.305 tied
0.251 0.233 won -7.17%
0.377 0.359 won -4.77%
0.287 0.287 tied
0.377 0.395 lost +4.77%
0.215 0.233 lost +8.37%
0.269 0.269 tied
won 3 times
tied 5 times
lost 2 times
total unique fn went from 171 to 170 won -0.58%
mean fn % went from 0.306676274359 to 0.304882874073 won -0.58%
ham mean ham sdev
0.20 0.21 +5.00% 3.27 3.43 +4.89%
0.20 0.20 +0.00% 2.78 2.78 +0.00%
0.49 0.49 +0.00% 5.61 5.61 +0.00%
0.24 0.24 +0.00% 4.02 4.03 +0.25%
0.13 0.14 +7.69% 2.21 2.21 +0.00%
0.14 0.14 +0.00% 2.68 2.65 -1.12%
0.23 0.24 +4.35% 3.97 4.04 +1.76%
0.14 0.14 +0.00% 2.71 2.70 -0.37%
0.10 0.10 +0.00% 2.63 2.64 +0.38%
0.30 0.31 +3.33% 4.18 4.24 +1.44%
ham mean and sdev for all runs
0.22 0.22 +0.00% 3.55 3.57 +0.56%
spam mean spam sdev
99.02 99.03 +0.01% 7.26 7.21 -0.69%
98.87 98.90 +0.03% 8.18 8.08 -1.22%
99.14 99.16 +0.02% 6.83 6.73 -1.46%
98.90 98.91 +0.01% 7.73 7.70 -0.39%
98.99 99.02 +0.03% 7.00 6.89 -1.57%
98.88 98.89 +0.01% 7.90 7.85 -0.63%
98.96 98.98 +0.02% 7.52 7.46 -0.80%
98.87 98.89 +0.02% 8.05 8.01 -0.50%
99.08 99.09 +0.01% 6.97 6.91 -0.86%
99.15 99.17 +0.02% 6.83 6.75 -1.17%
spam mean and sdev for all runs
98.99 99.00 +0.01% 7.44 7.38 -0.81%
ham/spam mean difference: 98.77 98.78 +0.01
--------
filename: newnormal subjstrip
ham:spam: 21518:55759 21518:55759
fp total: 10 10
fp %: 0.05 0.05
fn total: 171 170
fn %: 0.31 0.30
unsure t: 1098 1075
unsure %: 1.42 1.39
real cost: $490.60 $485.00
best cost: $395.40 $401.80
h mean: 0.22 0.22
h sdev: 3.55 3.57
s mean: 98.99 99.00
s sdev: 7.44 7.38
mean diff: 98.77 98.78
k: 8.99 9.02
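For anyone wanting to sanity-check the summary table, here is a minimal sketch that recomputes the "real cost" and the percentage rows from the raw counts. The weights ($10.00 per fp, $1.00 per fn, $0.20 per unsure) are an assumption on my part; they are used here only because they reproduce the two reported cost figures. The function names are made up for illustration.

```python
# Assumed cost weights; chosen because they reproduce the reported totals.
FP_COST = 10.00      # dollars per false positive (assumption)
FN_COST = 1.00       # dollars per false negative (assumption)
UNSURE_COST = 0.20   # dollars per unsure message (assumption)

N_HAM = 21518
N_SPAM = 55759


def real_cost(fp, fn, unsure):
    """Weighted cost of one run under the assumed weights."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST


def rates(fp, fn, unsure):
    """Return (fp %, fn %, unsure %) for one run.

    fp is measured against ham, fn against spam, unsure against everything.
    """
    return (
        100.0 * fp / N_HAM,
        100.0 * fn / N_SPAM,
        100.0 * unsure / (N_HAM + N_SPAM),
    )


for name, fp, fn, unsure in [
    ("newnormal", 10, 171, 1098),
    ("subjstrip", 10, 170, 1075),
]:
    fp_pct, fn_pct, unsure_pct = rates(fp, fn, unsure)
    print(f"{name}: cost ${real_cost(fp, fn, unsure):.2f}, "
          f"fp {fp_pct:.2f}%, fn {fn_pct:.2f}%, unsure {unsure_pct:.2f}%")
```

Under those weights, most of the cost difference between the two runs comes from the drop in unsures rather than from the single false negative.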
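Overall, it looks like a very minor win: false positives are unchanged,
false negatives drop from 171 to 170, and unsures drop from 1098 to 1075.
It makes the ham slightly less distinct (sdev 3.55 -> 3.57), but the spam
slightly more distinct (sdev 7.44 -> 7.38).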
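The won/tied/lost tally in the false negative section can be reproduced from the per-run percentages. The sketch below is not the comparison script itself, just a rough re-derivation under the assumption that the relative change is (after - before) / before and that a lower rate counts as a win; `compare` is a hypothetical helper name.

```python
def compare(before, after):
    """Return (tag, relative change in percent) for one before/after pair.

    Assumes lower is better, as for error rates.
    """
    if before == after:
        return "tied", 0.0
    if before == 0:
        # Relative change is undefined from a zero baseline.
        return "lost", float("inf")
    change = (after - before) / before * 100.0
    return ("won" if after < before else "lost"), change


# Per-run false negative percentages: (newnormal, subjstrip)
fn_pairs = [
    (0.287, 0.287), (0.412, 0.412), (0.287, 0.269), (0.305, 0.305),
    (0.251, 0.233), (0.377, 0.359), (0.287, 0.287), (0.377, 0.395),
    (0.215, 0.233), (0.269, 0.269),
]

tally = {"won": 0, "tied": 0, "lost": 0}
for before, after in fn_pairs:
    tag, change = compare(before, after)
    tally[tag] += 1
    print(f"{before:.3f} {after:.3f} {tag} {change:+.2f}%")

print(tally)
```

With only one message separating the two totals, the per-run wins and losses are each a single spam moving across the cutoff, which is why the swings look large in relative terms.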
- Alex