[spambayes-dev] New results for DNS lookup in tokenizer
Matthew Dixon Cowles
matt at mondoinfo.com
Thu Apr 15 21:37:26 EDT 2004
It turns out that I was right when I speculated that using DNS
lookups would work better on more-recent spam. I re-did my spam sets
from the thousand most recent spams in my spam archive and got rather
better results:
new-pick-aparts.txt -> new-dnss.txt
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.500 0.500 tied
0.000 0.000 tied
0.000 0.000 tied
won 0 times
tied 5 times
lost 0 times
total unique fp went from 1 to 1 tied
mean fp % went from 0.1 to 0.1 tied
false negative percentages
0.500 0.000 won -100.00%
4.500 3.500 won -22.22%
1.000 0.500 won -50.00%
0.000 0.000 tied
3.000 2.500 won -16.67%
won 4 times
tied 1 times
lost 0 times
total unique fn went from 18 to 13 won -27.78%
mean fn % went from 1.8 to 1.3 won -27.78%
ham mean ham sdev
0.34 0.33 -2.94% 3.28 3.22 -1.83%
0.14 0.14 +0.00% 1.39 1.36 -2.16%
0.50 0.50 +0.00% 6.84 6.84 +0.00%
0.48 0.31 -35.42% 3.75 2.10 -44.00%
0.35 0.38 +8.57% 3.78 4.15 +9.79%
ham mean and sdev for all runs
0.36 0.33 -8.33% 4.19 4.02 -4.06%
spam mean spam sdev
98.00 98.49 +0.50% 10.43 8.53 -18.22%
94.60 95.38 +0.82% 19.89 18.38 -7.59%
97.52 97.96 +0.45% 11.40 10.63 -6.75%
98.77 98.87 +0.10% 6.47 6.81 +5.26%
94.78 95.38 +0.63% 18.47 17.51 -5.20%
spam mean and sdev for all runs
96.73 97.22 +0.51% 14.37 13.33 -7.24%
ham/spam mean difference: 96.37 96.89 +0.52
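The percentage column in the comparison above appears to be the relative change from the first run to the second, with a drop in an error rate counted as a win. A minimal sketch of that arithmetic follows; the function names are hypothetical and are not taken from the SpamBayes test tools.

```python
def relative_change(before, after):
    """Percent change from `before` to `after`; negative means the value dropped."""
    if before == 0:
        # Avoid dividing by zero: no change is 0%, any increase is unbounded.
        return 0.0 if after == 0 else float("inf")
    return (after - before) / before * 100.0


def verdict(before, after):
    """For error rates lower is better, so a decrease counts as a win."""
    if after < before:
        return "won"
    if after > before:
        return "lost"
    return "tied"


# Figures taken from the false negative totals reported above.
fn_before, fn_after = 18, 13
print("total unique fn went from %d to %d %s %+.2f%%" % (
    fn_before, fn_after, verdict(fn_before, fn_after),
    relative_change(fn_before, fn_after)))
```

Applied to the per-run false negative rates, the same calculation gives -22.22% for 4.500 to 3.500 and -100.00% for 0.500 to 0.000, matching the listing.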
In addition, the number of unsures decreased somewhat:
filename: new-pick-apart new-dns
ham:spam: 1000:1000 1000:1000
fp total: 1 1
fp %: 0.10 0.10
fn total: 18 13
fn %: 1.80 1.30
unsure t: 46 40
unsure %: 2.30 2.00
real cost: $37.20 $31.00
best cost: $21.60 $19.80
h mean: 0.36 0.33
h sdev: 4.19 4.02
s mean: 96.73 97.22
s sdev: 14.37 13.33
mean diff: 96.37 96.89
k: 5.19 5.58
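The "real cost" and "k" rows can be reproduced from the other rows of the table. The sketch below assumes weights of $10 per false positive, $1 per false negative and $0.20 per unsure, and assumes that k is the ham/spam mean difference divided by the sum of the two standard deviations. Both are inferred from the figures above rather than taken from the tool's source, and the function names are hypothetical. The "best cost" row depends on choosing optimal cutoffs and is not reproduced here.

```python
# Assumed per-message weights, inferred from the table (not read from SpamBayes).
FP_COST = 10.00
FN_COST = 1.00
UNSURE_COST = 0.20


def real_cost(fp, fn, unsure):
    """Weighted cost of the classification mistakes in one test run."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST


def separation_k(ham_mean, ham_sdev, spam_mean, spam_sdev):
    """Distance between the score means, in units of the summed deviations."""
    return (spam_mean - ham_mean) / (ham_sdev + spam_sdev)


# Values copied from the two columns of the table above.
print("real cost: $%.2f $%.2f" % (real_cost(1, 18, 46), real_cost(1, 13, 40)))
print("k: %.2f %.2f" % (separation_k(0.36, 4.19, 96.73, 14.37),
                        separation_k(0.33, 4.02, 97.22, 13.33)))
```

Under these assumptions, the saving in the second run comes from five fewer false negatives and six fewer unsures, since the false positive count is unchanged.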
That's not an enormous win, but it suggests that I am probably seeing
the improvement in my inbox that I think I'm seeing. And the
false negatives that are eliminated are nonsense spams or spams with
lots of bland, unrelated text in them.
It's very arguable that a technique that only works well on recent
spam shouldn't be included in SpamBayes until it has proven its value
over some time.
Regards,
Matt