[spambayes-dev] Results for DNS lookup in tokenizer
Tony Meyer
tameyer at ihug.co.nz
Tue Apr 13 01:59:34 EDT 2004
Have you tried the x-slurp_urls option as a solution for this problem?
(I'm not saying it's a better solution; I'm just curious whether you have
and, if so, what the results were.)
> In case anyone would like to play with it, I'll append my trivial
> patch. It requires pydns from:
>
> http://sourceforge.net/projects/pydns/
This concerns me a bit. I'd want to see really dramatic results before
something in the core distribution required a non-standard library to be
installed. How complex is the code that the patch relies on? Running
timcv.py was also *really* slow. I don't know whether that was because
lookups for a lot of messages timed out, or because the DNS lookups
themselves were slow, or something else, but it worries me a bit. Doing a
DNS enquiry interactively was very quick, and at this time of night our DNS
server isn't used much at all, so it is quite responsive.
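One way to tell timeouts apart from uniformly slow lookups would be to time
each lookup separately. The following is only a sketch under assumptions: it
uses the standard library's socket.gethostbyname rather than pydns, and
timed_lookup and summarize are hypothetical helper names that are not part of
SpamBayes or of the patch.

```python
import socket
import time


def timed_lookup(hostname, resolver=socket.gethostbyname):
    """Resolve one hostname; return (address or None, elapsed seconds)."""
    start = time.perf_counter()
    try:
        address = resolver(hostname)
    except OSError:
        # socket.gaierror and socket.timeout are both OSError subclasses.
        address = None
    return address, time.perf_counter() - start


def summarize(hostnames, resolver=socket.gethostbyname):
    """Time every lookup and report how many failed and how long they took."""
    results = [timed_lookup(name, resolver) for name in hostnames]
    times = [elapsed for _, elapsed in results]
    return {
        "lookups": len(results),
        "failures": sum(1 for address, _ in results if address is None),
        "total_seconds": sum(times),
        "slowest_seconds": max(times, default=0.0),
    }
```

Calling summarize() on the hostnames extracted from a sample of messages
would show whether the total time is dominated by a few very slow (or failed)
lookups or is spread evenly across all of them.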
Here are my results from timcv.py -n5 on two corpora. For each corpus the
cmp.py output comes first, followed by table.py output that also includes a
run with just the defaults.
The first one (my wife's mail for the last few months) is a win (-1 fn, -4
unsure). The second one (my work mail for the last few months) is a loss
(two unsure move into fn in one run, the rest unchanged).
Note that in both of these the standard x-pick_apart_urls option does
nothing (good or bad) for me.
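The percentages that cmp.py prints next to "won" and "lost" in the output
below appear to be the relative change from the first column to the second.
A small sketch that reproduces them (pct_change is a hypothetical name, and
this is not cmp.py's own code):

```python
def pct_change(before, after):
    """Relative change from before to after, expressed as a percentage."""
    if before == 0:
        raise ValueError("relative change is undefined when before is 0")
    return (after - before) * 100.0 / before


# Figures taken from the cmp.py output in this message.
print("%+.2f%%" % pct_change(21, 20))        # total unique fn, first corpus
print("%+.2f%%" % pct_change(35, 37))        # total unique fn, second corpus
print("%+.2f%%" % pct_change(7.813, 9.375))  # the one losing run, second corpus
```

With so few false negatives in each run, a single message changing category
moves these percentages a long way, so the raw counts are the safer thing to
compare.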
-> <stat> tested 101 hams & 358 spams against 398 hams & 1427 spams
-> <stat> tested 100 hams & 359 spams against 399 hams & 1426 spams
-> <stat> tested 100 hams & 358 spams against 399 hams & 1427 spams
-> <stat> tested 99 hams & 353 spams against 400 hams & 1432 spams
-> <stat> tested 99 hams & 357 spams against 400 hams & 1428 spams
-> <stat> tested 101 hams & 358 spams against 398 hams & 1427 spams
-> <stat> tested 100 hams & 359 spams against 399 hams & 1426 spams
-> <stat> tested 100 hams & 358 spams against 399 hams & 1427 spams
-> <stat> tested 99 hams & 353 spams against 400 hams & 1432 spams
-> <stat> tested 99 hams & 357 spams against 400 hams & 1428 spams
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
won 0 times
tied 5 times
lost 0 times
total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied
false negative percentages
0.279 0.279 tied
0.279 0.279 tied
0.559 0.559 tied
2.266 2.266 tied
2.521 2.241 won -11.11%
won 1 times
tied 4 times
lost 0 times
total unique fn went from 21 to 20 won -4.76%
mean fn % went from 1.18076754281 to 1.12474513385 won -4.74%
  ham mean                   ham sdev
  0.00   0.01  +(was 0)      0.04   0.04   +0.00%
  0.49   0.49    +0.00%      4.91   4.91   +0.00%
  0.02   0.01   -50.00%      0.12   0.11   -8.33%
  0.03   0.02   -33.33%      0.21   0.21   +0.00%
  0.01   0.01    +0.00%      0.08   0.08   +0.00%
ham mean and sdev for all runs
  0.11   0.11    +0.00%      2.21   2.21   +0.00%
  spam mean                  spam sdev
 96.02  96.11    +0.09%     13.44  13.60   +1.19%
 97.15  97.31    +0.16%     11.27  11.10   -1.51%
 97.12  97.30    +0.19%     11.86  11.89   +0.25%
 94.93  94.92    -0.01%     17.08  17.53   +2.63%
 94.99  95.08    +0.09%     17.16  17.26   +0.58%
spam mean and sdev for all runs
 96.05  96.15    +0.10%     14.40  14.55   +1.04%
ham/spam mean difference: 95.94 96.04 +0.10
filename:     libbys  libby_picks  libby_pickms
ham:spam:   499:1785     499:1785      499:1785
fp total:          0            0             0
fp %:           0.00         0.00          0.00
fn total:         21           21            20
fn %:           1.18         1.18          1.12
unsure t:        118          119           114
unsure %:       5.17         5.21          4.99
real cost:    $44.60       $44.80        $42.80
best cost:    $11.80       $11.80        $12.00
h mean:         0.11         0.11          0.11
h sdev:         2.21         2.21          2.21
s mean:        96.04        96.05         96.15
s sdev:        14.40        14.40         14.55
mean diff:     95.93        95.94         96.04
k:              5.78         5.78          5.73
-> <stat> tested 280 hams & 131 spams against 1111 hams & 512 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 277 hams & 128 spams against 1114 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 280 hams & 131 spams against 1111 hams & 512 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 277 hams & 128 spams against 1114 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
won 0 times
tied 5 times
lost 0 times
total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied
false negative percentages
6.870 6.870 tied
3.125 3.125 tied
7.813 9.375 lost +19.99%
3.906 3.906 tied
5.469 5.469 tied
won 0 times
tied 4 times
lost 1 times
total unique fn went from 35 to 37 lost +5.71%
mean fn % went from 5.43654580153 to 5.74904580153 lost +5.75%
  ham mean                   ham sdev
  0.18   0.18    +0.00%      1.77   1.77   +0.00%
  0.01   0.01    +0.00%      0.17   0.17   +0.00%
  0.01   0.01    +0.00%      0.12   0.12   +0.00%
  0.03   0.01   -66.67%      0.39   0.13  -66.67%
  0.28   0.29    +3.57%      3.37   3.38   +0.30%
ham mean and sdev for all runs
  0.10   0.10    +0.00%      1.72   1.71   -0.58%
  spam mean                  spam sdev
 88.89  88.89    +0.00%     25.38  25.48   +0.39%
 90.07  90.39    +0.36%     23.20  22.75   -1.94%
 87.23  87.13    -0.11%     28.96  29.35   +1.35%
 90.79  90.92    +0.14%     23.89  23.80   -0.38%
 90.31  90.67    +0.40%     25.99  25.52   -1.81%
spam mean and sdev for all runs
 89.46  89.60    +0.16%     25.59  25.52   -0.27%
ham/spam mean difference: 89.36 89.50 +0.14
filename:   exchanges  exchange_picks  exchange_pickms
ham:spam:    1391:643        1391:643         1391:643
fp total:           0               0                0
fp %:            0.00            0.00             0.00
fn total:          35              35               37
fn %:            5.44            5.44             5.75
unsure t:          83              82               80
unsure %:        4.08            4.03             3.93
real cost:     $51.60          $51.40           $53.00
best cost:     $33.80          $33.80           $33.00
h mean:          0.10            0.10             0.10
h sdev:          1.72            1.72             1.71
s mean:         89.34           89.46            89.60
s sdev:         25.65           25.59            25.52
mean diff:      89.24           89.36            89.50
k:               3.26            3.27             3.29
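The "real cost" rows in both tables are consistent with a weighted sum in
which each fp costs $10, each fn costs $1, and each unsure costs $0.20.
Those weights are inferred from the figures in this message rather than read
from the SpamBayes configuration, and real_cost is a hypothetical helper
name. A sketch that recomputes the rows:

```python
def real_cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
    """Weighted cost of a run; the default weights are inferred, not official."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight


# (name, fp total, fn total, unsure t) copied from the two tables above.
rows = [
    ("libbys", 0, 21, 118),
    ("libby_picks", 0, 21, 119),
    ("libby_pickms", 0, 20, 114),
    ("exchanges", 0, 35, 83),
    ("exchange_picks", 0, 35, 82),
    ("exchange_pickms", 0, 37, 80),
]
for name, fp, fn, unsure in rows:
    print("%-16s $%.2f" % (name, real_cost(fp, fn, unsure)))
```

Under these weights one fn costs as much as five unsures, which is why the
second corpus comes out as a loss even though its unsure count went down.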
=Tony Meyer