[spambayes-dev] A URL experiment
Skip Montanaro
skip at pobox.com
Wed Dec 31 10:53:27 EST 2003
Tim> Note that this part of the patch can't be helping much:
Tim> + num_pcs = url.count("%")
Tim> + if num_pcs:
Tim> +     pushclue("url:%d %%s" % num_pcs)
Tim> That is, raw counts are almost never useful -- if I have a URL in a
Tim> spam that embeds 40 escapes, that does nothing to indict a URL with
Tim> 39 (or 41) escapes. Pumping out log2(a_count) usually does more
Tim> good.
<aside type="slight">
"url:has user" seems to be fairly spammy for me:
% spamcounts -r -d ~/tmp/hammie.db '^url:has user'
db: /Users/skip/tmp/hammie.db
token,nspam,nham,spam prob
url:has user,42,4,0.91016660508
</aside>
Okay, here are the raw counts of URL percent signs in my current
ham/spam database:
npcs  nspam  nham
   1     21    46
   2      4     1
   3      2     2
   4      1     2
   5      0     1
   6      2     2
   7      1     1
   8      0     2
  14      2     0
  15      0     1
  16      1     0
  18      1     0
  23      1     0
  24      1     0
  28      1     0
  30      1     0
  38      2     0
  40      1     0
  42      1     0
  74      1     0
  75      1     0
  84      1     0
  97      1     0
 103      1     0
 109      1     0
 191      1     0
I redid my patch to generate tokens like so:
pushclue("url:%%%d" % int(log2(num_pcs)))
Converting the first column to int(log(n,2)) then rebuilding the database
gives:
log2(npcs)  nspam  nham
         0     21    46
         1      6     3
         2      4     2
         3      2     2
         4      5     0
         5      3     0
         6      2     0
         7      1     0
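The bucketing used for these tokens can be sketched as a small standalone function. This is only an illustration under stated assumptions: the name percent_bucket_token is hypothetical, it returns the token instead of calling the patch's pushclue, and math.log2 from the standard library stands in for whatever log2 helper the patch actually uses (its definition is not shown in this message).

```python
import math

def percent_bucket_token(url):
    """Return a log2-bucketed token for the '%' count in url, or None.

    Hypothetical helper: mirrors the token format in the patch line
    above, but returns the token rather than calling pushclue().
    """
    num_pcs = url.count("%")
    if not num_pcs:
        # No percent signs at all: emit no token.
        return None
    # int() truncates, so counts 32..63 all land in bucket 5, etc.
    return "url:%%%d" % int(math.log2(num_pcs))

if __name__ == "__main__":
    # URLs with 39, 40 and 41 escapes now share a single token.
    for n in (1, 39, 40, 41):
        url = "http://example.com/" + "%41" * n
        print(n, percent_bucket_token(url))
```

With this scheme a URL carrying 39 escapes and one carrying 41 produce the same clue, which is the effect Tim describes; the raw-count tokens treated them as unrelated.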
The new cv test results are essentially the same (I still have just five
sets):
stds.txt -> pickurlss.txt
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
false positive percentages
0.000 0.000 tied
0.400 0.400 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
won 0 times
tied 5 times
lost 0 times
total unique fp went from 1 to 1 tied
mean fp % went from 0.08 to 0.08 tied
false negative percentages
3.333 3.333 tied
5.000 5.000 tied
7.333 7.333 tied
5.667 5.667 tied
4.000 4.000 tied
won 0 times
tied 5 times
lost 0 times
total unique fn went from 76 to 76 tied
mean fn % went from 5.06666666667 to 5.06666666667 tied
ham mean ham sdev
1.64 1.64 +0.00% 8.44 8.45 +0.12%
0.99 0.99 +0.00% 8.29 8.29 +0.00%
2.82 2.82 +0.00% 12.52 12.52 +0.00%
1.58 1.58 +0.00% 8.29 8.29 +0.00%
1.30 1.30 +0.00% 8.04 8.04 +0.00%
ham mean and sdev for all runs
1.66 1.66 +0.00% 9.30 9.30 +0.00%
spam mean spam sdev
93.80 93.83 +0.03% 19.39 19.31 -0.41%
90.56 90.59 +0.03% 24.31 24.26 -0.21%
89.24 89.28 +0.04% 27.03 27.04 +0.04%
89.27 89.27 +0.00% 25.51 25.50 -0.04%
92.72 92.74 +0.02% 21.67 21.67 +0.00%
spam mean and sdev for all runs
91.12 91.14 +0.02% 23.81 23.79 -0.08%
ham/spam mean difference: 89.46 89.48 +0.02
Skip