[spambayes-dev] Wittel/Wu article on statistical attacks
Jeff Epler
jepler at unpythonic.net
Thu Sep 9 19:07:00 CEST 2004
I used the "top 100 English words" file referenced in the paper and
checked the nspam vs nham counts in my database. Some of the words were
very spammy; few were more than a little bit hammy.
Given this, it's hard to see how adding "common words" from this
list would effectively bypass my spambayes filter.
It's interesting to note that the subject: tokens are the most extreme
of the lot. I have no idea what that means, though.
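One thing that may be worth checking, offered as a guess rather than an explanation: many of the extreme subject: tokens have very small counts, and a ratio built from a handful of messages lands on 0 or 100 easily. A minimal sketch that separates low-count rows from well-supported ones, using a few rows copied from the table in this message; the cutoff of 20 and the helper names are assumptions made up for illustration:

```python
# Rows are (token, spam_count, ham_count), copied from the table in this message.
rows = [
    ("subject:years", 7, 0),
    ("subject:two", 0, 4),
    ("subject:you", 118, 4),
    ("you", 1750, 922),
    ("the", 1886, 1101),
]

MIN_TOTAL = 20  # arbitrary cutoff, an assumption for illustration only


def spam_percent(s, h):
    """Same measure as the script in this message: 100*s/(s+h)."""
    return 100.0 * s / (s + h)


def split_by_support(rows, min_total=MIN_TOTAL):
    """Separate rows seen in few messages from rows seen in many."""
    low, high = [], []
    for token, s, h in rows:
        bucket = high if s + h >= min_total else low
        bucket.append((token, spam_percent(s, h)))
    return low, high


low, high = split_by_support(rows)
print("low support: ", low)
print("high support:", high)
```

With a cutoff like this, rows such as subject:years and subject:two fall into the low-support bucket, while subject:you stays in the high-support one, so low counts would account for only part of the pattern at best.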
Jeff
------------------------------------------------------------------------
import csv, sets

def main():
    # One word per line in the "top 100" list.
    words = sets.Set(open("top100en.txt").read().split("\n"))
    # Flat dump of the database: token, ham count, spam count.
    db = csv.reader(open("spambayes.db.flat"))
    output = []
    for line in db:
        if len(line) != 3: continue
        k = line[0]
        h = int(line[1])
        s = int(line[2])
        # Compare subject: tokens by the bare word after the prefix.
        if k.startswith("subject:"): k1 = k.split(":")[-1]
        else: k1 = k
        if k1 and (k1 in words):
            output.append((100.*s/(s+h), k, s, h))
    # Most spammy first.
    output.sort()
    output.reverse()
    # Rows are (percent, token, s, h), so the header lists s before h.
    print "100*s/(h+s) token s h"
    print "------------------------------------"
    for row in output:
        print "%5.1f %20s %4d %4d" % row

main()
100*s/(h+s) token s h
------------------------------------
100.0 subject:years 7 0
100.0 subject:would 1 0
100.0 subject:will 9 0
100.0 subject:who 3 0
100.0 subject:were 2 0
100.0 subject:time 25 0
100.0 subject:they 2 0
100.0 subject:their 7 0
100.0 subject:some 9 0
100.0 subject:said 4 0
100.0 subject:percent 2 0
100.0 subject:people 5 0
100.0 subject:over 4 0
100.0 subject:other 8 0
100.0 subject:only 11 0
100.0 subject:most 5 0
100.0 subject:more 13 0
100.0 subject:may 2 0
100.0 subject:market 3 0
100.0 subject:last 2 0
100.0 subject:his 4 0
100.0 subject:had 5 0
100.0 subject:company 2 0
100.0 subject:but 8 0
100.0 subject:because 1 0
100.0 subject:This 14 0
96.7 subject:you 118 4
92.9 percent 13 1
91.9 subject:can 34 3
90.6 subject:now 29 3
89.2 subject:this 66 8
88.9 subject:software 16 2
86.7 market 72 11
85.9 government 67 11
85.7 subject:into 6 1
84.9 company 152 27
84.8 subject:that 28 5
84.8 subject:about 28 5
84.8 million 95 17
84.6 subject:than 11 2
84.6 subject:out 11 2
81.8 subject:have 18 4
81.8 subject:has 9 2
81.5 subject:any 22 5
80.0 subject:all 8 2
78.6 over 283 77
77.2 years 129 38
77.1 subject:The 37 11
75.0 subject:new 12 4
75.0 subject:after 3 1
75.0 his 177 59
74.8 said 104 35
74.2 who 308 107
74.0 year 71 25
73.7 subject:New 56 20
72.1 now 385 149
72.1 most 281 109
70.9 subject:the 134 55
70.0 subject:been 7 3
68.8 time 350 159
68.2 new 457 213
67.8 more 713 339
67.6 subject:was 23 11
67.0 their 345 170
66.1 out 445 228
66.0 its 221 114
65.8 will 811 422
65.7 system 237 124
65.5 you 1750 922
65.4 last 178 94
65.2 been 377 201
65.0 they 406 219
63.7 only 456 260
63.6 subject:are 35 20
63.3 had 207 120
63.1 the 1886 1101
63.0 can 827 486
62.9 because 287 169
62.4 for 1458 879
61.8 than 373 231
61.6 were 218 136
61.6 all 678 423
61.5 and 1690 1058
61.5 has 476 298
61.2 subject:for 170 108
61.1 many 214 136
61.1 with 1001 637
61.1 have 903 576
61.0 after 199 127
60.9 are 827 530
60.9 people 277 178
60.6 also 282 183
60.4 from 843 552
60.4 one 491 322
60.2 two 183 121
60.2 may 287 190
60.1 into 235 156
60.0 subject:year 3 2
60.0 subject:there 6 4
60.0 subject:could 3 2
59.9 this 1099 735
59.8 subject:and 113 76
59.1 subject:not 13 9
59.0 software 121 84
58.0 about 426 309
57.3 subject:from 55 41
56.0 subject:one 14 11
56.0 that 1010 795
55.3 subject:with 47 38
55.2 was 554 449
54.4 not 796 667
54.3 says 51 43
53.7 first 203 175
53.3 other 353 309
52.5 data 83 75
52.3 such 156 142
51.7 any 456 426
51.5 some 360 339
50.7 could 232 226
50.0 subject:also 1 1
48.0 there 346 375
47.6 subject:when 10 11
47.5 when 347 384
46.4 them 148 171
45.8 which 322 381
45.8 would 428 507
45.7 but 515 613
41.8 use 248 345
32.0 subject:many 8 17
30.4 subject:system 7 16
27.3 subject:use 3 8
20.0 subject:data 2 8
4.3 subject:which 1 22
0.0 subject:two 0 4
0.0 subject:says 0 3
0.0 subject:million 0 2
0.0 subject:first 0 1
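The script in this message is Python 2 (the sets module and the print statement). For anyone trying the same check today, here is a rough Python 3 sketch of the same computation. It is not the original script: the function name spam_percentages is made up here, the skip for tokens with zero total count is an added assumption, and the demo rows are a tiny in-memory stand-in for the flat database dump, which is assumed to hold token, ham count, spam count per row as the original code reads it.

```python
import csv


def spam_percentages(rows, words):
    """Return (percent, token, s, h) tuples, most spammy first.

    rows  -- iterable of [token, ham_count, spam_count] records
    words -- set of common words to look up
    """
    results = []
    for row in rows:
        if len(row) != 3:
            continue  # skip malformed records, as the original does
        token, h, s = row[0], int(row[1]), int(row[2])
        if s + h == 0:
            continue  # added guard (assumption): avoid dividing by zero
        # Compare subject: tokens by the bare word after the prefix.
        bare = token.split(":")[-1] if token.startswith("subject:") else token
        if bare and bare in words:
            results.append((100.0 * s / (s + h), token, s, h))
    results.sort(reverse=True)
    return results


def report(words_path, db_path):
    """Read the word list and flat database from disk and print the table."""
    with open(words_path) as f:
        words = set(f.read().split("\n"))
    with open(db_path, newline="") as f:
        results = spam_percentages(csv.reader(f), words)
    print("100*s/(h+s) token s h")
    print("------------------------------------")
    for row in results:
        print("%5.1f %20s %4d %4d" % row)


# Small in-memory demo; the counts for "you" and "subject:two" come from the table.
demo_rows = [["you", "922", "1750"], ["subject:two", "4", "0"],
             ["zebra", "1", "1"], ["malformed"]]
demo = spam_percentages(demo_rows, {"you", "two"})
for row in demo:
    print("%5.1f %20s %4d %4d" % row)
```

To run it against real files, call report("top100en.txt", "spambayes.db.flat") with the same file names the original script uses.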