[spambayes-dev] Wittel/Wu article on statistical attacks

Jeff Epler jepler at unpythonic.net
Thu Sep 9 19:07:00 CEST 2004


I used the "top 100 english words" file referenced in the paper and
checked the nspam vs nham counts in my database.  Some of the words were
very spammy; few were more than a little bit hammy.

Given this, it's hard to see how adding "common words" from this
list would effectively bypass my spambayes filter.

It's interesting to note that the subject: tokens are the most extreme
of the lot.  I have no idea what that means, though.

Jeff
------------------------------------------------------------------------

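import csv

def main():
    # Common English words, one per line.
    with open("top100en.txt") as f:
        words = set(f.read().split("\n"))

    output = []
    # Flat export of the database; each row is read as token, ham count, spam count.
    with open("spambayes.db.flat", newline="") as f:
        for line in csv.reader(f):
            if len(line) != 3: continue
            k = line[0]
            h = int(line[1])
            s = int(line[2])
            # Skip tokens with no counts to avoid dividing by zero.
            if s + h == 0: continue
            # Match "subject:word" tokens against the bare word.
            if k.startswith("subject:"): k1 = k.split(":")[-1]
            else: k1 = k
            if k1 and (k1 in words):
                output.append((100.*s/(s+h), k, s, h))

    # Spammiest tokens first.
    output.sort(reverse=True)
    # Column order matches the row tuples: percentage, token, spam count, ham count.
    print("100*s/(h+s)          token    s    h")
    print("------------------------------------")
    for row in output:
        print("%5.1f %20s %4d %4d" % row)

if __name__ == "__main__":
    main()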

100*s/(h+s)          token    s    h
------------------------------------
100.0        subject:years    7    0
100.0        subject:would    1    0
100.0         subject:will    9    0
100.0          subject:who    3    0
100.0         subject:were    2    0
100.0         subject:time   25    0
100.0         subject:they    2    0
100.0        subject:their    7    0
100.0         subject:some    9    0
100.0         subject:said    4    0
100.0      subject:percent    2    0
100.0       subject:people    5    0
100.0         subject:over    4    0
100.0        subject:other    8    0
100.0         subject:only   11    0
100.0         subject:most    5    0
100.0         subject:more   13    0
100.0          subject:may    2    0
100.0       subject:market    3    0
100.0         subject:last    2    0
100.0          subject:his    4    0
100.0          subject:had    5    0
100.0      subject:company    2    0
100.0          subject:but    8    0
100.0      subject:because    1    0
100.0         subject:This   14    0
 96.7          subject:you  118    4
 92.9              percent   13    1
 91.9          subject:can   34    3
 90.6          subject:now   29    3
 89.2         subject:this   66    8
 88.9     subject:software   16    2
 86.7               market   72   11
 85.9           government   67   11
 85.7         subject:into    6    1
 84.9              company  152   27
 84.8         subject:that   28    5
 84.8        subject:about   28    5
 84.8              million   95   17
 84.6         subject:than   11    2
 84.6          subject:out   11    2
 81.8         subject:have   18    4
 81.8          subject:has    9    2
 81.5          subject:any   22    5
 80.0          subject:all    8    2
 78.6                 over  283   77
 77.2                years  129   38
 77.1          subject:The   37   11
 75.0          subject:new   12    4
 75.0        subject:after    3    1
 75.0                  his  177   59
 74.8                 said  104   35
 74.2                  who  308  107
 74.0                 year   71   25
 73.7          subject:New   56   20
 72.1                  now  385  149
 72.1                 most  281  109
 70.9          subject:the  134   55
 70.0         subject:been    7    3
 68.8                 time  350  159
 68.2                  new  457  213
 67.8                 more  713  339
 67.6          subject:was   23   11
 67.0                their  345  170
 66.1                  out  445  228
 66.0                  its  221  114
 65.8                 will  811  422
 65.7               system  237  124
 65.5                  you 1750  922
 65.4                 last  178   94
 65.2                 been  377  201
 65.0                 they  406  219
 63.7                 only  456  260
 63.6          subject:are   35   20
 63.3                  had  207  120
 63.1                  the 1886 1101
 63.0                  can  827  486
 62.9              because  287  169
 62.4                  for 1458  879
 61.8                 than  373  231
 61.6                 were  218  136
 61.6                  all  678  423
 61.5                  and 1690 1058
 61.5                  has  476  298
 61.2          subject:for  170  108
 61.1                 many  214  136
 61.1                 with 1001  637
 61.1                 have  903  576
 61.0                after  199  127
 60.9                  are  827  530
 60.9               people  277  178
 60.6                 also  282  183
 60.4                 from  843  552
 60.4                  one  491  322
 60.2                  two  183  121
 60.2                  may  287  190
 60.1                 into  235  156
 60.0         subject:year    3    2
 60.0        subject:there    6    4
 60.0        subject:could    3    2
 59.9                 this 1099  735
 59.8          subject:and  113   76
 59.1          subject:not   13    9
 59.0             software  121   84
 58.0                about  426  309
 57.3         subject:from   55   41
 56.0          subject:one   14   11
 56.0                 that 1010  795
 55.3         subject:with   47   38
 55.2                  was  554  449
 54.4                  not  796  667
 54.3                 says   51   43
 53.7                first  203  175
 53.3                other  353  309
 52.5                 data   83   75
 52.3                 such  156  142
 51.7                  any  456  426
 51.5                 some  360  339
 50.7                could  232  226
 50.0         subject:also    1    1
 48.0                there  346  375
 47.6         subject:when   10   11
 47.5                 when  347  384
 46.4                 them  148  171
 45.8                which  322  381
 45.8                would  428  507
 45.7                  but  515  613
 41.8                  use  248  345
 32.0         subject:many    8   17
 30.4       subject:system    7   16
 27.3          subject:use    3    8
 20.0         subject:data    2    8
  4.3        subject:which    1   22
  0.0          subject:two    0    4
  0.0         subject:says    0    3
  0.0      subject:million    0    2
  0.0        subject:first    0    1
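
One possible reading of the extreme subject: rows is that they rest on
very few messages: most of the 100.0 and 0.0 entries have a total count
under ten.  The sketch below (Python 3) applies Robinson's Bayesian
adjustment, f = (S*x + n*p)/(S + n), to a few rows copied from the table.
The function name adjusted_spamprob is hypothetical, and S = 0.45 and
x = 0.5 are assumed values (believed to match the SpamBayes defaults for
unknown_word_strength and unknown_word_prob).  Note that p here is the raw
count ratio used by the script, not the classifier's ratio normalized by
the total number of ham and spam messages, so this is only a rough
illustration.

```python
def adjusted_spamprob(s, h, strength=0.45, prior=0.5):
    """Shrink the raw ratio s/(s+h) toward `prior` when s+h is small.

    strength and prior are assumed values, not read from any database.
    """
    n = s + h
    if n == 0:
        return prior
    p = s / n
    return (strength * prior + n * p) / (strength + n)

# (token, spam count, ham count), taken from the table above
rows = [
    ("subject:would", 1, 0),
    ("subject:time", 25, 0),
    ("subject:two", 0, 4),
    ("the", 1886, 1101),
]

for token, s, h in rows:
    raw = 100.0 * s / (s + h)
    adj = 100.0 * adjusted_spamprob(s, h)
    print("%20s raw=%5.1f adjusted=%5.1f" % (token, raw, adj))
```

Under these assumptions a token seen once, such as subject:would, is
pulled well away from 100, while a heavily counted token such as "the"
barely moves.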