[Python-checkins] python/nondist/sandbox/spambayes timtest.py,1.15,1.16
tim_one@users.sourceforge.net
Wed, 04 Sep 2002 20:48:30 -0700
Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv22985
Modified Files:
timtest.py
Log Message:
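tokenize_word(): Oops! This was awfully permissive in what it
took as being "an email address": any long word without high-bit
chars and with exactly one '@' qualified. Tightened that: the word
must now also be shorter than 40 chars and contain a '.'. Also
avoided 5-gram'ing of email addresses w/ high-bit characters, by
testing for an email address before testing for high-bit chars.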
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.050 0.050 tied
0.000 0.000 tied
0.025 0.025 tied
0.025 0.025 tied
0.050 0.050 tied
0.025 0.025 tied
0.025 0.025 tied
0.025 0.050 lost
0.075 0.075 tied
0.025 0.025 tied
0.025 0.025 tied
0.025 0.025 tied
0.025 0.025 tied
0.025 0.025 tied
0.025 0.025 tied
0.000 0.000 tied
0.025 0.025 tied
0.050 0.050 tied
won 0 times
tied 19 times
lost 1 times
total unique fp went from 7 to 8
false negative percentages
0.764 0.691 won
0.691 0.655 won
0.981 0.945 won
1.309 1.309 tied
1.418 1.164 won
0.873 0.800 won
0.800 0.763 won
1.163 1.163 tied
1.491 1.345 won
1.200 1.127 won
1.381 1.345 won
1.454 1.490 lost
1.164 0.909 won
0.655 0.582 won
0.655 0.691 lost
1.163 1.163 tied
1.200 1.018 won
0.982 0.873 won
0.982 0.909 won
1.236 1.127 won
won 15 times
tied 3 times
lost 2 times
total unique fn went from 260 to 249
Note: Each of the two losses there consists of just a 1-msg difference.
The wins are bigger as well as more common, and 260-249 = 11 spams
no longer sneak by any run (which is more than 4% of the 260 spams
that used to sneak thru!).
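The tallies and the 4% figure can be re-derived from the false negative
columns above. This is a minimal Python sketch, assuming the first column
is the rate before the change, the second is the rate after, and lower is
better. tally() is a hypothetical helper written for this illustration,
not part of timtest.py.

```python
def tally(pairs):
    """Count (won, tied, lost) over (before, after) error rates.

    Assumption: a lower 'after' rate counts as a win.
    """
    won = tied = lost = 0
    for before, after in pairs:
        if after < before:
            won += 1
        elif after > before:
            lost += 1
        else:
            tied += 1
    return won, tied, lost


# False negative percentages (before, after), copied from the table above.
fn_pairs = [
    (0.764, 0.691), (0.691, 0.655), (0.981, 0.945), (1.309, 1.309),
    (1.418, 1.164), (0.873, 0.800), (0.800, 0.763), (1.163, 1.163),
    (1.491, 1.345), (1.200, 1.127), (1.381, 1.345), (1.454, 1.490),
    (1.164, 0.909), (0.655, 0.582), (0.655, 0.691), (1.163, 1.163),
    (1.200, 1.018), (0.982, 0.873), (0.982, 0.909), (1.236, 1.127),
]
print(tally(fn_pairs))                     # (15, 3, 2)

# Share of the previously missed spams that no longer get through.
fixed = 260 - 249
print(round(100.0 * fixed / 260, 1))       # 4.2
```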
Index: timtest.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/timtest.py,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** timtest.py 4 Sep 2002 01:21:20 -0000 1.15
--- timtest.py 5 Sep 2002 03:48:28 -0000 1.16
***************
*** 440,459 ****
      elif n > 2:
!         # A long word. If there are any high-bit chars,
          # tokenize it as byte 5-grams.
          # XXX This really won't work for high-bit languages -- the scoring
          # XXX scheme throws almost everything away, and one bad phrase can
          # XXX generate enough bad 5-grams to dominate the final score.
!         if has_highbit_char(word):
              for i in xrange(n-4):
                  yield "5g:" + word[i : i+5]
-         elif word.count('@') == 1:
-             # Don't want to skip embedded email addresses.
-             p1, p2 = word.split('@')
-             yield 'email name:' + p1
-             for piece in p2.split('.'):
-                 yield 'email addr:' + piece
-
          else:
              # It's a long string of "normal" chars. Ignore it.
--- 440,462 ----
      elif n > 2:
!         # A long word.
!
!         # Don't want to skip embedded email addresses.
!         if n < 40 and '.' in word and word.count('@') == 1:
!             p1, p2 = word.split('@')
!             yield 'email name:' + p1
!             for piece in p2.split('.'):
!                 yield 'email addr:' + piece
!
!         # If there are any high-bit chars,
          # tokenize it as byte 5-grams.
          # XXX This really won't work for high-bit languages -- the scoring
          # XXX scheme throws almost everything away, and one bad phrase can
          # XXX generate enough bad 5-grams to dominate the final score.
!         # XXX This also increases the database size substantially.
!         elif has_highbit_char(word):
              for i in xrange(n-4):
                  yield "5g:" + word[i : i+5]
          else:
              # It's a long string of "normal" chars. Ignore it.
***************
*** 492,499 ****
              yield 'subject:' + t
      # From:
!     for field in ('from',):
          prefix = field + ':'
!         subj = msg.get(field, '')
          for w in subj.lower().split():
              for t in tokenize_word(w):
--- 495,508 ----
              yield 'subject:' + t
+     # Dang -- I can't use Sender:. If I do,
+     # 'sender:email name:python-list-admin'
+     # becomes the most powerful indicator in the whole database.
+     #
      # From:
!     # Reply-To:
!     # X-Mailer:
!     for field in ('from',):# 'reply-to', 'x-mailer',):
          prefix = field + ':'
!         subj = msg.get(field, '-None-')
          for w in subj.lower().split():
              for t in tokenize_word(w):
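
To try the tightened email-address test outside of timtest.py, the new
side of the first hunk can be sketched as a standalone Python 3 generator.
Several things here are assumptions rather than the checked-in code:
has_highbit_char() is a stand-in (any character with ordinal >= 128),
since the real helper is not shown in this diff; range() replaces
Python 2's xrange(); the sketch slices Unicode strings where the original
slices byte strings; and tokenize_long_word() is a hypothetical name
covering only the body of the "elif n > 2:" branch.

```python
def has_highbit_char(word):
    # Assumed stand-in: the real helper is not shown in the diff.
    return any(ord(ch) >= 128 for ch in word)


def tokenize_long_word(word):
    """Sketch of the 'long word' branch after revision 1.16."""
    n = len(word)
    # The email test now comes first, and requires a short word that
    # contains a '.' and exactly one '@'.
    if n < 40 and '.' in word and word.count('@') == 1:
        p1, p2 = word.split('@')
        yield 'email name:' + p1
        for piece in p2.split('.'):
            yield 'email addr:' + piece
    # Only non-email words with high-bit chars are broken into 5-grams.
    elif has_highbit_char(word):
        for i in range(n - 4):
            yield "5g:" + word[i : i+5]
    # Otherwise it is a long string of "normal" chars, and is ignored.


print(list(tokenize_long_word('tim@example.com')))
# ['email name:tim', 'email addr:example', 'email addr:com']

# One '@' but no '.', so this no longer counts as an email address.
print(list(tokenize_long_word('averylongword@nodothere')))
# []

# An address with high-bit chars yields email tokens, not 5-grams.
tokens = list(tokenize_long_word('caf\xe9@\xe9cole.fr'))
print(len(tokens))
# 3
```

Under the revision 1.15 branch shown in the diff, the second example would
have produced 'email name:' and 'email addr:' tokens, because the only
requirement there was exactly one '@'.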