[Python-checkins] python/nondist/sandbox/spambayes timtest.py,1.15,1.16

Wed, 04 Sep 2002 20:48:30 -0700

Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv22985

Modified Files:
	timtest.py 
Log Message:
tokenize_word():  Oops!  This was awfully permissive in what it
took as being "an email address".  Tightened that, and also
avoided 5-gram'ing of email addresses w/ high-bit characters.
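
For illustration, the new order of checks can be sketched roughly as
below.  This is a Python 3 sketch under assumptions, not the code in
timtest.py:  tokenize_long_word is a hypothetical name, the body of
has_highbit_char is assumed (the real helper is not shown in this
checkin), and the final branch simply yields nothing.

```python
def has_highbit_char(word):
    # Assumed stand-in for the helper of the same name in timtest.py:
    # true if any character falls outside 7-bit ASCII.
    return any(ord(ch) >= 128 for ch in word)


def tokenize_long_word(word):
    """Hypothetical sketch of the long-word branch after this change."""
    n = len(word)
    # The email test now comes first and is stricter: the word must be
    # short, contain a dot, and contain exactly one '@'.
    if n < 40 and '.' in word and word.count('@') == 1:
        p1, p2 = word.split('@')
        yield 'email name:' + p1
        for piece in p2.split('.'):
            yield 'email addr:' + piece
    # Only words that did not pass the email test get 5-gram'ed.
    elif has_highbit_char(word):
        for i in range(n - 4):
            yield "5g:" + word[i:i + 5]
    # Otherwise: a long string of "normal" chars; this sketch ignores it.


if __name__ == "__main__":
    for w in ("someone@example.com", "not@an-address", "caf\xe9s!"):
        print(repr(w), list(tokenize_long_word(w)))
```

Because the email test runs first, an address containing high-bit
characters yields only email tokens and no 5-grams.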

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.050  0.050  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.050  lost
    0.075  0.075  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.050  0.050  tied

won   0 times
tied 19 times
lost  1 times

total unique fp went from 7 to 8

false negative percentages
    0.764  0.691  won
    0.691  0.655  won
    0.981  0.945  won
    1.309  1.309  tied
    1.418  1.164  won
    0.873  0.800  won
    0.800  0.763  won
    1.163  1.163  tied
    1.491  1.345  won
    1.200  1.127  won
    1.381  1.345  won
    1.454  1.490  lost
    1.164  0.909  won
    0.655  0.582  won
    0.655  0.691  lost
    1.163  1.163  tied
    1.200  1.018  won
    0.982  0.873  won
    0.982  0.909  won
    1.236  1.127  won

won  15 times
tied  3 times
lost  2 times

total unique fn went from 260 to 249

Note:  Each of the two losses there consists of just a 1-msg difference.
The wins are bigger as well as being more common, and 260-249 = 11
spams no longer sneak by any run (which is more than 4% of the 260
spams that used to sneak thru!).
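
The tallies and the 4% figure can be rechecked with a few lines of
Python.  A sketch only:  tally is a hypothetical helper, not part of
the test driver, and the pairs are copied from the false negative
table above (lower is better, so a smaller "after" value is a win).

```python
# (before, after) false negative percentages, copied from the table above.
FN_PAIRS = [
    (0.764, 0.691), (0.691, 0.655), (0.981, 0.945), (1.309, 1.309),
    (1.418, 1.164), (0.873, 0.800), (0.800, 0.763), (1.163, 1.163),
    (1.491, 1.345), (1.200, 1.127), (1.381, 1.345), (1.454, 1.490),
    (1.164, 0.909), (0.655, 0.582), (0.655, 0.691), (1.163, 1.163),
    (1.200, 1.018), (0.982, 0.873), (0.982, 0.909), (1.236, 1.127),
]


def tally(pairs):
    """Count runs where the error rate went down, stayed put, or went up."""
    counts = {"won": 0, "tied": 0, "lost": 0}
    for before, after in pairs:
        if after < before:
            counts["won"] += 1
        elif after > before:
            counts["lost"] += 1
        else:
            counts["tied"] += 1
    return counts


# Unique false negatives across all runs, before and after the change.
FN_BEFORE, FN_AFTER = 260, 249
reduction_pct = 100.0 * (FN_BEFORE - FN_AFTER) / FN_BEFORE

if __name__ == "__main__":
    print(tally(FN_PAIRS))
    print("%.2f%% fewer unique false negatives" % reduction_pct)
```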

Index: timtest.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/timtest.py,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** timtest.py	4 Sep 2002 01:21:20 -0000	1.15
--- timtest.py	5 Sep 2002 03:48:28 -0000	1.16
***************
*** 440,459 ****

      elif n > 2:
!         # A long word.  If there are any high-bit chars,
          # tokenize it as byte 5-grams.
          # XXX This really won't work for high-bit languages -- the scoring
          # XXX scheme throws almost everything away, and one bad phrase can
          # XXX generate enough bad 5-grams to dominate the final score.
!         if has_highbit_char(word):
              for i in xrange(n-4):
                  yield "5g:" + word[i : i+5]

-         elif word.count('@') == 1:
-             # Don't want to skip embedded email addresses.
-             p1, p2 = word.split('@')
-             yield 'email name:' + p1
-             for piece in p2.split('.'):
-                 yield 'email addr:' + piece
- 
          else:
              # It's a long string of "normal" chars.  Ignore it.
--- 440,462 ----

      elif n > 2:
!         # A long word.
! 
!         # Don't want to skip embedded email addresses.
!         if n < 40 and '.' in word and word.count('@') == 1:
!             p1, p2 = word.split('@')
!             yield 'email name:' + p1
!             for piece in p2.split('.'):
!                 yield 'email addr:' + piece
! 
!         # If there are any high-bit chars,
          # tokenize it as byte 5-grams.
          # XXX This really won't work for high-bit languages -- the scoring
          # XXX scheme throws almost everything away, and one bad phrase can
          # XXX generate enough bad 5-grams to dominate the final score.
!         # XXX This also increases the database size substantially.
!         elif has_highbit_char(word):
              for i in xrange(n-4):
                  yield "5g:" + word[i : i+5]

          else:
              # It's a long string of "normal" chars.  Ignore it.
***************
*** 492,499 ****
              yield 'subject:' + t

      # From:
!     for field in ('from',):
          prefix = field + ':'
!         subj = msg.get(field, '')
          for w in subj.lower().split():
              for t in tokenize_word(w):
--- 495,508 ----
              yield 'subject:' + t

+     # Dang -- I can't use Sender:.  If I do,
+     #     'sender:email name:python-list-admin'
+     # becomes the most powerful indicator in the whole database.
+     #
      # From:
!     # Reply-To:
!     # X-Mailer:
!     for field in ('from',):# 'reply-to', 'x-mailer',):
          prefix = field + ':'
!         subj = msg.get(field, '-None-')
          for w in subj.lower().split():
              for t in tokenize_word(w):