[Spambayes-checkins] spambayes tokenizer.py,1.5,1.6

Tim Peters tim_one@users.sourceforge.net
Sun, 08 Sep 2002 11:54:12 -0700


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv32497

Modified Files:
	tokenizer.py 
Log Message:
tokenize():  Stop distinguishing Content-XYZ thingies in the
headers from instances in lower-level MIME sections.  In all,
doing so appears to be just another way of warping the
tokenizer to c.l.py's extreme hatred of HTML.  For example,
'>content-type:text/plain' (lower-level instance) has a spamprob
of 0.85 in my data, but 'content-type:text/plain' (top-level
instance) has spamprob less than 0.25.  A few examples Guido
posted suggest this distinction does more harm on his data
than it does good on mine.  On mine, getting rid of the
distinction makes a tiny difference in the f-n rates (in the
tables below, the first column is before this change and the
second is after); note that an f-n boost from 0.327% to 0.364%
represents a single msg in my ~2750-msg spam sets:

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.025  0.025  tied
    0.075  0.075  tied
    0.000  0.000  tied
    0.100  0.075  won    -25.00%
    0.050  0.075  lost   +50.00%
    0.025  0.025  tied
    0.025  0.025  tied
    0.050  0.050  tied
    0.050  0.050  tied
    0.050  0.050  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.075  0.075  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.050  0.050  tied

won   1 times
tied 18 times
lost  1 times

total unique fp went from 13 to 12 won     -7.69%

false negative percentages
    0.327  0.327  tied
    0.400  0.400  tied
    0.327  0.364  lost   +11.31%
    0.691  0.691  tied
    0.545  0.545  tied
    0.291  0.291  tied
    0.218  0.291  lost   +33.49%
    0.654  0.618  won     -5.50%
    0.364  0.436  lost   +19.78%
    0.291  0.327  lost   +12.37%
    0.327  0.364  lost   +11.31%
    0.691  0.691  tied
    0.582  0.618  lost    +6.19%
    0.291  0.291  tied
    0.364  0.291  won    -20.05%
    0.436  0.436  tied
    0.436  0.473  lost    +8.49%
    0.218  0.218  tied
    0.291  0.291  tied
    0.254  0.254  tied

won   2 times
tied 11 times
lost  7 times

total unique fn went from 106 to 110 lost    +3.77%
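
The arithmetic behind the "single msg" remark and the percentage
columns can be checked with a short sketch. It assumes each spam set
holds exactly 2750 messages (the log message only says "~2750"), and
it assumes the won/lost percentages are relative changes computed
from the rounded rates as printed, which is consistent with the
figures above but is not confirmed by the source. The names
SPAM_PER_SET, one_msg_points and percent_change are illustrative only.

```python
# Assumption: exactly 2750 spam messages per set (the source says "~2750").
SPAM_PER_SET = 2750

# Percentage points contributed by one misclassified message in one set.
one_msg_points = 100.0 / SPAM_PER_SET


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    # One message is roughly the gap between 0.327% and 0.364%.
    print(f"one msg = {one_msg_points:.3f} percentage points")
    # Relative change for one f-n row, using the rounded rates as printed.
    print(f"0.327 -> 0.364: {percent_change(0.327, 0.364):+.2f}%")
    # Relative change for the two "total unique" lines.
    print(f"fp 13 -> 12:    {percent_change(13, 12):+.2f}%")
    print(f"fn 106 -> 110:  {percent_change(106, 110):+.2f}%")
```

Under those assumptions a single message moves a set's rate by about
0.036 percentage points, so the small per-set f-n changes in the table
correspond to only a message or two each.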



Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** tokenizer.py	8 Sep 2002 17:18:41 -0000	1.5
--- tokenizer.py	8 Sep 2002 18:54:09 -0000	1.6
***************
*** 757,765 ****
  
          # Content-{Type, Disposition} and their params, and charsets.
-         t = ''
          for x in msg.walk():
              for w in crack_content_xyz(x):
!                 yield t + w
!             t = '>'
  
          # Subject:
--- 757,763 ----
  
          # Content-{Type, Disposition} and their params, and charsets.
          for x in msg.walk():
              for w in crack_content_xyz(x):
!                 yield w
  
          # Subject:
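
The effect of the diff above can be illustrated with a minimal,
self-contained sketch. It is not the real tokenizer: the function
simple_content_tokens is a hypothetical stand-in for
crack_content_xyz() that emits only a content-type token, the sample
message is made up, and the code targets a current Python 3 email
package rather than the 2002 one. The two generator functions mirror
the loops in revisions 1.5 and 1.6 respectively.

```python
from email import message_from_string

# Made-up two-part message, used only for illustration.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XYZ"

--XYZ
Content-Type: text/plain

hello
--XYZ
Content-Type: text/html

<b>hello</b>
--XYZ--
"""


def simple_content_tokens(part):
    # Hypothetical stand-in for crack_content_xyz(): content type only.
    yield 'content-type:' + part.get_content_type()


def tokens_r1_5(msg):
    # Loop as in revision 1.5: the first object from walk() (the
    # top-level message) gets no prefix, every later part gets '>'.
    t = ''
    for x in msg.walk():
        for w in simple_content_tokens(x):
            yield t + w
        t = '>'


def tokens_r1_6(msg):
    # Loop as in revision 1.6: no prefix at any nesting level.
    for x in msg.walk():
        for w in simple_content_tokens(x):
            yield w


msg = message_from_string(RAW)
old_tokens = list(tokens_r1_5(msg))
new_tokens = list(tokens_r1_6(msg))

if __name__ == "__main__":
    print(old_tokens)
    print(new_tokens)
```

With the old loop, the text/plain part of a multipart message yields
'>content-type:text/plain', a different token from the
'content-type:text/plain' produced by a single-part plain-text
message. With the new loop both cases yield the same token, so the
classifier keeps one spamprob for it instead of two.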