[Spambayes-checkins] spambayes tokenizer.py,1.5,1.6
Tim Peters
tim_one@users.sourceforge.net
Sun, 08 Sep 2002 11:54:12 -0700
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv32497
Modified Files:
tokenizer.py
Log Message:
tokenize(): Stop distinguishing Content-XYZ thingies in the
headers from instances in lower-level MIME sections. In all,
making that distinction appears to be just another way of warping
the tokenizer to c.l.py's (comp.lang.python's) extreme hatred of
HTML. For example,
'>content-type:text/plain' (lower-level instance) has a spamprob
of 0.85 in my data, but 'content-type:text/plain' (top-level
instance) has a spamprob of less than 0.25. A few examples Guido
posted suggest this distinction does more harm on his data
than it does good on mine. On mine, getting rid of the
distinction makes a tiny difference in the f-n (false negative)
rates; note that an f-n increase from 0.327% to 0.364% represents
a single msg in my ~2750-msg spam sets (first column is before
the change, second column is after):
false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.025  0.025  tied
    0.075  0.075  tied
    0.000  0.000  tied
    0.100  0.075  won    -25.00%
    0.050  0.075  lost   +50.00%
    0.025  0.025  tied
    0.025  0.025  tied
    0.050  0.050  tied
    0.050  0.050  tied
    0.050  0.050  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.075  0.075  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.050  0.050  tied
won   1 times
tied 18 times
lost  1 times
total unique fp went from 13 to 12 won     -7.69%
false negative percentages
    0.327  0.327  tied
    0.400  0.400  tied
    0.327  0.364  lost   +11.31%
    0.691  0.691  tied
    0.545  0.545  tied
    0.291  0.291  tied
    0.218  0.291  lost   +33.49%
    0.654  0.618  won     -5.50%
    0.364  0.436  lost   +19.78%
    0.291  0.327  lost   +12.37%
    0.327  0.364  lost   +11.31%
    0.691  0.691  tied
    0.582  0.618  lost    +6.19%
    0.291  0.291  tied
    0.364  0.291  won    -20.05%
    0.436  0.436  tied
    0.436  0.473  lost    +8.49%
    0.218  0.218  tied
    0.291  0.291  tied
    0.254  0.254  tied
won   2 times
tied 11 times
lost  7 times
total unique fn went from 106 to 110 lost    +3.77%
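The arithmetic behind the "single msg" remark and the summary percentages can be checked with a small sketch. It assumes each spam set holds exactly 2750 messages (the log only says "~2750"), and the helper names are made up for illustration; they are not part of the spambayes test drivers.

```python
# Assumption: every spam test set holds 2750 messages ("~2750" in the log).
SPAM_SET_SIZE = 2750


def missed_messages(rate_percent, set_size=SPAM_SET_SIZE):
    """Turn a false-negative percentage into an approximate message count."""
    return round(rate_percent / 100.0 * set_size)


def relative_change(before, after):
    """Relative change from 'before' to 'after', as a percentage."""
    return (after - before) / before * 100.0


before = missed_messages(0.327)   # about 9 missed spam
after = missed_messages(0.364)    # about 10 missed spam
extra_missed = after - before     # one additional message

fn_change = round(relative_change(106, 110), 2)   # total unique fn
fp_change = round(relative_change(13, 12), 2)     # total unique fp
```

With these numbers, a run that moves from 0.327% to 0.364% misses one more spam than before, and the totals work out to the +3.77% and -7.69% reported in the summary lines.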
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** tokenizer.py 8 Sep 2002 17:18:41 -0000 1.5
--- tokenizer.py 8 Sep 2002 18:54:09 -0000 1.6
***************
*** 757,765 ****
      # Content-{Type, Disposition} and their params, and charsets.
-     t = ''
      for x in msg.walk():
          for w in crack_content_xyz(x):
!             yield t + w
!         t = '>'
      # Subject:
--- 757,763 ----
      # Content-{Type, Disposition} and their params, and charsets.
      for x in msg.walk():
          for w in crack_content_xyz(x):
!             yield w
      # Subject:
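A rough sketch of what the change does to the token stream, using the standard email package. Here content_type_tokens() is a hypothetical stand-in for crack_content_xyz() that emits only a content-type token, and the placement of the t = '>' assignment (after the inner loop) is an assumption drawn from the log message rather than from the flattened diff.

```python
import email


def content_type_tokens(part):
    # Hypothetical stand-in for crack_content_xyz(): one token per MIME part.
    yield 'content-type:' + part.get_content_type()


def tokens_rev_1_5(msg):
    # Old behaviour: tokens from every part after the first get a '>' prefix.
    t = ''
    for x in msg.walk():
        for w in content_type_tokens(x):
            yield t + w
        t = '>'


def tokens_rev_1_6(msg):
    # New behaviour: top-level and lower-level parts yield identical tokens.
    for x in msg.walk():
        for w in content_type_tokens(x):
            yield w


raw = (
    'MIME-Version: 1.0\n'
    'Content-Type: multipart/mixed; boundary="BOUND"\n'
    '\n'
    '--BOUND\n'
    'Content-Type: text/plain\n'
    '\n'
    'hello\n'
    '--BOUND--\n'
)
msg = email.message_from_string(raw)
old = list(tokens_rev_1_5(msg))
new = list(tokens_rev_1_6(msg))
```

In this sketch the old generator produces '>content-type:text/plain' for the nested part while the new one produces plain 'content-type:text/plain', so a text/plain part contributes the same evidence whether it is the whole message or one section of a multipart message.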