[Python-checkins] python/nondist/sandbox/spambayes timtest.py,1.12,1.13
tim_one@users.sourceforge.net
Mon, 02 Sep 2002 19:13:49 -0700
Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv15644
Modified Files:
timtest.py
Log Message:
A reluctant "on principle" change no matter what it does to the stats:
take a stab at removing HTML decorations from plain text msgs. See
comments for why it's *only* in plain text msgs. This puts an end to
false positives due to text msgs talking *about* HTML. Surprisingly, it
also gets rid of some false negatives. Not surprisingly, it introduced
another small class of false positives, due to the dumbass regexp trick
used to approximate HTML tag removal stripping pieces of text that had
nothing to do with HTML tags (e.g., this happened in the middle of a
uuencoded .py file in such a way that it just happened to leave behind
a string that "looked like" a spam phrase; but before this change it
looked like a pile of "too long" lines that didn't generate any tokens --
it's a nonsense outcome either way).
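For reference, the tag-stripping gimmick the log message describes amounts
to the following sketch (the regexp is the one added to timtest.py in this
checkin; the sample string is made up for illustration):

```python
import re

# Cheap-ass gimmick to probabilistically find HTML/XML tags.
html_re = re.compile(r"""
    <
    [^\s<>]        # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
    [^>]{0,128}    # search for the end '>', but don't run wild
    >
""", re.VERBOSE)

text = "click <a href='http://spam.example'>here</a> now -- but a < b survives"
stripped = html_re.sub(' ', text)  # each tag collapses to a single blank
print(stripped)
```

Note the guards in the character classes: bare comparisons like 'a < b' are
left alone, while well-formed tags such as the <a ...> and </a> above are
replaced by blanks -- which is exactly how the "dumbass regexp trick" can
also eat non-HTML text that merely looks tag-shaped.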
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.050 0.050 tied
0.000 0.000 tied
0.025 0.025 tied
0.025 0.025 tied
0.050 0.050 tied
0.025 0.025 tied
0.025 0.025 tied
0.000 0.025 lost
0.075 0.075 tied
0.050 0.025 won
0.025 0.025 tied
0.000 0.025 lost
0.050 0.075 lost
0.025 0.025 tied
0.025 0.025 tied
0.000 0.000 tied
0.025 0.025 tied
0.050 0.050 tied
won 1 times
tied 16 times
lost 3 times
total unique fp went from 8 to 9
false negative percentages
0.945 0.909 won
0.836 0.800 won
1.200 1.091 won
1.418 1.381 won
1.455 1.491 lost
1.091 1.055 won
1.091 0.945 won
1.236 1.236 tied
1.564 1.564 tied
1.236 1.200 won
1.563 1.454 won
1.563 1.599 lost
1.236 1.236 tied
0.836 0.800 won
0.873 0.836 won
1.236 1.236 tied
1.273 1.236 won
1.018 1.055 lost
1.091 1.127 lost
1.490 1.381 won
won 12 times
tied 4 times
lost 4 times
total unique fn went from 292 to 284
Index: timtest.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/timtest.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** timtest.py 2 Sep 2002 19:23:40 -0000 1.12
--- timtest.py 3 Sep 2002 02:13:46 -0000 1.13
***************
*** 52,55 ****
--- 52,56 ----
return text - redundant_html
+ ##############################################################################
# To fold case or not to fold case? I didn't want to fold case, because
# it hides information in English, and I have no idea what .lower() does
***************
*** 79,82 ****
--- 80,84 ----
+ ##############################################################################
# Character n-grams or words?
#
***************
*** 162,165 ****
--- 164,366 ----
# msgs, although 5-grams seems to hate them more.
+
+ ##############################################################################
+ # How to tokenize?
+ #
+ # I started with string.split() merely for speed. Over time I realized it
+ # was making interesting context distinctions qualitatively akin to n-gram
+ # schemes; e.g., "free!!" is a much stronger spam indicator than "free". But
+ # unlike n-grams (whether word- or character- based) under Graham's scoring
+ # scheme, this mild context dependence never seems to go over the edge in
+ # giving "too much" credence to an unlucky phrase.
+ #
+ # OTOH, compared to "searching for words", it increases the size of the
+ # database substantially, less than but close to a factor of 2. This is very
+ # much less than a word bigram scheme bloats it, but as always an increase
+ # isn't justified unless the results are better.
+ #
+ # Following are stats comparing
+ #
+ # for token in text.split(): # left column
+ #
+ # to
+ #
+ # for token in re.findall(r"[\w$\-\x80-\xff]+", text): # right column
+ #
+ # text is case-normalized (text.lower()) in both cases, and the runs were
+ # identical in all other respects. The results clearly favor the split()
+ # gimmick, although they vaguely suggest that some sort of compromise
+ # may do as well with less database burden; e.g., *perhaps* folding runs of
+ # "punctuation" characters into a canonical representative could do that.
+ # But the database size is reasonable without that, and plain split() avoids
+ # having to worry about how to "fold punctuation" in languages other than
+ # English.
+ #
+ # false positive percentages
+ # 0.000 0.000 tied
+ # 0.000 0.050 lost
+ # 0.050 0.150 lost
+ # 0.000 0.025 lost
+ # 0.025 0.050 lost
+ # 0.025 0.075 lost
+ # 0.050 0.150 lost
+ # 0.025 0.000 won
+ # 0.025 0.075 lost
+ # 0.000 0.025 lost
+ # 0.075 0.150 lost
+ # 0.050 0.050 tied
+ # 0.025 0.050 lost
+ # 0.000 0.025 lost
+ # 0.050 0.025 won
+ # 0.025 0.000 won
+ # 0.025 0.025 tied
+ # 0.000 0.025 lost
+ # 0.025 0.075 lost
+ # 0.050 0.175 lost
+ #
+ # won 3 times
+ # tied 3 times
+ # lost 14 times
+ #
+ # total unique fp went from 8 to 20
+ #
+ # false negative percentages
+ # 0.945 1.200 lost
+ # 0.836 1.018 lost
+ # 1.200 1.200 tied
+ # 1.418 1.636 lost
+ # 1.455 1.418 won
+ # 1.091 1.309 lost
+ # 1.091 1.272 lost
+ # 1.236 1.563 lost
+ # 1.564 1.855 lost
+ # 1.236 1.491 lost
+ # 1.563 1.599 lost
+ # 1.563 1.781 lost
+ # 1.236 1.709 lost
+ # 0.836 0.982 lost
+ # 0.873 1.382 lost
+ # 1.236 1.527 lost
+ # 1.273 1.418 lost
+ # 1.018 1.273 lost
+ # 1.091 1.091 tied
+ # 1.490 1.454 won
+ #
+ # won 2 times
+ # tied 2 times
+ # lost 16 times
+ #
+ # total unique fn went from 292 to 302
+
+
+ ##############################################################################
+ # What about HTML?
+ #
+ # Computer geeks seem to view use of HTML in mailing lists and newsgroups as
+ # a mortal sin. Normal people don't, but so it goes: in a technical list/
+ # group, every HTML decoration has spamprob 0.99, there are lots of unique
+ # HTML decorations, and lots of them appear at the very start of the message
+ # so that Graham's scoring scheme latches on to them tight. As a result,
+ # any plain text message just containing an HTML example is likely to be
+ # judged spam (every HTML decoration is an extreme).
+ #
+ # So if a message is multipart/alternative with both text/plain and text/html
+ # branches, we ignore the latter, else newbies would never get a message
+ # through. If a message is just HTML, it has virtually no chance of getting
+ # through.
+ #
+ # In an effort to let normal people use mailing lists too <wink>, and to
+ # alleviate the woes of messages merely *discussing* HTML practice, I
+ # added a gimmick to strip HTML tags after case-normalization and after
+ # special tagging of embedded URLs. This consisted of a regexp sub pattern,
+ # where instances got replaced by single blanks:
+ #
+ # html_re = re.compile(r"""
+ # <
+ # [^\s<>] # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
+ # [^>]{0,128} # search for the end '>', but don't chew up the world
+ # >
+ # """, re.VERBOSE)
+ #
+ # and then
+ #
+ # text = html_re.sub(' ', text)
+ #
+ # Alas, little good came of this:
+ #
+ # false positive percentages
+ # 0.000 0.000 tied
+ # 0.000 0.000 tied
+ # 0.050 0.075 lost
+ # 0.000 0.000 tied
+ # 0.025 0.025 tied
+ # 0.025 0.025 tied
+ # 0.050 0.050 tied
+ # 0.025 0.025 tied
+ # 0.025 0.025 tied
+ # 0.000 0.050 lost
+ # 0.075 0.100 lost
+ # 0.050 0.050 tied
+ # 0.025 0.025 tied
+ # 0.000 0.025 lost
+ # 0.050 0.050 tied
+ # 0.025 0.025 tied
+ # 0.025 0.025 tied
+ # 0.000 0.000 tied
+ # 0.025 0.050 lost
+ # 0.050 0.050 tied
+ #
+ # won 0 times
+ # tied 15 times
+ # lost 5 times
+ #
+ # total unique fp went from 8 to 12
+ #
+ # false negative percentages
+ # 0.945 1.164 lost
+ # 0.836 1.418 lost
+ # 1.200 1.272 lost
+ # 1.418 1.272 won
+ # 1.455 1.273 won
+ # 1.091 1.382 lost
+ # 1.091 1.309 lost
+ # 1.236 1.381 lost
+ # 1.564 1.745 lost
+ # 1.236 1.564 lost
+ # 1.563 1.781 lost
+ # 1.563 1.745 lost
+ # 1.236 1.455 lost
+ # 0.836 0.982 lost
+ # 0.873 1.309 lost
+ # 1.236 1.381 lost
+ # 1.273 1.273 tied
+ # 1.018 1.273 lost
+ # 1.091 1.200 lost
+ # 1.490 1.599 lost
+ #
+ # won 2 times
+ # tied 1 times
+ # lost 17 times
+ #
+ # total unique fn went from 292 to 327
+ #
+ # The messages merely discussing HTML were no longer fps, so it did what it
+ # intended there. But the f-n rate nearly doubled on at least one run -- so
+ # strong a set of spam indicators is the mere presence of HTML. The increase
+ # in the number of fps despite that the HTML-discussing msgs left that
+ # category remains mysterious to me, but it wasn't a significant increase
+ # so I let it drop.
+ #
+ # Later: If I simply give up on making mailing lists friendly to my sisters
+ # (they're not nerds, and create wonderfully attractive HTML msgs), a
+ # compromise is to strip HTML tags from only text/plain msgs. That's
+ # principled enough so far as it goes, and eliminates the HTML-discussing
+ # false positives. It remains disturbing that the f-n rate on pure HTML
+ # msgs increases significantly when stripping tags, so the code here doesn't
+ # do that part. However, even after stripping tags, the rates above show that
+ # at least 98% of spams are still correctly identified as spam.
+ # XXX So, if another way is found to slash the f-n rate, the decision here
+ # XXX not to strip HTML from HTML-only msgs should be revisited.
+
url_re = re.compile(r"""
(https? | ftp) # capture the protocol
***************
*** 175,178 ****
--- 376,387 ----
has_highbit_char = re.compile(r"[\x80-\xff]").search
+ # Cheap-ass gimmick to probabilistically find HTML/XML tags.
+ html_re = re.compile(r"""
+ <
+ [^\s<>] # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
+ [^>]{0,128} # search for the end '>', but don't run wild
+ >
+ """, re.VERBOSE)
+
# I'm usually just splitting on whitespace, but for subject lines I want to
# break things like "Python/Perl comparison?" up. OTOH, I don't want to
***************
*** 287,290 ****
--- 496,503 ----
for chunk in urlsep_re.split(piece):
yield prefix + chunk
+
+ # Remove HTML/XML tags if it's a plain text message.
+ if part.get_content_type() == "text/plain":
+ text = html_re.sub(' ', text)
# Tokenize everything.
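As a footnote, the split()-versus-findall comparison documented in the new
"How to tokenize?" comment block boils down to this (the sample text is
invented; the two expressions are the left and right columns of the stats):

```python
import re

text = "FREE!! Click now to get your FREE $$$ prize".lower()

# Left column of the stats: plain whitespace split, punctuation kept.
split_tokens = text.split()

# Right column: word-ish runs of \w, $, -, and high-bit characters.
findall_tokens = re.findall(r"[\w$\-\x80-\xff]+", text)

print(split_tokens)    # 'free!!' survives as its own token
print(findall_tokens)  # collapses to plain 'free'
```

The mild context dependence favoring split() is visible here: "free!!" is a
distinct (and much spammier) token than "free", which the findall gimmick
throws away.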