[Python-checkins] python/nondist/sandbox/spambayes timtest.py,1.12,1.13

tim_one@users.sourceforge.net
Mon, 02 Sep 2002 19:13:49 -0700


Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv15644

Modified Files:
	timtest.py 
Log Message:
A reluctant "on principle" change no matter what it does to the stats:
take a stab at removing HTML decorations from plain text msgs.  See
comments for why it's *only* in plain text msgs.  This puts an end to
false positives due to text msgs talking *about* HTML.  Surprisingly, it
also gets rid of some false negatives.  Not surprisingly, it introduced
another small class of false positives, due to the dumbass regexp trick
used to approximate HTML tag removal also stripping pieces of text that had
nothing to do with HTML tags (e.g., this happened in the middle of a
uuencoded .py file in such a way that it just happened to leave behind
a string that "looked like" a spam phrase; but before this change it looked
like a pile of "too long" lines that didn't generate any tokens --
it's a nonsense outcome either way).
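
For concreteness, here is a minimal standalone sketch of the gimmick, using
the stock email package and modern Python syntax.  The tokens() helper and
the sample message are illustrative only, not from timtest.py -- the real
tokenizer does more before this step (e.g., special tagging of embedded
URLs, as the comments in the diff explain); only the order matters here:
case-normalize, then strip probable tags from text/plain parts only.

    import email
    import re

    # Same cheap-ass tag matcher the patch adds (html_re in the diff below).
    html_re = re.compile(r"""
        <
        [^\s<>]     # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
        [^>]{0,128} # search for the end '>', but don't run wild
        >
    """, re.VERBOSE)

    def tokens(msg_text):
        # Illustrative helper, not from timtest.py:  walk the text parts,
        # case-normalize, and strip probable tags from text/plain parts only.
        msg = email.message_from_string(msg_text)
        for part in msg.walk():
            if part.get_content_maintype() != "text":
                continue
            text = part.get_payload()
            if not isinstance(text, str):
                continue            # keep the sketch simple
            text = text.lower()
            if part.get_content_type() == "text/plain":
                text = html_re.sub(' ', text)   # the new gimmick
            for token in text.split():
                yield token

    # A plain text msg merely *talking about* HTML no longer yields tag tokens:
    msg = "Subject: test\n\nWrite <a href='http://example.com'>this</a> tag."
    print(list(tokens(msg)))

Before the patch, a plain text message like that contributed whitespace-split
tokens containing the raw tag text, and (per the comments below) HTML
decorations tend to carry spamprob 0.99 in a technical corpus.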


false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.050  0.050  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.000  0.025  lost
    0.075  0.075  tied
    0.050  0.025  won
    0.025  0.025  tied
    0.000  0.025  lost
    0.050  0.075  lost
    0.025  0.025  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.050  0.050  tied

won   1 times
tied 16 times
lost  3 times

total unique fp went from 8 to 9

false negative percentages
    0.945  0.909  won
    0.836  0.800  won
    1.200  1.091  won
    1.418  1.381  won
    1.455  1.491  lost
    1.091  1.055  won
    1.091  0.945  won
    1.236  1.236  tied
    1.564  1.564  tied
    1.236  1.200  won
    1.563  1.454  won
    1.563  1.599  lost
    1.236  1.236  tied
    0.836  0.800  won
    0.873  0.836  won
    1.236  1.236  tied
    1.273  1.236  won
    1.018  1.055  lost
    1.091  1.127  lost
    1.490  1.381  won

won  12 times
tied  4 times
lost  4 times

total unique fn went from 292 to 284


Index: timtest.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/timtest.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** timtest.py	2 Sep 2002 19:23:40 -0000	1.12
--- timtest.py	3 Sep 2002 02:13:46 -0000	1.13
***************
*** 52,55 ****
--- 52,56 ----
      return text - redundant_html
  
+ ##############################################################################
  # To fold case or not to fold case?  I didn't want to fold case, because
  # it hides information in English, and I have no idea what .lower() does
***************
*** 79,82 ****
--- 80,84 ----
  
  
+ ##############################################################################
  # Character n-grams or words?
  #
***************
*** 162,165 ****
--- 164,366 ----
  # msgs, although 5-grams seems to hate them more.
  
+ 
+ ##############################################################################
+ # How to tokenize?
+ #
+ # I started with string.split() merely for speed.  Over time I realized it
+ # was making interesting context distinctions qualitatively akin to n-gram
+ # schemes; e.g., "free!!" is a much stronger spam indicator than "free".  But
+ # unlike n-grams (whether word- or character-based) under Graham's scoring
+ # scheme, this mild context dependence never seems to go over the edge in
+ # giving "too much" credence to an unlucky phrase.
+ #
+ # OTOH, compared to "searching for words", it increases the size of the
+ # database substantially, less than but close to a factor of 2.  This is very
+ # much less than a word bigram scheme bloats it, but as always an increase
+ # isn't justified unless the results are better.
+ #
+ # Following are stats comparing
+ #
+ #    for token in text.split():  # left column
+ #
+ # to
+ #
+ #    for token in re.findall(r"[\w$\-\x80-\xff]+", text):  # right column
+ #
+ # text is case-normalized (text.lower()) in both cases, and the runs were
+ # identical in all other respects.  The results clearly favor the split()
+ # gimmick, although they vaguely suggest that some sort of compromise
+ # may do as well with less database burden; e.g., *perhaps* folding runs of
+ # "punctuation" characters into a canonical representative could do that.
+ # But the database size is reasonable without that, and plain split() avoids
+ # having to worry about how to "fold punctuation" in languages other than
+ # English.
+ #
+ #    false positive percentages
+ #        0.000  0.000  tied
+ #        0.000  0.050  lost
+ #        0.050  0.150  lost
+ #        0.000  0.025  lost
+ #        0.025  0.050  lost
+ #        0.025  0.075  lost
+ #        0.050  0.150  lost
+ #        0.025  0.000  won
+ #        0.025  0.075  lost
+ #        0.000  0.025  lost
+ #        0.075  0.150  lost
+ #        0.050  0.050  tied
+ #        0.025  0.050  lost
+ #        0.000  0.025  lost
+ #        0.050  0.025  won
+ #        0.025  0.000  won
+ #        0.025  0.025  tied
+ #        0.000  0.025  lost
+ #        0.025  0.075  lost
+ #        0.050  0.175  lost
+ #
+ #    won   3 times
+ #    tied  3 times
+ #    lost 14 times
+ #
+ #    total unique fp went from 8 to 20
+ #
+ #    false negative percentages
+ #        0.945  1.200  lost
+ #        0.836  1.018  lost
+ #        1.200  1.200  tied
+ #        1.418  1.636  lost
+ #        1.455  1.418  won
+ #        1.091  1.309  lost
+ #        1.091  1.272  lost
+ #        1.236  1.563  lost
+ #        1.564  1.855  lost
+ #        1.236  1.491  lost
+ #        1.563  1.599  lost
+ #        1.563  1.781  lost
+ #        1.236  1.709  lost
+ #        0.836  0.982  lost
+ #        0.873  1.382  lost
+ #        1.236  1.527  lost
+ #        1.273  1.418  lost
+ #        1.018  1.273  lost
+ #        1.091  1.091  tied
+ #        1.490  1.454  won
+ #
+ #    won   2 times
+ #    tied  2 times
+ #    lost 16 times
+ #
+ #    total unique fn went from 292 to 302
+ 
+ 
+ ##############################################################################
+ # What about HTML?
+ #
+ # Computer geeks seem to view use of HTML in mailing lists and newsgroups as
+ # a mortal sin.  Normal people don't, but so it goes:  in a technical list/
+ # group, every HTML decoration has spamprob 0.99, there are lots of unique
+ # HTML decorations, and lots of them appear at the very start of the message
+ # so that Graham's scoring scheme latches on to them tight.  As a result,
+ # any plain text message just containing an HTML example is likely to be
+ # judged spam (every HTML decoration is an extreme).
+ #
+ # So if a message is multipart/alternative with both text/plain and text/html
+ # branches, we ignore the latter, else newbies would never get a message
+ # through.  If a message is just HTML, it has virtually no chance of getting
+ # through.
+ #
+ # In an effort to let normal people use mailing lists too <wink>, and to
+ # alleviate the woes of messages merely *discussing* HTML practice, I
+ # added a gimmick to strip HTML tags after case-normalization and after
+ # special tagging of embedded URLs.  This consisted of a regexp sub pattern,
+ # where instances got replaced by single blanks:
+ #
+ #    html_re = re.compile(r"""
+ #        <
+ #        [^\s<>]     # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
+ #        [^>]{0,128} # search for the end '>', but don't chew up the world
+ #        >
+ #    """, re.VERBOSE)
+ #
+ # and then
+ #
+ #    text = html_re.sub(' ', text)
+ #
+ # Alas, little good came of this:
+ #
+ #    false positive percentages
+ #        0.000  0.000  tied
+ #        0.000  0.000  tied
+ #        0.050  0.075  lost
+ #        0.000  0.000  tied
+ #        0.025  0.025  tied
+ #        0.025  0.025  tied
+ #        0.050  0.050  tied
+ #        0.025  0.025  tied
+ #        0.025  0.025  tied
+ #        0.000  0.050  lost
+ #        0.075  0.100  lost
+ #        0.050  0.050  tied
+ #        0.025  0.025  tied
+ #        0.000  0.025  lost
+ #        0.050  0.050  tied
+ #        0.025  0.025  tied
+ #        0.025  0.025  tied
+ #        0.000  0.000  tied
+ #        0.025  0.050  lost
+ #        0.050  0.050  tied
+ #
+ #    won   0 times
+ #    tied 15 times
+ #    lost  5 times
+ #
+ #    total unique fp went from 8 to 12
+ #
+ #    false negative percentages
+ #        0.945  1.164  lost
+ #        0.836  1.418  lost
+ #        1.200  1.272  lost
+ #        1.418  1.272  won
+ #        1.455  1.273  won
+ #        1.091  1.382  lost
+ #        1.091  1.309  lost
+ #        1.236  1.381  lost
+ #        1.564  1.745  lost
+ #        1.236  1.564  lost
+ #        1.563  1.781  lost
+ #        1.563  1.745  lost
+ #        1.236  1.455  lost
+ #        0.836  0.982  lost
+ #        0.873  1.309  lost
+ #        1.236  1.381  lost
+ #        1.273  1.273  tied
+ #        1.018  1.273  lost
+ #        1.091  1.200  lost
+ #        1.490  1.599  lost
+ #
+ #    won   2 times
+ #    tied  1 times
+ #    lost 17 times
+ #
+ #    total unique fn went from 292 to 327
+ #
+ # The messages merely discussing HTML were no longer fps, so it did what it
+ # intended there.  But the f-n rate nearly doubled on at least one run -- so
+ # strong a set of spam indicators is the mere presence of HTML.  The increase
+ # in the number of fps, even though the HTML-discussing msgs left that
+ # category, remains mysterious to me, but it wasn't a significant increase,
+ # so I let it drop.
+ #
+ # Later:  If I simply give up on making mailing lists friendly to my sisters
+ # (they're not nerds, and create wonderfully attractive HTML msgs), a
+ # compromise is to strip HTML tags from only text/plain msgs.  That's
+ # principled enough so far as it goes, and eliminates the HTML-discussing
+ # false positives.  It remains disturbing that the f-n rate on pure HTML
+ # msgs increases significantly when stripping tags, so the code here doesn't
+ # do that part.  However, even after stripping tags, the rates above show that
+ # at least 98% of spams are still correctly identified as spam.
+ # XXX So, if another way is found to slash the f-n rate, the decision here
+ # XXX not to strip HTML from HTML-only msgs should be revisited.
+ 
  url_re = re.compile(r"""
      (https? | ftp)  # capture the protocol
***************
*** 175,178 ****
--- 376,387 ----
  has_highbit_char = re.compile(r"[\x80-\xff]").search
  
+ # Cheap-ass gimmick to probabilistically find HTML/XML tags.
+ html_re = re.compile(r"""
+     <
+     [^\s<>]     # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
+     [^>]{0,128} # search for the end '>', but don't run wild
+     >
+ """, re.VERBOSE)
+ 
  # I'm usually just splitting on whitespace, but for subject lines I want to
  # break things like "Python/Perl comparison?" up.  OTOH, I don't want to
***************
*** 287,290 ****
--- 496,503 ----
                  for chunk in urlsep_re.split(piece):
                      yield prefix + chunk
+ 
+         # Remove HTML/XML tags if it's a plain text message.
+         if part.get_content_type() == "text/plain":
+             text = html_re.sub(' ', text)
  
          # Tokenize everything.
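
For anyone skimming, the split()-vs-findall() comparison discussed in the
new "How to tokenize?" comments boils down to the snippet below.  The sample
text is invented here; the point is only that a plain whitespace split
preserves punctuation context like "free!!", which the stats in those
comments favor.

    import re

    text = "FREE!! Click here for your free prize".lower()

    # Left column in those stats:  whitespace split keeps trailing
    # punctuation, so "free!!" and "free" remain distinct tokens.
    print(text.split())
    # -> ['free!!', 'click', 'here', 'for', 'your', 'free', 'prize']

    # Right column:  the regexp folds punctuation away, so both collapse
    # to the single token "free".
    print(re.findall(r"[\w$\-\x80-\xff]+", text))
    # -> ['free', 'click', 'here', 'for', 'your', 'free', 'prize']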