[Python-checkins] python/nondist/sandbox/spambayes timtest.py,1.5,1.6

tim_one@users.sourceforge.net
Sat, 31 Aug 2002 21:42:54 -0700


Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv18035

Modified Files:
	timtest.py 
Log Message:
Long new comment block summarizing all my experiments with character
n-grams.  Bottom line is that they have nothing going for them, and a
lot going against them, under Graham's scheme.  I believe there may
still be a place for them in *part* of a word-based tokenizer, though.


Index: timtest.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/timtest.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** timtest.py	31 Aug 2002 21:33:10 -0000	1.5
--- timtest.py	1 Sep 2002 04:42:51 -0000	1.6
***************
*** 57,71 ****
      return text - redundant_html
  
- url_re = re.compile(r"""
-     (https? | ftp)  # capture the protocol
-     ://             # skip the boilerplate
-     # Do a reasonable attempt at detecting the end.  It may or may not
-     # be in HTML, may or may not be in quotes, etc.  If it's full of %
-     # escapes, cool -- that's a clue too.
-     ([^\s<>'"\x7f-\xff]+)  # capture the guts
- """, re.IGNORECASE | re.VERBOSE)
- 
- urlsep_re = re.compile(r"[;?:@&=+,$.]")
- 
  # To fold case or not to fold case?  I didn't want to fold case, because
  # it hides information in English, and I have no idea what .lower() does
--- 57,60 ----
***************
*** 93,96 ****
--- 82,180 ----
  # Talk about "money" and "lucrative" is indistinguishable now from talk
  # about "MONEY" and "LUCRATIVE", and spam mentions MONEY a lot.
+ 
+ 
+ # Character n-grams or words?
+ #
+ # With careful multiple-corpora c.l.py tests sticking to case-folded decoded
+ # text-only portions, and ignoring headers, and with identical special
+ # parsing & tagging of embedded URLs:
+ #
+ # Character 3-grams gave 5x as many false positives as split-on-whitespace
+ # (s-o-w).  The f-n rate was also significantly worse, but within a factor
+ # of 2.  So character 3-grams lost across the board.
+ #
+ # Character 5-grams gave 32% more f-ps than split-on-whitespace, but the
+ # s-o-w fp rate across 20,000 presumed-hams was 0.1%, and this is the
+ # difference between 23 and 34 f-ps.  There aren't enough there to say that's
+ # significantly more with killer-high confidence.  There were plenty of f-ns,
+ # though, and the f-n rate with character 5-grams was substantially *worse*
+ # than with character 3-grams (which in turn was substantially worse than
+ # with s-o-w).
+ #
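For concreteness, a minimal sketch of the two tokenization schemes being
compared; the helper names are illustrative and this is not the code used in
these runs:

    def tokenize_split_on_whitespace(text):
        # s-o-w:  every whitespace-separated chunk is one token.
        return text.split()

    def tokenize_char_ngrams(text, n=5):
        # Character n-grams:  every window of n consecutive characters
        # is one token, so a text of length L yields about L tokens,
        # most of them overlapping.
        return [text[i:i+n] for i in range(len(text) - n + 1)]

A 60-character line yields 56 overlapping 5-grams but only a handful of
whitespace-separated words, which is where the size and speed differences
below come from.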
+ # Training on character 5-grams creates many more unique tokens than s-o-w:
+ # a typical run bloated to 150MB process size.  It also ran a lot slower than
+ # s-o-w, partly related to heavy indexing of a huge out-of-cache wordinfo
+ # dict.  I rarely noticed disk activity when running s-o-w, so rarely bothered
+ # to look at process size; it was under 30MB last time I looked.
+ #
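One way to see that bloat directly is to count distinct tokens under each
scheme; a small sketch, assuming tokenizers like the ones sketched above:

    def count_unique_tokens(msgs, tokenize):
        # Rough gauge of wordinfo size:  how many distinct tokens a
        # given tokenizer produces across a corpus.
        seen = set()
        for msg in msgs:
            seen.update(tokenize(msg))
        return len(seen)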
+ # Figuring out *why* a msg scored as it did proved much more mysterious when
+ # working with character n-grams:  they often had no obvious "meaning".  In
+ # contrast, it was always easy to figure out what s-o-w was picking up on.
+ # 5-grams flagged a msg from Christian Tismer as spam, where he was discussing
+ # the speed of tasklets under his new implementation of stackless:
+ #
+ #     prob = 0.99999998959
+ #     prob('ed sw') = 0.01
+ #     prob('http0:pgp') = 0.01
+ #     prob('http0:python') = 0.01
+ #     prob('hlon ') = 0.99
+ #     prob('http0:wwwkeys') = 0.01
+ #     prob('http0:starship') = 0.01
+ #     prob('http0:stackless') = 0.01
+ #     prob('n xp ') = 0.99
+ #     prob('on xp') = 0.99
+ #     prob('p 150') = 0.99
+ #     prob('lon x') = 0.99
+ #     prob(' amd ') = 0.99
+ #     prob(' xp 1') = 0.99
+ #     prob(' athl') = 0.99
+ #     prob('1500+') = 0.99
+ #     prob('xp 15') = 0.99
+ #
+ # The spam decision was baffling until I realized that *all* the high-
+ # probability spam 5-grams there came out of a single phrase:
+ #
+ #     AMD Athlon XP 1500+
+ #
+ # So Christian was punished for using a machine lots of spam tries to sell
+ # <wink>.  In a classic Bayesian classifier, this probably wouldn't have
+ # mattered, but Graham's throws away almost all the 5-grams from a msg,
+ # saving only the about-a-dozen farthest from a neutral 0.5.  So one bad
+ # phrase can kill you!  This appears to happen very rarely, but happened
+ # more than once.
+ #
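A sketch of the selection and combining steps being described, with made-up
helper names; keep=16 simply matches the sixteen clues listed above:

    def most_extreme(probs, keep=16):
        # Clue selection:  keep only the tokens whose spamprobs are
        # farthest from the neutral 0.5; everything else in the msg is
        # thrown away before scoring.
        ranked = sorted(probs.items(),
                        key=lambda item: abs(item[1] - 0.5),
                        reverse=True)
        return ranked[:keep]

    def combined_spamprob(clues):
        # Combine the surviving spamprobs:
        #     P = prod(p) / (prod(p) + prod(1-p))
        prod_p = prod_not_p = 1.0
        for token, p in clues:
            prod_p *= p
            prod_not_p *= 1.0 - p
        return prod_p / (prod_p + prod_not_p)

Feeding the sixteen clues above (ten at 0.99, six at 0.01) through these two
steps gives 0.99999998959, the prob shown, and every one of the ten 0.99
clues comes out of the single Athlon phrase.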
+ # The conclusion is that character n-grams have almost nothing to recommend
+ # them under Graham's scheme:  harder to work with, slower, much larger
+ # database, worse results, and prone to rare mysterious disasters.
+ #
+ # There's one area they won hands-down:  detecting spam in what I assume are
+ # Asian languages.  In those msgs the s-o-w scheme sometimes finds only
+ # line-ends to split on, and then a "hey, this 'word' is way too big!
+ # let's ignore it" gimmick kicks in and produces no tokens at all.
+ #
+ # XXX Try producing character n-grams then under the s-o-w scheme, instead
+ # XXX of ignoring the blob.  This was too unattractive before because we
+ # XXX weren't decoding base64 or qp.  We're still not decoding uuencoded
+ # XXX stuff.  So try this only if there are high-bit characters in the blob.
+ #
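A sketch of that fallback; maxlen and n are made-up knobs here, not values
from timtest.py:

    def tokenize_big_word(word, maxlen=12, n=5):
        # Keep normal words as-is, but when a "word" is way too big,
        # emit character n-grams for it instead of ignoring it -- and
        # only if it contains high-bit characters.
        if len(word) <= maxlen:
            yield word
        elif any(ord(c) >= 0x80 for c in word):
            for i in range(len(word) - n + 1):
                yield word[i:i+n]
        # else:  too big and pure ASCII -- ignore the blob, as before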
+ # Interesting:  despite that odd example above, the *kinds* of f-p mistakes
+ # 5-grams made were very much like s-o-w made -- I recognized almost all of
+ # the 5-gram f-p messages from previous s-o-w runs.  For example, both
+ # schemes have a particular hatred for conference announcements, although
+ # s-o-w stopped hating them after folding case.  But 5-grams still hate them.
+ # Both schemes also hate msgs discussing HTML with examples, with about equal
+ # passion.  Both schemes hate brief "please subscribe [unsubscribe] me"
+ # msgs, although 5-grams seems to hate them more.
+ 
+ url_re = re.compile(r"""
+     (https? | ftp)  # capture the protocol
+     ://             # skip the boilerplate
+     # Do a reasonable attempt at detecting the end.  It may or may not
+     # be in HTML, may or may not be in quotes, etc.  If it's full of %
+     # escapes, cool -- that's a clue too.
+     ([^\s<>'"\x7f-\xff]+)  # capture the guts
+ """, re.IGNORECASE | re.VERBOSE)
+ 
+ urlsep_re = re.compile(r"[;?:@&=+,$.]")
  
  def tokenize(string):
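For reference, one way url_re and urlsep_re can work together; the
tokenize_urls name and the proto:/url: tag spellings are illustrative, not
necessarily what tokenize() does:

    def tokenize_urls(text):
        # Find each URL, tag its protocol, then split the captured
        # "guts" on the separator characters so the interesting pieces
        # become individual tokens.
        for proto, guts in url_re.findall(text):
            yield "proto:" + proto
            for chunk in urlsep_re.split(guts):
                if chunk:
                    yield "url:" + chunk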