[Spambayes-checkins] spambayes timtoken.py,1.4,1.5

Tim Peters tim_one@users.sourceforge.net
Fri, 06 Sep 2002 18:39:57 -0700


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv27364

Modified Files:
	timtoken.py 
Log Message:
Comments about how long a word should be; the current values are the
best.
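
For readers skimming the diff, the length rule described in the new comment
block can be sketched roughly as below. This is a minimal illustration, not
the code in timtoken.py: the min_len/max_len parameters, the "skip:" token
format, and rounding the length down to a multiple of 10 are all assumptions
made for the example. Only the bounds 3 and 12 come from the comments.

```python
def tokenize_word(word, min_len=3, max_len=12):
    """Yield zero or one token for a single word (illustrative sketch)."""
    n = len(word)
    if n < min_len:
        # Words shorter than the lower bound are ignored completely.
        return
    if n <= max_len:
        # "Regular size" words are passed through unchanged.
        yield word
    else:
        # Hypothetical summary token: first letter plus approximate length.
        # The real token format is not shown in this diff.
        yield "skip:%s %d" % (word[0], n // 10 * 10)


if __name__ == "__main__":
    for w in ["hi", "cheap", "abcdefghijkl", "abcdefghijklm"]:
        print(repr(w), list(tokenize_word(w)))
```

Changing min_len or max_len in such a sketch corresponds to the experiments
reported in the comments (upper bound 11 or 13, lower bound 4 or none).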


Index: timtoken.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtoken.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** timtoken.py	7 Sep 2002 00:31:56 -0000	1.4
--- timtoken.py	7 Sep 2002 01:39:55 -0000	1.5
***************
*** 356,359 ****
--- 356,379 ----
  # XXX not to strip HTML from HTML-only msgs should be revisited.
  
+ ##############################################################################
+ # How big should "a word" be?
+ #
+ # As I write this, words shorter than 3 chars are ignored completely, and
+ # words longer than 12 chars are special-cased:  each is replaced with a
+ # summary token meaning "I skipped about so-and-so many chars starting with
+ # such-and-such a letter".  This makes sense for English if most of the
+ # info is in "regular size" words.
+ #
+ # A test run boosting the upper bound to 13 had no effect on the f-p rate,
+ # and did a little better or worse than 12 across runs -- overall, no
+ # significant difference.  The database is smaller at 12, so there's nothing
+ # in favor of 13.  A test at 11 showed a slight but consistent bad effect on
+ # the f-n rate (lost 12 times, won once, tied 7 times).
+ #
+ # A test with no lower bound showed a significant increase in the f-n rate.
+ # Curious, but not worth digging into.  Boosting the lower bound to 4 is a
+ # worse idea:  f-p and f-n rates both suffered significantly then.  I didn't
+ # try testing with lower bound 2.
+ 
  url_re = re.compile(r"""
      (https? | ftp)  # capture the protocol
***************
*** 386,399 ****
  def tokenize_word(word, _len=len):
      n = _len(word)
- 
-     # XXX How big should "a word" be?
-     # XXX I expect 12 is fine -- a test run boosting to 13 had no effect
-     # XXX on f-p rate, and did a little better or worse than 12 across
-     # XXX runs -- overall, no significant difference.  It's only "common
-     # XXX sense" so far driving the exclusion of lengths 1 and 2.
-     # XXX Later:  A test with no lower bound showed a significant increase
-     # XXX in the f-n rate.  Curious!
-     # XXX Later:  Boosting the lower bound to 4 is a Bad Idea too:  f-p and
-     # XXX f-n rates both suffered then.
  
      # Make sure this range matches in tokenize().
--- 406,409 ----