[Spambayes-checkins] spambayes Options.py,1.61,1.62 tokenizer.py,1.52,1.53

Anthony Baxter anthonybaxter@users.sourceforge.net
Mon Oct 28 07:04:15 2002


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv23052

Modified Files:
	Options.py tokenizer.py 
Log Message:
Added skip_max_word_size as an option, specifying how long a word can be
before it's skipped. I find that boosting it from 12 (the default) to 20
significantly reduces the number of 'unsure' messages. See my post to the
list for more.


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.61
retrieving revision 1.62
diff -C2 -d -r1.61 -r1.62
*** Options.py	27 Oct 2002 22:56:15 -0000	1.61
--- Options.py	28 Oct 2002 07:04:12 -0000	1.62
***************
*** 91,94 ****
--- 91,100 ----
  generate_long_skips: True
  
+ #
+ # Maximum word length tokenized as-is.  Words longer than this
+ # trigger a 'long skip' instead.
+ #
+ skip_max_word_size: 12
+ 
  # Generate tokens which resemble the posting time in 6-minute buckets:
  # int((h*60+m)/10).
***************
*** 170,174 ****
  # Display spam when
  #     show_spam_lo <= spamprob <= show_spam_hi
! # and likewise for ham.  The defaults here do not show anything. 
  show_spam_lo: 1.0
  show_spam_hi: 0.0
--- 176,180 ----
  # Display spam when
  #     show_spam_lo <= spamprob <= show_spam_hi
! # and likewise for ham.  The defaults here do not show anything.
  show_spam_lo: 1.0
  show_spam_hi: 0.0
***************
*** 311,314 ****
--- 317,321 ----
                    'count_all_header_lines': boolean_cracker,
                    'generate_long_skips': boolean_cracker,
+                   'skip_max_word_size': int_cracker,
                    'extract_dow': boolean_cracker,
                    'generate_time_buckets': boolean_cracker,

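For context, the new dictionary entry maps the option name to a "cracker"
that parses the raw config-file string into the right Python type. A minimal
standalone sketch of that pattern (the names and tuple-free structure here
are illustrative, not the actual spambayes Options.py implementation):

```python
# Hedged sketch: a table maps option names to converter functions that
# "crack" raw configuration strings into typed values.

def boolean_cracker(value):
    # Accept a few common spellings of a true value.
    return value.strip().lower() in ("true", "yes", "1")

def int_cracker(value):
    return int(value.strip())

all_options = {
    'generate_long_skips': boolean_cracker,
    'skip_max_word_size': int_cracker,   # the option added in this checkin
}

def crack(name, raw):
    """Parse the raw string for a named option using its cracker."""
    return all_options[name](raw)

print(crack('skip_max_word_size', '20'))   # -> 20
print(crack('generate_long_skips', 'True'))  # -> True
```

Registering `skip_max_word_size` with `int_cracker` is what lets users set
it in their configuration file rather than editing tokenizer.py.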
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.52
retrieving revision 1.53
diff -C2 -d -r1.52 -r1.53
*** tokenizer.py	27 Oct 2002 22:34:08 -0000	1.52
--- tokenizer.py	28 Oct 2002 07:04:12 -0000	1.53
***************
*** 589,596 ****
                  yield "fname piece:" + piece
  
! def tokenize_word(word, _len=len):
      n = _len(word)
      # Make sure this range matches in tokenize().
!     if 3 <= n <= 12:
          yield word
  
--- 589,596 ----
                  yield "fname piece:" + piece
  
! def tokenize_word(word, _len=len, maxword=options.skip_max_word_size):
      n = _len(word)
      # Make sure this range matches in tokenize().
!     if 3 <= n <= maxword:
          yield word
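To see what the change does in practice, here is a minimal standalone sketch
(a simplified re-creation, not the actual spambayes tokenizer; the skip-token
format is an assumption for illustration) of how `maxword` decides whether a
word is kept or skipped:

```python
# Hedged sketch of tokenize_word with a configurable maximum word size.

def tokenize_word(word, maxword=12):
    """Yield the word itself if its length is in range [3, maxword];
    otherwise yield a coarse 'skip' token (hypothetical format) that
    records the first character and the rough length bucket."""
    n = len(word)
    if 3 <= n <= maxword:
        yield word
    else:
        # Stand-in for the 'long skip' tokens the real tokenizer can emit
        # when generate_long_skips is enabled.
        yield "skip:%c %d" % (word[0], n // 10 * 10)

# With the default cutoff of 12, a 15-character word becomes a skip token:
print(list(tokenize_word("extraordinary!!", maxword=12)))  # -> ['skip:e 10']
# Raising the cutoff to 20, as the log message suggests, keeps it intact:
print(list(tokenize_word("extraordinary!!", maxword=20)))
```

This illustrates why the checkin's default of 12 versus a setting of 20 can
change classification: long words that previously collapsed into generic skip
tokens survive as real evidence for the classifier.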