[Spambayes-checkins] spambayes Options.py,1.39,1.40 tokenizer.py,1.45,1.46

Skip Montanaro montanaro@users.sourceforge.net
Mon, 30 Sep 2002 14:56:29 -0700


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8971

Modified Files:
	Options.py tokenizer.py 
Log Message:
Allow users to disable the long-word skip tokens (e.g. "skip:c 70"), on the
assumption that people who receive legitimate mail containing attachments
would otherwise be penalized.



Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.39
retrieving revision 1.40
diff -C2 -d -r1.39 -r1.40
*** Options.py	29 Sep 2002 18:03:39 -0000	1.39
--- Options.py	30 Sep 2002 21:56:27 -0000	1.40
***************
*** 93,96 ****
--- 93,102 ----
  mine_received_headers: False
  
+ # If your ham corpus is generated from sources which contain few, if any,
+ # attachments, you probably want to leave this alone.  If you have many
+ # legitimate correspondents who send you attachments (Excel spreadsheets,
+ # etc.), you might want to set this to False.
+ generate_long_skips: True
+ 
  [TestDriver]
  # These control various displays in class TestDriver.Driver, and Tester.Test.
***************
*** 223,226 ****
--- 229,233 ----
                    'safe_headers': ('get', lambda s: Set(s.split())),
                    'count_all_header_lines': boolean_cracker,
+                   'generate_long_skips': boolean_cracker,
                    'mine_received_headers': boolean_cracker,
                    'check_octets': boolean_cracker,

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.45
retrieving revision 1.46
diff -C2 -d -r1.45 -r1.46
*** tokenizer.py	29 Sep 2002 20:20:57 -0000	1.45
--- tokenizer.py	30 Sep 2002 21:56:27 -0000	1.46
***************
*** 645,649 ****
              # XXX Figure out why, and/or see if some other way of summarizing
              # XXX this info has greater benefit.
!             yield "skip:%c %d" % (word[0], n // 10 * 10)
              if has_highbit_char(word):
                  hicount = 0
--- 645,650 ----
              # XXX Figure out why, and/or see if some other way of summarizing
              # XXX this info has greater benefit.
!             if options.generate_long_skips:
!                 yield "skip:%c %d" % (word[0], n // 10 * 10)
              if has_highbit_char(word):
                  hicount = 0
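
For readers who want to see the effect of the new option in isolation, here is
a minimal standalone sketch of the changed line. It is not the real tokenizer:
the function name long_word_tokens and its generate_long_skips keyword argument
are hypothetical stand-ins for the surrounding tokenizer code and for
options.generate_long_skips. Only the format string and the length bucketing
(n // 10 * 10) are taken from the diff above.

```python
def long_word_tokens(word, generate_long_skips=True):
    """Yield the summary token for a word too long to keep as a token.

    Hypothetical helper: in the real tokenizer the flag is read from
    options.generate_long_skips rather than passed as an argument.
    """
    n = len(word)
    if generate_long_skips:
        # First character of the word, plus its length rounded down
        # to a multiple of 10 (73 -> 70).
        yield "skip:%c %d" % (word[0], n // 10 * 10)


# A 73-character "word", such as a line of an encoded attachment.
word = "c" + "x" * 72

tokens_on = list(long_word_tokens(word))
tokens_off = list(long_word_tokens(word, generate_long_skips=False))

print(tokens_on)   # ['skip:c 70']
print(tokens_off)  # []
```

With the option left at True the word is summarized as a single "skip:" token;
with it set to False no such token is produced, so long runs from attachments
no longer contribute these clues to the classifier.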