[Spambayes-checkins] spambayes Options.py,1.39,1.40 tokenizer.py,1.45,1.46
Skip Montanaro
montanaro@users.sourceforge.net
Mon, 30 Sep 2002 14:56:29 -0700
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8971
Modified Files:
Options.py tokenizer.py
Log Message:
Allow users to disable the long-word skip tokens (e.g. "skip:c 70"), on the
assumption that people who receive mail containing attachments will be
penalized by them.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.39
retrieving revision 1.40
diff -C2 -d -r1.39 -r1.40
*** Options.py 29 Sep 2002 18:03:39 -0000 1.39
--- Options.py 30 Sep 2002 21:56:27 -0000 1.40
***************
*** 93,96 ****
--- 93,102 ----
mine_received_headers: False
+ # If your ham corpus is generated from sources which contain few, if any,
+ # attachments, you probably want to leave this alone. If you have many
+ # legitimate correspondents who send you attachments (Excel spreadsheets,
+ # etc), you might want to set this to False.
+ generate_long_skips: True
+
[TestDriver]
# These control various displays in class TestDriver.Driver, and Tester.Test.
***************
*** 223,226 ****
--- 229,233 ----
'safe_headers': ('get', lambda s: Set(s.split())),
'count_all_header_lines': boolean_cracker,
+ 'generate_long_skips': boolean_cracker,
'mine_received_headers': boolean_cracker,
'check_octets': boolean_cracker,
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.45
retrieving revision 1.46
diff -C2 -d -r1.45 -r1.46
*** tokenizer.py 29 Sep 2002 20:20:57 -0000 1.45
--- tokenizer.py 30 Sep 2002 21:56:27 -0000 1.46
***************
*** 645,649 ****
# XXX Figure out why, and/or see if some other way of summarizing
# XXX this info has greater benefit.
! yield "skip:%c %d" % (word[0], n // 10 * 10)
if has_highbit_char(word):
hicount = 0
--- 645,650 ----
# XXX Figure out why, and/or see if some other way of summarizing
# XXX this info has greater benefit.
! if options.generate_long_skips:
! yield "skip:%c %d" % (word[0], n // 10 * 10)
if has_highbit_char(word):
hicount = 0
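
For readers who want to see what the new option does in isolation, here is a
minimal sketch. It is not the real spambayes tokenizer: the Options class,
the tokenize_word() helper, and the 12-character word limit are assumptions
made for illustration. Only the option name generate_long_skips and the
"skip:%c %d" format come from the diff above.

```python
# Illustrative sketch only; not the actual spambayes code.
# Options, tokenize_word() and max_len=12 are hypothetical stand-ins.

class Options:
    # Mirrors the new default from Options.py revision 1.40.
    generate_long_skips = True

options = Options()

def tokenize_word(word, max_len=12):
    """Yield tokens for one whitespace-delimited word."""
    n = len(word)
    if 3 <= n <= max_len:
        # Ordinary words are passed through unchanged.
        yield word
    elif n > max_len:
        # Long words (often attachment data) are summarized by their
        # first character and their length rounded down to a multiple of 10.
        if options.generate_long_skips:
            yield "skip:%c %d" % (word[0], n // 10 * 10)

long_word = "c" + "x" * 74          # 75 characters, starts with "c"

print(list(tokenize_word(long_word)))   # ['skip:c 70']

options.generate_long_skips = False
print(list(tokenize_word(long_word)))   # []
```

With the option left at True the long word is summarized as "skip:c 70",
matching the example in the log message. With it set to False the word
produces no skip token at all, so it cannot count against mail that carries
attachments.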