[Spambayes-checkins] spambayes Options.py,1.61,1.62 tokenizer.py,1.52,1.53
Anthony Baxter <anthonybaxter@users.sourceforge.net>
Mon Oct 28 07:04:15 2002
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv23052
Modified Files:
Options.py tokenizer.py
Log Message:
Added skip_max_word_size as an option, to specify how long a word can be
before it's skipped. I find that boosting it from 12 (the default) to 20 makes
a significant improvement in the number of 'unsure' messages. See my post to
the list for more.
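The effect of the new option is easiest to see in isolation. Below is a
simplified sketch of the changed tokenize_word logic (the real spambayes
tokenizer also emits synthetic "skip:" tokens for long words when
generate_long_skips is enabled; that part is omitted here):

```python
def tokenize_word(word, maxword=12):
    """Yield the word itself only if its length is in [3, maxword].

    maxword mirrors the new skip_max_word_size option; 12 is the old
    hard-coded cutoff, 20 is the value suggested in the log message.
    """
    n = len(word)
    if 3 <= n <= maxword:
        yield word

# A 13-character word is skipped at the default cutoff of 12,
# but kept when the cutoff is raised to 20:
print(list(tokenize_word("supercalifrag")))              # []
print(list(tokenize_word("supercalifrag", maxword=20)))  # ['supercalifrag']
```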
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.61
retrieving revision 1.62
diff -C2 -d -r1.61 -r1.62
*** Options.py 27 Oct 2002 22:56:15 -0000 1.61
--- Options.py 28 Oct 2002 07:04:12 -0000 1.62
***************
*** 91,94 ****
--- 91,100 ----
generate_long_skips: True
+ #
+ # Length of words that triggers 'long skips'. Longer than this
+ # triggers a skip.
+ #
+ skip_max_word_size: 12
+
# Generate tokens which resemble the posting time in 6-minute buckets:
# int((h*60+m)/10).
***************
*** 170,174 ****
# Display spam when
# show_spam_lo <= spamprob <= show_spam_hi
! # and likewise for ham. The defaults here do not show anything.
show_spam_lo: 1.0
show_spam_hi: 0.0
--- 176,180 ----
# Display spam when
# show_spam_lo <= spamprob <= show_spam_hi
! # and likewise for ham. The defaults here do not show anything.
show_spam_lo: 1.0
show_spam_hi: 0.0
***************
*** 311,314 ****
--- 317,321 ----
'count_all_header_lines': boolean_cracker,
'generate_long_skips': boolean_cracker,
+ 'skip_max_word_size': int_cracker,
'extract_dow': boolean_cracker,
'generate_time_buckets': boolean_cracker,
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.52
retrieving revision 1.53
diff -C2 -d -r1.52 -r1.53
*** tokenizer.py 27 Oct 2002 22:34:08 -0000 1.52
--- tokenizer.py 28 Oct 2002 07:04:12 -0000 1.53
***************
*** 589,596 ****
yield "fname piece:" + piece
! def tokenize_word(word, _len=len):
n = _len(word)
# Make sure this range matches in tokenize().
! if 3 <= n <= 12:
yield word
--- 589,596 ----
yield "fname piece:" + piece
! def tokenize_word(word, _len=len, maxword=options.skip_max_word_size):
n = _len(word)
# Make sure this range matches in tokenize().
! if 3 <= n <= maxword:
yield word
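For anyone wanting to try the higher cutoff, the option can be set in a
local options file in the same "name: value" style the defaults above use
(the [Tokenizer] section name is assumed here from the surrounding option
groups; check Options.py for the exact section your build expects):

```
[Tokenizer]
skip_max_word_size: 20
```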