[Spambayes-checkins] spambayes timtest.py,1.12,1.13 tokenizer.py,1.6,1.7
Tim Peters
tim_one@users.sourceforge.net
Sun, 08 Sep 2002 14:08:18 -0700
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv4417
Modified Files:
timtest.py tokenizer.py
Log Message:
tokenize_word(): Stopped splitting the y in x@y on '.'. This improved the
f-n (false negative) rate. The big loser on the f-p (false positive) side
was a message consisting entirely of "Thanks guys", posted from an x@y
address where the unsplit y had a 0.99 spamprob, but where the pieces of
y split on '.' had two significantly lower spamprobs.
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** timtest.py 7 Sep 2002 16:15:45 -0000 1.12
--- timtest.py 8 Sep 2002 21:08:16 -0000 1.13
***************
*** 107,111 ****
  random.seed(hash(directory))
  random.shuffle(all)
! for fname in all[-500:]:
      yield Msg(directory, fname)
--- 107,111 ----
  random.seed(hash(directory))
  random.shuffle(all)
! for fname in all[-1500:-1000:]:
      yield Msg(directory, fname)
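
The timtest.py change swaps the last 500 shuffled filenames for a different window of 500 further from the end of the list. A minimal sketch of what the two slices select, assuming a hypothetical list of 2000 filenames in place of the real directory listing and a fixed seed in place of hash(directory):

```python
import random

# Hypothetical stand-in for the directory listing; the names are illustrative.
names = ["msg%04d" % i for i in range(2000)]
random.seed(12345)  # timtest.py seeds with hash(directory); fixed here for the sketch
random.shuffle(names)

old_sample = names[-500:]         # revision 1.12: the last 500 after shuffling
new_sample = names[-1500:-1000:]  # revision 1.13: trailing ':' leaves the step at its default of 1

# Both windows hold 500 names, and they do not overlap.
assert len(old_sample) == len(new_sample) == 500
assert not set(old_sample) & set(new_sample)
```

With fewer than 1500 names the newer slice returns fewer than 500 entries, and with 1000 or fewer it returns none.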
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** tokenizer.py 8 Sep 2002 18:54:09 -0000 1.6
--- tokenizer.py 8 Sep 2002 21:08:16 -0000 1.7
***************
*** 569,577 ****
  # Don't want to skip embedded email addresses.
  if n < 40 and '.' in word and word.count('@') == 1:
      p1, p2 = word.split('@')
      yield 'email name:' + p1
!     for piece in p2.split('.'):
!         yield 'email addr:' + piece
  # If there are any high-bit chars,
--- 569,578 ----
  # Don't want to skip embedded email addresses.
+ # An earlier scheme also split up the y in x@y on '.'. Not splitting
+ # improved the f-n rate; the f-p rate didn't care either way.
  if n < 40 and '.' in word and word.count('@') == 1:
      p1, p2 = word.split('@')
      yield 'email name:' + p1
!     yield 'email addr:' + p2
  # If there are any high-bit chars,
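
The effect of the tokenizer.py change on a single address can be seen in isolation. This is only a sketch: email_tokens and its split_domain flag are hypothetical names introduced for illustration (the real logic lives inside tokenize_word()), and the address is made up.

```python
def email_tokens(word, split_domain=False):
    """Yield the email-related tokens for one word (illustrative helper only)."""
    n = len(word)
    # Same test as in the diff: shortish word, contains '.', exactly one '@'.
    if n < 40 and '.' in word and word.count('@') == 1:
        p1, p2 = word.split('@')
        yield 'email name:' + p1
        if split_domain:
            # Revision 1.6 behaviour: one token per '.'-separated piece of y.
            for piece in p2.split('.'):
                yield 'email addr:' + piece
        else:
            # Revision 1.7 behaviour: the whole of y as a single token.
            yield 'email addr:' + p2


address = 'someone@mail.example.com'  # made-up address
old = list(email_tokens(address, split_domain=True))
new = list(email_tokens(address))

print(old)
print(new)
```

Under the older scheme the classifier sees separate tokens for 'mail', 'example' and 'com', each with its own spamprob; under the newer one it sees only 'email addr:mail.example.com', so a domain's score is no longer diluted or overridden by the scores of its pieces.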