[Spambayes-checkins] spambayes timtest.py,1.12,1.13 tokenizer.py,1.6,1.7

Tim Peters tim_one@users.sourceforge.net
Sun, 08 Sep 2002 14:08:18 -0700


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv4417

Modified Files:
	timtest.py tokenizer.py 
Log Message:
tokenize_word():  Stopped splitting the y in x@y on '.'.  This improved
the false-negative (f-n) rate.  The big loser for false positives (f-p)
was a message consisting entirely of "Thanks guys", posted from an x@y
address where the whole y had a 0.99 spamprob, but where y split into
pieces had two significantly lower spamprobs.
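
For illustration only, here is a minimal sketch of the two tokenization
schemes described above.  The function names and the sample address are
hypothetical; in tokenizer.py the logic sits inline in tokenize_word(),
behind a guard on word length and the presence of '.' and a single '@',
which is omitted here.

```python
def email_tokens_split(word):
    """Earlier scheme: one 'email addr:' token per dot-separated piece of y."""
    p1, p2 = word.split('@')
    yield 'email name:' + p1
    for piece in p2.split('.'):
        yield 'email addr:' + piece


def email_tokens_whole(word):
    """Scheme after this checkin: y is kept as a single token."""
    p1, p2 = word.split('@')
    yield 'email name:' + p1
    yield 'email addr:' + p2


# Hypothetical address, chosen only to show the difference in tokens.
word = 'x@mail.example.com'
old_tokens = list(email_tokens_split(word))
new_tokens = list(email_tokens_whole(word))
print(old_tokens)
print(new_tokens)
```

With the whole y as one token, the classifier sees a single spamprob for
the full domain instead of separate spamprobs for each piece.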


Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** timtest.py	7 Sep 2002 16:15:45 -0000	1.12
--- timtest.py	8 Sep 2002 21:08:16 -0000	1.13
***************
*** 107,111 ****
          random.seed(hash(directory))
          random.shuffle(all)
!         for fname in all[-500:]:
              yield Msg(directory, fname)
  
--- 107,111 ----
          random.seed(hash(directory))
          random.shuffle(all)
!         for fname in all[-1500:-1000:]:
              yield Msg(directory, fname)
  

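The timtest.py hunk above changes which slice of the shuffled file list
is sampled.  As a rough sketch, assuming a directory with at least 1500
files (the file names and the fixed seed below are made up; timtest.py
seeds with hash(directory)), the new slice selects a different block of
500 names, and the trailing ':' in the slice has no effect:

```python
import random

# Hypothetical file names standing in for one directory's contents.
names = ['msg%04d' % i for i in range(2000)]

random.seed(12345)  # assumption: any fixed seed; the original uses hash(directory)
random.shuffle(names)

old_sample = names[-500:]          # revision 1.12: the last 500 shuffled names
new_sample = names[-1500:-1000:]   # revision 1.13: an earlier block of 500
same_as_plain = names[-1500:-1000] # empty step is the same as no step

print(len(old_sample), len(new_sample))
print(set(old_sample).isdisjoint(new_sample))
```

With fewer than 1500 files the new slice would return fewer than 500
names, so the sketch only holds for sufficiently large directories.
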
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** tokenizer.py	8 Sep 2002 18:54:09 -0000	1.6
--- tokenizer.py	8 Sep 2002 21:08:16 -0000	1.7
***************
*** 569,577 ****
  
          # Don't want to skip embedded email addresses.
          if n < 40 and '.' in word and word.count('@') == 1:
              p1, p2 = word.split('@')
              yield 'email name:' + p1
!             for piece in p2.split('.'):
!                 yield 'email addr:' + piece
  
          # If there are any high-bit chars,
--- 569,578 ----
  
          # Don't want to skip embedded email addresses.
+         # An earlier scheme also split up the y in x@y on '.'.  Not splitting
+         # improved the f-n rate; the f-p rate didn't care either way.
          if n < 40 and '.' in word and word.count('@') == 1:
              p1, p2 = word.split('@')
              yield 'email name:' + p1
!             yield 'email addr:' + p2
  
          # If there are any high-bit chars,