[Spambayes-checkins] spambayes tokenizer.py,1.64,1.65

Mon Nov 11 23:26:21 2002

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv10237

Modified Files:
	tokenizer.py 
Log Message:
An idea from Anthony Baxter:  decode RFC 2047 encoded Subject lines, so
that they're tokenized in decoded form, and so that each encoded chunk
also generates a 'subjectcharset:' token naming its charset.
This had minor good effects in both our tests.
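
For readers on a current Python, the idea can be sketched roughly as below.
This is an illustration only, not the code from this checkin: it uses the
Python 3 spelling email.header.decode_header (the diff uses the Python 2
era email.Header), WORD_RE is a hypothetical stand-in for the tokenizer's
subject_word_re, and the tokenize_word and punctuation_run_re steps are
left out. In Python 3, decode_header may return bytes chunks, so the sketch
decodes them before matching, which the original did not need to do.

```python
import re
from email.header import decode_header

# Hypothetical stand-in for the tokenizer's subject_word_re.
WORD_RE = re.compile(r"\w+")


def subject_tokens(raw_subject):
    """Yield charset and word tokens for a raw Subject header value."""
    for chunk, charset in decode_header(raw_subject):
        # Encoded chunks carry a charset name; plain text has None.
        if charset is not None:
            yield 'subjectcharset:' + charset
        # Python 3 returns bytes whenever any encoded word was present.
        if isinstance(chunk, bytes):
            try:
                chunk = chunk.decode(charset or 'ascii', errors='replace')
            except LookupError:
                # Unknown charset name: fall back to a lossless byte mapping.
                chunk = chunk.decode('latin-1')
        for w in WORD_RE.findall(chunk):
            yield 'subject:' + w


if __name__ == '__main__':
    for tok in subject_tokens('=?iso-8859-1?q?Caf=E9?= menu'):
        print(tok)
```

An encoded Subject thus contributes both its decoded words and a clue about
which charset the sender used, while a plain ASCII Subject produces only
'subject:' tokens.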


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.64
retrieving revision 1.65
diff -C2 -d -r1.64 -r1.65
*** tokenizer.py	8 Nov 2002 04:06:24 -0000	1.64
--- tokenizer.py	11 Nov 2002 23:26:18 -0000	1.65
***************
*** 5,8 ****
--- 5,9 ----
  
  import email
+ import email.Header
  import email.Message
  import email.Errors
***************
*** 1054,1062 ****
          # but real benefit to keeping case intact in this specific context.
          x = msg.get('subject', '')
!         for w in subject_word_re.findall(x):
!             for t in tokenize_word(w):
!                 yield 'subject:' + t
!         for w in punctuation_run_re.findall(x):
!             yield 'subject:' + w
  
          # Dang -- I can't use Sender:.  If I do,
--- 1055,1066 ----
          # but real benefit to keeping case intact in this specific context.
          x = msg.get('subject', '')
!         for x, subjcharset in email.Header.decode_header(x):
!             if subjcharset is not None:
!                 yield 'subjectcharset:' + subjcharset
!             for w in subject_word_re.findall(x):
!                 for t in tokenize_word(w):
!                     yield 'subject:' + t
!             for w in punctuation_run_re.findall(x):
!                 yield 'subject:' + w
  
          # Dang -- I can't use Sender:.  If I do,