[Spambayes-checkins] spambayes tokenizer.py,1.64,1.65
Tim Peters
tim_one@users.sourceforge.net
Mon Nov 11 23:26:21 2002
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv10237
Modified Files:
tokenizer.py
Log Message:
An idea from Anthony Baxter: decode Subject lines, so that they're
tokenized in decoded form, and so that they generate charset tokens too.
This had minor good effects in both our tests.
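
For reference, a small illustration of what the decoding call returns. This is a sketch written against the Python 3 spelling, email.header.decode_header, which is an assumption; the commit itself imports the older email.Header module name. The sample Subject values are made up for illustration.

```python
from email.header import decode_header

# A Subject value holding one RFC 2047 encoded word (illustrative sample).
encoded = '=?iso-8859-1?q?p=F6stal?='

# decode_header returns a list of (decoded_text, charset) pairs.
# In Python 3 the text of a decoded part is bytes.
parts = decode_header(encoded)
text, charset = parts[0]

# A header with no encoded words comes back unchanged, with charset None.
plain_parts = decode_header('Hello world')
```

The charset half of each pair is what the change turns into a 'subjectcharset:' token, and the None case is why the new code checks the charset before yielding one.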
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.64
retrieving revision 1.65
diff -C2 -d -r1.64 -r1.65
*** tokenizer.py 8 Nov 2002 04:06:24 -0000 1.64
--- tokenizer.py 11 Nov 2002 23:26:18 -0000 1.65
***************
*** 5,8 ****
--- 5,9 ----
  import email
+ import email.Header
  import email.Message
  import email.Errors
***************
*** 1054,1062 ****
          # but real benefit to keeping case intact in this specific context.
          x = msg.get('subject', '')
!         for w in subject_word_re.findall(x):
!             for t in tokenize_word(w):
!                 yield 'subject:' + t
!         for w in punctuation_run_re.findall(x):
!             yield 'subject:' + w

          # Dang -- I can't use Sender:. If I do,
--- 1055,1066 ----
          # but real benefit to keeping case intact in this specific context.
          x = msg.get('subject', '')
!         for x, subjcharset in email.Header.decode_header(x):
!             if subjcharset is not None:
!                 yield 'subjectcharset:' + subjcharset
!             for w in subject_word_re.findall(x):
!                 for t in tokenize_word(w):
!                     yield 'subject:' + t
!             for w in punctuation_run_re.findall(x):
!                 yield 'subject:' + w

          # Dang -- I can't use Sender:. If I do,
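
The revised loop can be tried in isolation with something like the sketch below. It is not the code from tokenizer.py: subject_tokens is a hypothetical wrapper, the two regular expressions and tokenize_word are simplified stand-ins for definitions that this diff does not show, and it uses the Python 3 module name email.header. It also decodes each part to text before matching, because Python 3 returns bytes for decoded parts; the committed code does not do this.

```python
import re
from email.header import decode_header

# Hypothetical stand-ins: the real patterns live elsewhere in tokenizer.py.
subject_word_re = re.compile(r"[\w$.%]+")
punctuation_run_re = re.compile(r"\W+")


def tokenize_word(word):
    # Stand-in: the real tokenize_word() may split or transform words.
    yield word


def subject_tokens(raw_subject):
    """Yield subject tokens, plus one charset token per encoded part."""
    for chunk, charset in decode_header(raw_subject):
        if charset is not None:
            yield 'subjectcharset:' + charset
        if isinstance(chunk, bytes):
            # Python 3 hands back bytes for headers with encoded words.
            try:
                chunk = chunk.decode(charset or 'ascii', 'replace')
            except LookupError:
                # Unknown charset name: fall back to a lossless decode.
                chunk = chunk.decode('latin-1')
        for w in subject_word_re.findall(chunk):
            for t in tokenize_word(w):
                yield 'subject:' + t
        for w in punctuation_run_re.findall(chunk):
            yield 'subject:' + w


encoded_tokens = list(subject_tokens('=?iso-8859-1?q?FREE_p=F6stal?='))
plain_tokens = list(subject_tokens('Hello world'))
```

As in the diff, the word and punctuation passes both run inside the loop, once per decoded part, so a Subject line made of several encoded words yields tokens for each part in turn.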