[Spambayes-checkins] spambayes tokenizer.py,1.65,1.66 Options.py,1.68,1.69

Anthony Baxter anthonybaxter@users.sourceforge.net
Tue Nov 12 06:21:41 2002


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv16090

Modified Files:
	tokenizer.py Options.py 
Log Message:
New tokenizer option 'address_headers'. It allows headers other than
'from' (e.g. 'to' or 'cc') to be mined for email addresses and names.

By default, it's just set to 'from' for now.

In addition, address headers (including 'from') are now decoded and parsed
as address lists, rather than being split on whitespace.

This shows quite a nice improvement for me.
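
To illustrate the difference between a whitespace split and real address
parsing, here is a minimal sketch. It assumes a modern Python 3, where the
module is spelled email.utils rather than the email.Utils used in this
checkin, and the header value is invented for the example.

```python
from email.utils import getaddresses

# Made-up header value, for illustration only.
raw = '"Jane Doe" <jane@example.com>, =?iso-8859-1?q?J=F6rg?= <jorg@example.org>'

# Old approach: lowercase the whole header and split on whitespace.
# Quotes, angle brackets and commas stay glued to the words.
naive = raw.lower().split()

# New approach: parse the header into (realname, address) pairs.
# Note that getaddresses() does not decode RFC 2047 encoded words;
# that is a separate step (decode_header) in the checked-in code.
parsed = getaddresses([raw])

print(naive)
print(parsed)
```

The naive split leaves pieces such as '"jane' and '<jane@example.com>,',
while the parsed form separates each real name from its address.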


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.65
retrieving revision 1.66
diff -C2 -d -r1.65 -r1.66
*** tokenizer.py	11 Nov 2002 23:26:18 -0000	1.65
--- tokenizer.py	12 Nov 2002 06:21:38 -0000	1.66
***************
*** 7,10 ****
--- 7,12 ----
  import email.Header
  import email.Message
+ import email.Header
+ import email.Utils
  import email.Errors
  import re
***************
*** 1072,1082 ****
          #               # one (smalls wins & losses across runs, overall
          #               # not significant), so leaving it out
!         for field in ('from',):
!             prefix = field + ':'
!             x = msg.get(field, 'none').lower()
!             for w in x.split():
!                 for t in tokenize_word(w):
!                     yield prefix + t
! 
          # To:
          # Cc:
--- 1074,1096 ----
          #               # one (smalls wins & losses across runs, overall
          #               # not significant), so leaving it out
!         # To:, Cc:      # These can help, if your ham and spam are sourced
!         #               # from the same location. If not, they'll be horrible.
!         for field in options.address_headers:
!             addrlist = msg.get_all(field, [])
!             if not addrlist:
!                 yield field + ":none"
!             for addrs in addrlist:
!                 for rname,ename in email.Utils.getaddresses([addrs]):
!                     if rname:
!                         for rname,rcharset in email.Header.decode_header(rname):
!                             for w in rname.lower().split():
!                                 for t in tokenize_word(w):
!                                     yield field+'realname:'+t
!                             if rcharset is not None:
!                                 yield field+'charset:'+rcharset
!                     if ename:
!                         for w in ename.lower().split('@'):
!                             for t in tokenize_word(w):
!                                 yield field+'email:'+t
          # To:
          # Cc:
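
The new loop above can be approximated as a standalone sketch. It is not
the checked-in code: it assumes Python 3 module names (email.header,
email.utils), decodes bytes explicitly because Python 3's decode_header
returns bytes for encoded words, and uses a hypothetical pass-through
tokenize_word in place of the real one in tokenizer.py. The function name
address_tokens and the sample message are also invented for the example.

```python
from email import message_from_string
from email.header import decode_header
from email.utils import getaddresses


def tokenize_word(word):
    # Hypothetical stand-in for the real tokenize_word in tokenizer.py.
    yield word


def address_tokens(msg, address_headers=("from",)):
    for field in address_headers:
        addrlist = msg.get_all(field, [])
        if not addrlist:
            # Header absent: emit a marker token instead.
            yield field + ":none"
        for addrs in addrlist:
            for rname, ename in getaddresses([addrs]):
                if rname:
                    for part, charset in decode_header(rname):
                        # Python 3 returns bytes for encoded words.
                        if isinstance(part, bytes):
                            part = part.decode(charset or "ascii", "replace")
                        for w in part.lower().split():
                            for t in tokenize_word(w):
                                yield field + "realname:" + t
                        if charset is not None:
                            yield field + "charset:" + charset
                if ename:
                    # Split the address into local part and domain.
                    for w in ename.lower().split("@"):
                        for t in tokenize_word(w):
                            yield field + "email:" + t


msg = message_from_string(
    "From: =?iso-8859-1?q?J=F6rg?= <jorg@example.org>\n"
    "Subject: hello\n\n"
    "body\n"
)
tokens = list(address_tokens(msg, ("from", "cc")))
print(tokens)
```

With this sample message the 'from' header yields realname, charset and
email tokens, and the missing 'cc' header yields the single token cc:none.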

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.68
retrieving revision 1.69
diff -C2 -d -r1.68 -r1.69
*** Options.py	11 Nov 2002 01:59:06 -0000	1.68
--- Options.py	12 Nov 2002 06:21:38 -0000	1.69
***************
*** 90,93 ****
--- 90,101 ----
  mine_received_headers: False
  
+ # Mine the following address headers. If you have mixed source corpuses
+ # (as opposed to a mixed sauce walrus, which is delicious!) then you
+ # probably don't want to use 'to' or 'cc')
+ # Address headers will be decoded, and will generate charset tokens as
+ # well as the real address.
+ # others to consider: to, cc, reply-to, errors-to, sender, ...
+ address_headers: from
+ 
  # If legitimate mail contains things that look like text to the tokenizer
  # and turning turning off this option helps (perhaps binary attachments get
***************
*** 340,343 ****
--- 348,352 ----
  all_options = {
      'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())),
+                   'address_headers': ('get', lambda s: Set(s.split())),
                    'count_all_header_lines': boolean_cracker,
                    'record_header_absence': boolean_cracker,
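
As a rough illustration of how the new all_options entry treats the option
value, the sketch below splits a whitespace-separated string into a set.
It assumes Python 3's builtin set in place of the Set class used in the
diff, and crack_header_list is a hypothetical name for the lambda.

```python
def crack_header_list(value):
    # Hypothetical name; mirrors `lambda s: Set(s.split())` from the diff,
    # with the builtin set standing in for Set.
    return set(value.split())


default = crack_header_list("from")          # the shipped default
extended = crack_header_list("from to cc")   # mining more headers

print(default)
print(extended)
```

Because the result is a set, duplicate header names collapse and the
iteration order of the fields is not guaranteed.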




