[Spambayes-checkins] spambayes Options.py,1.23,1.24 tokenizer.py,1.30,1.31

Skip Montanaro montanaro@users.sourceforge.net
Sun, 22 Sep 2002 20:13:33 -0700


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv1989

Modified Files:
	Options.py tokenizer.py 
Log Message:
Added two new options: check_octets and octet_prefix_size.  If check_octets
is True, each application/octet-stream part is tokenized simply by
returning the first octet_prefix_size characters of its undecoded
(typically base64-encoded) payload.  For example, DOS/Windows executables
seem to begin with the string "TVqQA".  If enabled, the token "octet:TVqQA"
would be returned for such sections, provided they have the appropriate
content type and transfer encoding.

By default, check_octets is False, preserving preexisting behavior.  I can't
test this very well since I've pretty ruthlessly purged viruses from my spam
corpus.
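
As an illustration of the behavior described in the log message, here is a
minimal sketch written against the Python 3 standard-library email package
rather than the code in this checkin.  The octet_tokens helper, the
OCTET_PREFIX_SIZE constant and the sample message are hypothetical names made
up for the example; only the "octet:" token format and the default size of 5
come from the change itself.

```python
import base64
from email import message_from_string

# Assumption: mirrors the octet_prefix_size default from Options.py.
OCTET_PREFIX_SIZE = 5

# The first six bytes of a typical DOS/Windows executable ("MZ" header),
# base64-encoded the way a mail client would encode an attachment.
payload = base64.b64encode(b"MZ\x90\x00\x03\x00").decode("ascii")

# Hypothetical two-part message: one text part, one octet-stream part.
RAW = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--XX\n"
    "Content-Type: application/octet-stream\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + payload + "\n"
    "--XX--\n"
)


def octet_tokens(msg, prefix_size=OCTET_PREFIX_SIZE):
    """Yield one token per application/octet-stream part of msg."""
    for part in msg.walk():
        if part.get_content_type() == "application/octet-stream":
            # decode=False keeps the payload as transmitted (still base64).
            text = part.get_payload(decode=False)
            yield "octet:%s" % text[:prefix_size]


tokens = list(octet_tokens(message_from_string(RAW)))
print(tokens)
```

Because the payload is left undecoded, the token depends on the transfer
encoding: the same attachment sent with a different encoding would produce a
different prefix.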


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** Options.py	21 Sep 2002 21:11:50 -0000	1.23
--- Options.py	23 Sep 2002 03:13:30 -0000	1.24
***************
*** 50,53 ****
--- 50,58 ----
  ignore_redundant_html: False
  
+ # If true, the first few characters of application/octet-stream sections
+ # are used, undecoded.  What 'few' means is decided by octet_prefix_size.
+ check_octets: False
+ octet_prefix_size: 5
+ 
  # Generate tokens just counting the number of instances of each kind of
  # header line, in a case-sensitive way.
***************
*** 193,196 ****
--- 198,203 ----
                    'count_all_header_lines': boolean_cracker,
                    'mine_received_headers': boolean_cracker,
+                   'check_octets': boolean_cracker,
+                   'octet_prefix_size': int_cracker,
                    'basic_header_tokenize': boolean_cracker,
                    'basic_header_tokenize_only': boolean_cracker,

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.30
retrieving revision 1.31
diff -C2 -d -r1.30 -r1.31
*** tokenizer.py	22 Sep 2002 06:58:36 -0000	1.30
--- tokenizer.py	23 Sep 2002 03:13:31 -0000	1.31
***************
*** 549,552 ****
--- 549,557 ----
                            msg.walk()))
  
+ def octetparts(msg):
+     return Set(filter(lambda part:
+                       part.get_content_type() == 'application/octet-stream',
+                       msg.walk()))
+ 
  url_re = re.compile(r"""
      (https? | ftp)  # capture the protocol
***************
*** 992,996 ****
--- 997,1011 ----
          part is ignored.  Except in special cases, it's recommended to
          leave that at its default of false.
+ 
+         If options.check_octets is True, the first few undecoded characters
+         of application/octet-stream parts of the message body become tokens.
          """
+ 
+         if options.check_octets:
+             # Find application/octet-stream parts of the body and tokenize
+             # the first few undecoded characters of each one
+             for part in octetparts(msg):
+                 text = part.get_payload(decode=False)
+                 yield "octet:%s" % text[:options.octet_prefix_size]
  
          # Find, decode (base64, qp), and tokenize textual parts of the body.