[Spambayes-checkins] spambayes Options.py,1.23,1.24 tokenizer.py,1.30,1.31
Skip Montanaro
montanaro@users.sourceforge.net
Sun, 22 Sep 2002 20:13:33 -0700
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv1989
Modified Files:
Options.py tokenizer.py
Log Message:
Added two new options: check_octets and octet_prefix_size. If check_octets
is True, any application/octet-stream parts will be tokenized simply by
returning the first octet_prefix_size characters of the undecoded
(base64-encoded) payload. For example, DOS/Windows executables seem to begin
with the string "TVqQA". If enabled, the token "octet:TVqQA" would be
returned for such sections, provided they had the appropriate content type
and transfer encoding.
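The "TVqQA" prefix follows from base64 arithmetic: DOS/Windows executables start with the two-byte "MZ" signature, and encoding the first few header bytes yields that string. A minimal sketch to check this, assuming a typical header whose first four bytes are 4D 5A 90 00 (the sample bytes are an illustrative assumption, not taken from the commit):

```python
import base64

# Assumed sample: the first four bytes of a typical DOS/Windows executable.
header = b"MZ\x90\x00"

# Base64-encode the bytes, as a mail client would for the attachment body.
encoded = base64.b64encode(header).decode("ascii")

# Default prefix length from Options.py in this commit.
octet_prefix_size = 5

# Build the token the same way the new tokenizer code formats it.
token = "octet:%s" % encoded[:octet_prefix_size]
print(token)
```

Because base64 maps every three input bytes to four output characters, a five-character prefix is fully determined by the first four bytes of the file.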
By default, check_octets is False, preserving preexisting behavior. I can't
test this very well since I've pretty ruthlessly purged viruses from my spam
corpus.
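For readers who want to try the idea outside the spambayes tree, here is a rough sketch using the Python 3 standard-library email package. It is not the committed code: the function name octet_tokens, its parameters, and the sample message are hypothetical, and it iterates over msg.walk() directly instead of building a Set as octetparts() does.

```python
import email

def octet_tokens(msg, check_octets=True, prefix_size=5):
    """Yield one token per application/octet-stream part (hypothetical helper)."""
    if not check_octets:
        return
    for part in msg.walk():
        if part.get_content_type() != "application/octet-stream":
            continue
        # decode=False returns the payload as transmitted (still base64 text).
        text = part.get_payload(decode=False)
        if isinstance(text, str):
            yield "octet:%s" % text[:prefix_size]

# Illustrative message; the attachment body is an assumed sample of a
# base64-encoded executable header, not data from a real corpus.
RAW = (
    "From: a@example.com\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "see attachment\n"
    "--XX\n"
    "Content-Type: application/octet-stream\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "TVqQAAMAAAAEAAAA//8AALgAAAAA\n"
    "--XX--\n"
)

msg = email.message_from_string(RAW)
tokens = list(octet_tokens(msg))
disabled = list(octet_tokens(msg, check_octets=False))
print(tokens)
```

With check_octets left off, the generator yields nothing, which mirrors the default of preserving the preexisting behavior.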
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** Options.py 21 Sep 2002 21:11:50 -0000 1.23
--- Options.py 23 Sep 2002 03:13:30 -0000 1.24
***************
*** 50,53 ****
--- 50,58 ----
ignore_redundant_html: False
+ # If true, the first few characters of application/octet-stream sections
+ # are used, undecoded. What 'few' means is decided by octet_prefix_size.
+ check_octets: False
+ octet_prefix_size: 5
+
# Generate tokens just counting the number of instances of each kind of
# header line, in a case-sensitive way.
***************
*** 193,196 ****
--- 198,203 ----
'count_all_header_lines': boolean_cracker,
'mine_received_headers': boolean_cracker,
+ 'check_octets': boolean_cracker,
+ 'octet_prefix_size': int_cracker,
'basic_header_tokenize': boolean_cracker,
'basic_header_tokenize_only': boolean_cracker,
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.30
retrieving revision 1.31
diff -C2 -d -r1.30 -r1.31
*** tokenizer.py 22 Sep 2002 06:58:36 -0000 1.30
--- tokenizer.py 23 Sep 2002 03:13:31 -0000 1.31
***************
*** 549,552 ****
--- 549,557 ----
msg.walk()))
+ def octetparts(msg):
+ return Set(filter(lambda part:
+ part.get_content_type() == 'application/octet-stream',
+ msg.walk()))
+
url_re = re.compile(r"""
(https? | ftp) # capture the protocol
***************
*** 992,996 ****
--- 997,1011 ----
part is ignored. Except in special cases, it's recommended to
leave that at its default of false.
+
+ If options.check_octets is True, the first few undecoded characters
+ of application/octet-stream parts of the message body become tokens.
"""
+
+ if options.check_octets:
+ # Find application/octet-stream parts of the body and tokenize
+ # the first few undecoded characters of each part
+ for part in octetparts(msg):
+ text = part.get_payload(decode=False)
+ yield "octet:%s" % text[:options.octet_prefix_size]
# Find, decode (base64, qp), and tokenize textual parts of the body.