[Spambayes-checkins] spambayes tokenizer.py,1.56,1.57

Tue Oct 29 03:44:00 2002

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv24468

Modified Files:
	tokenizer.py 
Log Message:
Try to repair the case where legit base64 is followed by random plain
text.  Python's base64 decoder is actually extremely forgiving, so much
so that it skips over any number of garbage lines looking for more
base64 to decode.  So when the base64 part *doesn't* end with '=' padding,
it goes nuts, effectively be too forgiving.


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.56
retrieving revision 1.57
diff -C2 -d -r1.56 -r1.57
*** tokenizer.py	28 Oct 2002 20:19:47 -0000	1.56
--- tokenizer.py	29 Oct 2002 03:43:58 -0000	1.57
***************
*** 814,817 ****
--- 814,859 ----
              yield 'content-transfer-encoding:' + x.lower()
  
+ # The base64 decoder is actually very forgiving, but flubs one case:
+ # if no padding is required (no trailing '='), it continues to read
+ # following lines as if they were still part of the base64 part.  We're
+ # actually stricter here.  The *point* is that some mailers tack plain
+ # text on to the end of base64-encoded text sections.
+ 
+ # Match a line of base64, up to & including the trailing newline.
+ # We allow for optional leading and trailing whitespace, and don't care
+ # about line length, but other than that are strict.  Group 1 is non-empty
+ # after a match iff the last significant char on the line is '='; in that
+ # case, it must be the last line of the base64 section.
+ base64_re = re.compile(r"""
+     [ \t]*
+     [a-zA-Z0-9+/]*
+     (=*)
+     [ \t]*
+     \r?
+     \n
+ """, re.VERBOSE)
+ 
+ def try_to_repair_damaged_base64(text):
+     import binascii
+     i = 0
+     while True:
+         # text[:i] looks like base64.  Does the line starting at i also?
+         m = base64_re.match(text, i)
+         if not m:
+             break
+         i = m.end()
+         if m.group(1):
+             # This line has a trailing '=' -- the base64 part is done.
+             break
+     base64text = ''
+     if i:
+         base64 = text[:i]
+         try:
+             base64text = binascii.a2b_base64(base64)
+         except:
+             # There's no point in tokenizing raw base64 gibberish.
+             pass
+     return base64text + text[i:]
+ 
  def breakdown_host(host):
      parts = host.split('.')
***************
*** 1154,1157 ****
--- 1196,1201 ----
                  yield "control: couldn't decode"
                  text = part.get_payload(decode=False)
+                 if text is not None:
+                     text = try_to_repair_damaged_base64(text)
  
              if text is None: