[Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.26,1.27

Tim Peters tim_one@users.sourceforge.net
Tue Nov 12 23:33:48 2002


Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv11116/Outlook2000

Modified Files:
	msgstore.py 
Log Message:
_GetMessageText():  Whatever the value of the headers property, stop
paying attention to it after the first blank line, and don't believe it
at all if it doesn't contain a colon.  Cheap trick to worm around the
problems some people have reported with Outlook returning multiple header
sections here (including internal MIME armor with empty bodies).
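
For illustration only, here is a minimal standalone sketch of the trick
described above. The function body mirrors the extract_headers() added in
this checkin; the sample strings and the names sample, kept and no_colon
are made up for the example and are not taken from any real Outlook
message.

```python
import re

# Same pattern as in the patch: one line ending followed directly by
# another, where a line ending is "\n" optionally preceded by "\r".
header_break_re = re.compile(r"\r?\n(\r?\n)")

def extract_headers(text):
    """Return the prefix of text up to the first blank line, or "" if
    that prefix contains no colon at all."""
    m = header_break_re.search(text)
    if m:
        # Keep everything up to (not including) the blank line itself.
        text = text[:m.start(1)]
    if ':' not in text:
        text = ""
    return text

# Hypothetical "headers" value with a second, MIME-armor section after
# the first blank line (invented data, for illustration).
sample = (
    "From: someone@example.com\r\n"
    "Subject: hello\r\n"
    "\r\n"
    "------=_NextPart_000\r\n"
    "Content-Type: text/plain\r\n"
    "\r\n"
)

kept = extract_headers(sample)
# A value with no colon anywhere is not believed at all.
no_colon = extract_headers("just some text with no header fields\n\n")

print(repr(kept))
print(repr(no_colon))
```

Only the first section survives in kept; the interior MIME section is
dropped, and the colon-free value collapses to an empty string.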


Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.26
retrieving revision 1.27
diff -C2 -d -r1.26 -r1.27
*** msgstore.py	12 Nov 2002 23:19:33 -0000	1.26
--- msgstore.py	12 Nov 2002 23:33:45 -0000	1.27
***************
*** 1,5 ****
  from __future__ import generators
  
! import sys, os
  
  try:
--- 1,5 ----
  from __future__ import generators
  
! import sys, os, re
  
  try:
***************
*** 10,13 ****
--- 10,53 ----
  
  
+ # XXX
+ # import mboxutils  doesn't work at this point.  The extract_headers function
+ # here is a copy-and-paste.
+ header_break_re = re.compile(r"\r?\n(\r?\n)")
+ 
+ def extract_headers(text):
+     """Very simple-minded header extraction:  prefix of text up to blank line.
+ 
+     A blank line is recognized via two adjacent line-ending sequences, where
+     a line-ending sequence is a newline optionally preceded by a carriage
+     return.
+ 
+     If no blank line is found, all of text is considered to be a potential
+     header section.  If a blank line is found, the text up to (but not
+     including) the blank line is considered to be a potential header section.
+ 
+     The potential header section is returned, unless it doesn't contain a
+     colon, in which case an empty string is returned.
+ 
+     >>> extract_headers("abc")
+     ''
+     >>> extract_headers("abc\\n\\n\\n")  # no colon
+     ''
+     >>> extract_headers("abc: xyz\\n\\n\\n")
+     'abc: xyz\\n'
+     >>> extract_headers("abc: xyz\\r\\n\\r\\n\\r\\n")
+     'abc: xyz\\r\\n'
+     >>> extract_headers("a: b\\ngibberish\\n\\nmore gibberish")
+     'a: b\\ngibberish\\n'
+     """
+ 
+     m = header_break_re.search(text)
+     if m:
+         eol = m.start(1)
+         text = text[:eol]
+     if ':' not in text:
+         text = ""
+     return text
+ 
+ 
  # Abstract definition - can be moved out when we have more than one sub-class <wink>
  # External interface to this module is almost exclusively via a "folder ID"
***************
*** 384,387 ****
--- 424,434 ----
          html = self._GetPotentiallyLargeStringProp(prop_ids[2], data[2])
          has_attach = data[3][1]
+ 
+         # Some Outlooks deliver a strange notion of headers, including
+         # interior MIME armor.  To prevent later errors, try to get rid
+         # of stuff now that can't possibly be parsed as "real" (SMTP)
+         # headers.
+         headers = extract_headers(headers)
+ 
          # Mail delivered internally via Exchange Server etc may not have
          # headers - fake some up.
***************
*** 392,395 ****
--- 439,443 ----
          elif headers.startswith("Microsoft Mail"):
              headers = "X-MS-Mail-Gibberish: " + headers
+ 
          if not html and not body:
              # Only ever seen this for "multipart/signed" messages, so




