[Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.26,1.27
Tim Peters
tim_one@users.sourceforge.net
Tue Nov 12 23:33:48 2002
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv11116/Outlook2000
Modified Files:
msgstore.py
Log Message:
_GetMessageText(): Whatever the value of the headers property, stop
paying attention to it after the first blank line, and don't believe it
at all if it doesn't contain a colon. Cheap trick to worm around the
problems some people have reported with Outlook returning multiple header
sections here (including internal MIME armor with empty bodies).
Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.26
retrieving revision 1.27
diff -C2 -d -r1.26 -r1.27
*** msgstore.py 12 Nov 2002 23:19:33 -0000 1.26
--- msgstore.py 12 Nov 2002 23:33:45 -0000 1.27
***************
*** 1,5 ****
from __future__ import generators
! import sys, os
try:
--- 1,5 ----
from __future__ import generators
! import sys, os, re
try:
***************
*** 10,13 ****
--- 10,53 ----
+ # XXX
+ # import mboxutils doesn't work at this point. The extract_headers function
+ # here is a copy-and-paste.
+ header_break_re = re.compile(r"\r?\n(\r?\n)")
+
+ def extract_headers(text):
+     """Very simple-minded header extraction: prefix of text up to blank line.
+
+     A blank line is recognized via two adjacent line-ending sequences, where
+     a line-ending sequence is a newline optionally preceded by a carriage
+     return.
+
+     If no blank line is found, all of text is considered to be a potential
+     header section. If a blank line is found, the text up to (but not
+     including) the blank line is considered to be a potential header section.
+
+     The potential header section is returned, unless it doesn't contain a
+     colon, in which case an empty string is returned.
+
+     >>> extract_headers("abc")
+     ''
+     >>> extract_headers("abc\\n\\n\\n") # no colon
+     ''
+     >>> extract_headers("abc: xyz\\n\\n\\n")
+     'abc: xyz\\n'
+     >>> extract_headers("abc: xyz\\r\\n\\r\\n\\r\\n")
+     'abc: xyz\\r\\n'
+     >>> extract_headers("a: b\\ngibberish\\n\\nmore gibberish")
+     'a: b\\ngibberish\\n'
+     """
+
+     m = header_break_re.search(text)
+     if m:
+         eol = m.start(1)
+         text = text[:eol]
+     if ':' not in text:
+         text = ""
+     return text
+
+
# Abstract definition - can be moved out when we have more than one sub-class <wink>
# External interface to this module is almost exclusively via a "folder ID"
***************
*** 384,387 ****
--- 424,434 ----
html = self._GetPotentiallyLargeStringProp(prop_ids[2], data[2])
has_attach = data[3][1]
+
+ # Some Outlooks deliver a strange notion of headers, including
+ # interior MIME armor. To prevent later errors, try to get rid
+ # of stuff now that can't possibly be parsed as "real" (SMTP)
+ # headers.
+ headers = extract_headers(headers)
+
# Mail delivered internally via Exchange Server etc may not have
# headers - fake some up.
***************
*** 392,395 ****
--- 439,443 ----
elif headers.startswith("Microsoft Mail"):
headers = "X-MS-Mail-Gibberish: " + headers
+
if not html and not body:
# Only ever seen this for "multipart/signed" messages, so
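
For readers who want to try the trimming rule from the log message outside of Outlook, here is a minimal standalone sketch. It assumes a current Python 3 interpreter rather than the Python this patch was written for, and the sample header text (host name, MIME boundary) is invented for illustration; it is not taken from any real Outlook message.

```python
import re

# Same pattern as in the patch: two adjacent line endings, where each
# line ending is "\n" optionally preceded by "\r".  Group 1 is the
# second line ending, i.e. the blank line itself.
header_break_re = re.compile(r"\r?\n(\r?\n)")

def extract_headers(text):
    """Return the text before the first blank line, or "" if that
    prefix contains no colon."""
    m = header_break_re.search(text)
    if m:
        # Cut just before the blank line; the line ending that
        # terminates the last header line is kept.
        text = text[:m.start(1)]
    if ":" not in text:
        text = ""
    return text

# Hypothetical example of the problem described in the log message:
# a real header section followed by interior MIME armor.
raw = (
    "Received: from mail.example.com\r\n"
    "Subject: hello\r\n"
    "\r\n"
    "------=_NextPart_000\r\n"
    "Content-Type: text/plain\r\n"
    "\r\n"
)

cleaned = extract_headers(raw)
no_colon = extract_headers("no header fields here\n\nbody text")
```

With this input, cleaned holds only the first two header lines, and no_colon is the empty string because the candidate header section has no colon in it.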