[Spambayes-checkins] spambayes mboxutils.py,1.4,1.5

Tim Peters tim_one@users.sourceforge.net
Tue Nov 12 23:12:14 2002


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31150

Modified Files:
	mboxutils.py 
Log Message:
New utility function extract_headers(), for very simple-minded header
extraction.


Index: mboxutils.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/mboxutils.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** mboxutils.py	6 Nov 2002 01:57:39 -0000	1.4
--- mboxutils.py	12 Nov 2002 23:12:11 -0000	1.5
***************
*** 25,28 ****
--- 25,29 ----
  import mailbox
  import email.Message
+ import re
  
  class DirOfTxtFileMailbox:
***************
*** 119,120 ****
--- 120,164 ----
          msg.set_payload(obj)
      return msg
+ 
+ header_break_re = re.compile(r"\r?\n(\r?\n)")
+ 
+ def extract_headers(text):
+     """Very simple-minded header extraction:  prefix of text up to blank line.
+ 
+     A blank line is recognized via two adjacent line-ending sequences, where
+     a line-ending sequence is a newline optionally preceded by a carriage
+     return.
+ 
+     If no blank line is found, all of text is considered to be a potential
+     header section.  If a blank line is found, the text up to (but not
+     including) the blank line is considered to be a potential header section.
+ 
+     The potential header section is returned, unless it doesn't contain a
+     colon, in which case an empty string is returned.
+ 
+     >>> extract_headers("abc")
+     ''
+     >>> extract_headers("abc\\n\\n\\n")  # no colon
+     ''
+     >>> extract_headers("abc: xyz\\n\\n\\n")
+     'abc: xyz\\n'
+     >>> extract_headers("abc: xyz\\r\\n\\r\\n\\r\\n")
+     'abc: xyz\\r\\n'
+     >>> extract_headers("a: b\\ngibberish\\n\\nmore gibberish")
+     'a: b\\ngibberish\\n'
+     """
+ 
+     m = header_break_re.search(text)
+     if m:
+         eol = m.start(1)
+         text = text[:eol]
+     if ':' not in text:
+         text = ""
+     return text
+ 
+ def _test():
+     import doctest, mboxutils
+     return doctest.testmod(mboxutils)
+ 
+ if __name__ == "__main__":
+     _test()
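
For readers who want to try the new function without checking out the module, here is a minimal standalone sketch. The regular expression and function body are copied from the diff above; the sample message and the variable name `sample` are made up purely for illustration and are not part of the check-in.

```python
import re

# Same pattern as in the diff: two adjacent line endings, where each
# line ending is a newline optionally preceded by a carriage return.
# Group 1 captures the second line ending, i.e. the blank line itself.
header_break_re = re.compile(r"\r?\n(\r?\n)")

def extract_headers(text):
    """Return the prefix of text up to the first blank line, or ''
    if that prefix contains no colon."""
    m = header_break_re.search(text)
    if m:
        # Cut just before the blank line, keeping the line ending
        # that terminates the last header line.
        text = text[:m.start(1)]
    if ':' not in text:
        text = ""
    return text

# Hypothetical sample message (an assumption, for illustration only).
sample = "Subject: hi\r\nFrom: a@example.com\r\n\r\nbody: not a header\r\n"
print(repr(extract_headers(sample)))
# prints 'Subject: hi\r\nFrom: a@example.com\r\n'
```

Note that the colon test is applied only to the candidate header section, so a colon that appears in the body (as in the sample) does not affect the result. Text with no colon before the first blank line yields an empty string.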




