[Python-Dev] python hangs when parsing a bad-formed email
Alberto Casado Martín
alberto.casado.martin at gmail.com
Tue Apr 22 09:43:02 CEST 2008
First of all, sorry if this isn't the right list to post this to,
and sorry for my English.
As the subject says, I'm having problems with the attached email: when
I try to get an email object by reading the attached file, the Python
process hangs and uses all the CPU.
I have debugged my code to find where it happens, and found that it is
in the _parsegen method of the FeedParser class. I know that the email
format is wrong, but I don't know why Python hangs.
Below I paste the code, showing where it hangs.
# Create a new message and start by parsing headers.
headers = []
# Collect the headers, searching for a line that doesn't match the RFC
# 2822 header or continuation pattern (including an empty line).
for line in self._input:
    if line is NeedMoreData:
    if not headerRE.match(line):
        # If we saw the RFC defined header/body separator
        # (i.e. newline), just throw it away. Otherwise the line is
        # part of the body so push it back.
        if not NLCRE.match(line):
# Done with the headers, so parse them and figure out what we're
# supposed to see in the body of the message.
# Headers-only parsing is a backwards compatibility hack, which was
# necessary in the older parser, which could throw errors. All
# remaining lines in the input are thrown into the message body.
lines = []
line = self._input.readline()
if line is NeedMoreData:
if line == '':
if self._cur.get_content_type() == 'message/delivery-status':
!!!!!! AT THIS POINT IT HANGS AND STARTS TO USE ALL THE CPU FOR THE PROCESS
    # message/delivery-status contains blocks of headers separated by
    # a blank line. We'll represent each header block as a separate
    # nested message object, but the processing is a bit different
    # than standard message/* types because there is no body for the
    # nested messages. A blank line separates the subparts.
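To illustrate what that last comment describes, here is a small sketch using a well-formed message/delivery-status body (not the attached file). It assumes Python 3's email package; the host name, address and status values are made up for the example.

```python
import email

# A made-up, well-formed delivery-status body: two header blocks
# separated by a blank line, with no body after either block.
raw = (
    "Content-Type: message/delivery-status\n"
    "\n"
    "Reporting-MTA: dns; mail.example.com\n"
    "\n"
    "Final-Recipient: rfc822; user@example.com\n"
    "Action: failed\n"
    "Status: 5.1.1\n"
)

msg = email.message_from_string(raw)

# Each header block becomes its own nested message object.
blocks = msg.get_payload()
for block in blocks:
    print(block.items())
```

Each element of blocks is a message object holding only headers, which is why this content type takes a separate branch in _parsegen.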
I have worked around the problem by adding this line in the _parse_headers method:
def _parse_headers(self, lines):
    # Passed a list of lines that make up the headers for the current msg
    lastheader = ''
    lastvalue = []
    for lineno, line in enumerate(lines):
        # Check for continuation
        if line[0] in ' \t':
            if not lastheader:
                # The first line of the headers was a continuation. This
                # is illegal, so let's note the defect, store the illegal
                # line, and ignore it for purposes of headers.
                defect = errors.FirstHeaderLineIsContinuationDefect(line)
            if line.strip() != '':
!!!!!!! IF THE CONTINUATION LINE IS NOT EMPTY, ADD THE LINE TO THE HEADER.
I don't know why it hangs, and I'm not sure why it works with this line.
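For reference, this is how the parser treats continuation lines on a small made-up message, including one that appears before any header. It is only a sketch, assuming Python 3's email package, and it does not reproduce the hang.

```python
import email
from email import errors

# Made-up input: the first line starts with a space, so it looks like a
# continuation line even though no header precedes it. The Subject
# header has a legitimate continuation line.
raw = (
    " stray continuation line\n"
    "Subject: hello\n"
    " world\n"
    "\n"
    "body\n"
)

msg = email.message_from_string(raw)

# The parser records defects on the message instead of raising.
defect_types = [type(d) for d in msg.defects]
print(defect_types)
print(repr(msg["Subject"]))
```

The stray first line is recorded as a FirstHeaderLineIsContinuationDefect and skipped, while the continuation under Subject is joined onto that header's value.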
I have tried to parse this email with Python 2.3.3 (SunOS, gcc) and
Python 2.5.1 (SunOS, gcc, Windows XP, and Linux SUSE 10), and I
always get the same result.
Python 2.5.1 (r251:54863, Feb 28 2008, 07:48:25)
[GCC 3.4.6] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> fp = open('raro.txt')
>>> mail = email.message_from_file(fp)
I don't know if someone can tell me what is happening....
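One way to check for the hang without locking up an interactive session is to run the parse in a child interpreter with a timeout. This is only a sketch, assuming Python 3.7 or later rather than the 2.x versions listed above; parse_with_timeout, CHILD_CODE and the 10-second default are names and values made up for illustration.

```python
import subprocess
import sys

# Code run by the child interpreter: parse the message text read from
# stdin and print its content type.
CHILD_CODE = (
    "import email, sys\n"
    "msg = email.message_from_string(sys.stdin.read())\n"
    "print(msg.get_content_type())\n"
)


def parse_with_timeout(raw, seconds=10):
    """Hypothetical helper: return the content type, or None on timeout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", CHILD_CODE],
            input=raw,
            capture_output=True,
            text=True,
            timeout=seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so no
        # CPU-bound process is left behind.
        return None
    result.check_returncode()
    return result.stdout.strip()
```

Calling parse_with_timeout(open('raro.txt').read()) would return None if the parser does not finish within the timeout, and the content type otherwise.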
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...