[Python-Dev] python hangs when parsing a bad-formed email
Alberto Casado Martín
alberto.casado.martin at gmail.com
Tue Apr 22 09:43:02 CEST 2008
Hi all,
First of all, sorry if this isn't the right list for this post,
and sorry for my English.
As the subject says, I'm having a problem with the attached email. When
I try to build an email object by reading the attached file, the Python
process hangs and uses all the CPU.
I have debugged my code to find where it happens, and it is in the
_parsegen method of the FeedParser class. I know that the email format
is wrong, but I don't know why Python hangs.
The code below shows where it hangs.
def _parsegen(self):
    # Create a new message and start by parsing headers.
    self._new_message()
    headers = []
    # Collect the headers, searching for a line that doesn't match the RFC
    # 2822 header or continuation pattern (including an empty line).
    for line in self._input:
        if line is NeedMoreData:
            yield NeedMoreData
            continue
        if not headerRE.match(line):
            # If we saw the RFC defined header/body separator
            # (i.e. newline), just throw it away. Otherwise the line is
            # part of the body so push it back.
            if not NLCRE.match(line):
                self._input.unreadline(line)
            break
        headers.append(line)
    # Done with the headers, so parse them and figure out what we're
    # supposed to see in the body of the message.
    self._parse_headers(headers)
    # Headers-only parsing is a backwards compatibility hack, which was
    # necessary in the older parser, which could throw errors. All
    # remaining lines in the input are thrown into the message body.
    if self._headersonly:
        lines = []
        while True:
            line = self._input.readline()
            if line is NeedMoreData:
                yield NeedMoreData
                continue
            if line == '':
                break
            lines.append(line)
        self._cur.set_payload(EMPTYSTRING.join(lines))
        return
    if self._cur.get_content_type() == 'message/delivery-status':
        # !!! IT HANGS AT THIS POINT AND STARTS TO USE ALL THE CPU FOR THE PROCESS !!!
        # message/delivery-status contains blocks of headers separated by
        # a blank line. We'll represent each header block as a separate
        # nested message object, but the processing is a bit different
        # than standard message/* types because there is no body for the
        # nested messages. A blank line separates the subparts.
        ...
        ...
        ...
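For reference, the structure that the message/delivery-status branch
handles can be sketched as follows. This is a minimal illustration written
for the Python 3 email package, not for the 2.x versions discussed in this
post; the host name and address are made up, and the sample is well formed,
so it does not reproduce the hang.

```python
import email

# A well-formed message/delivery-status body: blocks of headers
# separated by blank lines. Host and recipient are made-up examples.
raw = (
    "Content-Type: message/delivery-status\n"
    "\n"
    "Reporting-MTA: dns; mail.example.com\n"
    "\n"
    "Final-Recipient: rfc822; user@example.com\n"
    "Action: failed\n"
    "Status: 5.1.1\n"
)

msg = email.message_from_string(raw)

# Each header block becomes a nested message object with no body.
blocks = msg.get_payload()
for block in blocks:
    print(block.items())
```

Each entry in `blocks` is a message object holding only the headers of one
block, which matches the comment in the parser code quoted above.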
I have worked around the problem by adding this line to the _parse_headers method:
def _parse_headers(self, lines):
    # Passed a list of lines that make up the headers for the current msg
    lastheader = ''
    lastvalue = []
    for lineno, line in enumerate(lines):
        # Check for continuation
        if line[0] in ' \t':
            if not lastheader:
                # The first line of the headers was a continuation. This
                # is illegal, so let's note the defect, store the illegal
                # line, and ignore it for purposes of headers.
                defect = errors.FirstHeaderLineIsContinuationDefect(line)
                self._cur.defects.append(defect)
                continue
            # !!! ADDED LINE: IF THE CONTINUATION LINE IS NOT EMPTY,
            # ADD THE LINE TO THE HEADER !!!
            if line.strip() != '':
                lastvalue.append(line)
            continue
        if lastheader:
            ...
            ...
            ...
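The effect of the added check can be looked at in isolation. The sketch
below is only an illustration: `drop_blank_continuations` is a hypothetical
helper, not part of the email package, and it is simpler than the real
method (it does not record defects). It assumes the header lines are
already collected in a list, as they are when _parse_headers is called.

```python
def drop_blank_continuations(lines):
    """Return header lines without whitespace-only continuation lines.

    Hypothetical helper that mirrors the `line.strip() != ''` check
    from the workaround, applied to a plain list of header lines.
    """
    kept = []
    for line in lines:
        # A continuation line starts with a space or a tab.
        is_continuation = line[:1] in (' ', '\t')
        if is_continuation and line.strip() == '':
            # Nothing but whitespace on this continuation line: skip it.
            continue
        kept.append(line)
    return kept


# Made-up sample: a header followed by two blank continuation lines.
headers = [
    "Content-Type: text/plain;\n",
    "   \n",
    "\t\n",
    " charset=us-ascii\n",
]
cleaned = drop_blank_continuations(headers)
print(cleaned)
```

Only the whitespace-only continuation lines are dropped; continuation
lines that carry text are kept unchanged.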
I don't know why it hangs, and I'm not sure why it works with this line.
I have tried to parse this email with Python 2.3.3 on SunOS (gcc) and
with Python 2.5.1 on SunOS (gcc), Windows XP, and SUSE Linux 10, and I
always get the same result.
bash-3.00$ python
Python 2.5.1 (r251:54863, Feb 28 2008, 07:48:25)
[GCC 3.4.6] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> fp = open('raro.txt')
>>> mail = email.message_from_file(fp)
(it never returns)
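One way to check a file like this without losing the interactive session
is to run the parse in a child process with a time limit. This is a
minimal sketch, assuming Python 3.5 or later (`subprocess.run` and its
`timeout` argument do not exist in the 2.x versions mentioned above);
`parses_within` is a hypothetical name chosen for the example.

```python
import subprocess
import sys

# Code run by the child process: parse the file named on its command line.
_CHILD_CODE = (
    "import email, sys\n"
    "with open(sys.argv[1]) as fp:\n"
    "    email.message_from_file(fp)\n"
)


def parses_within(path, seconds=5.0):
    """Return True if the file parses before the timeout, else False.

    Hypothetical helper. A child process that fails for another reason
    (for example a missing file) raises subprocess.CalledProcessError.
    """
    try:
        subprocess.run(
            [sys.executable, "-c", _CHILD_CODE, path],
            timeout=seconds,  # the child is killed when this expires
            check=True,
        )
    except subprocess.TimeoutExpired:
        return False
    return True
```

With this helper, `parses_within('raro.txt')` would return False for an
input that keeps the parser busy longer than the time limit, instead of
blocking the caller.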
Can someone tell me what is happening?
Best Regards.
Alberto Casado.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: raro.txt
URL: <http://mail.python.org/pipermail/python-dev/attachments/20080422/46c9e000/attachment-0001.txt>