[Python-Dev] Bug: xml.dom.pulldom never gives you END_DOCUMENT events with an Expat parser
Thomas Broyer
t.broyer at gmail.com
Mon Jun 12 00:27:00 CEST 2006
Hi,
First, sorry for posting this here, I closed my SourceForge account a
few months ago and I can't get it reopened...
I'm using python 2.2.1 but a diff on SVN showed that there was no
change at this level, so the following bug should still be there in
current versions (I'll try with a 2.4 at work tomorrow). On my
machine, xml.sax.make_parser returns an
xml.sax.expatreader.ExpatParser instance.
The problem is: I'm never given END_DOCUMENT events.
Code to reproduce:
from xml.dom.pulldom import parseString
reader = parseString('<element attribute="value">text</element>')
# The following 2 lines will produce, in order:
# START_DOCUMENT, START_ELEMENT, TEXT, END_ELEMENT
# Note the lack of the END_DOCUMENT event.
for event,node in reader:
print event
# The following line will get an END_DOCUMENT event
print reader.getEvent()[0]
# The following line will throw a SAXParseException,
# because the SAX parser's close method has been
# called twice
print reader.getEvent()[0]
Cause:
The xml.dom.pulldom.DOMEventStream.getEvent method, when it has no
more event in its internal stack, calls the SAX parser's close()
method (which is OK) then immediately returns 'None', ignoring any
event that could have been generated by the call to the close()
method. If you call getEvent later, it will send you the remaining
events until there are no more left, and then will call the SAX
parser's close() method again, causing a SAXParseException.
Because expat (an maybe other parsers too) has no way to know when the
document ends, it generates the endDocument/END_DOCUMENT event only
when explicitely told that the XML chunk is the final one (i.e. when
the close() method is called)
Proposed fix:
Add a "parser_closed" attribute to the DOMEventStream class,
initialized to "False". After having called self.parser.close() in the
xml.dom.pulldom.DOMEventStream.getEvent method, immediately set this
"parser_closed" attribute to True and proceed. Finally, at the
beginning of the "while" loop, immediately returns "None" if
"parser_closed" is "True" to prevent a second call to
self.parser.close().
With this change, any call to getEvent when there are no event left
will return None and never throw an exception, which I think is the
expected behavior.
Proposed code:
The "closed" attribute is initialized in the "__init__" method:
def __init__(self, stream, parser, bufsize):
self.stream = stream
self.parser = parser
self.parser_closed = False
self.bufsize = bufsize
if not hasattr(self.parser, 'feed'):
self.getEvent = self._slurp
self.reset()
The "getEvent" method becomes:
def getEvent(self):
# use IncrementalParser interface, so we get the desired
# pull effect
if not self.pulldom.firstEvent[1]:
self.pulldom.lastEvent = self.pulldom.firstEvent
while not self.pulldom.firstEvent[1]:
if self.parser_closed:
return None
buf = self.stream.read(self.bufsize)
if buf:
self.parser.feed(buf)
else:
self.parser.close()
self.parser_closed = True
rc = self.pulldom.firstEvent[1][0]
self.pulldom.firstEvent[1] = self.pulldom.firstEvent[1][1]
return rc
The same problem seems to exist in the
xml.dom.pulldom.DOMEventStream._slurp method, when the SAX parser is
not an IncrementalParser, as the parser's close() method is never
called. I suggest adding a call to the close() method in there.
However, as I only have expat as an option, which implements
IncrementalParser, I can't test it...
The _slurp method would become:
def _slurp(self):
""" Fallback replacement for getEvent() using the
standard SAX2 interface, which means we slurp the
SAX events into memory (no performance gain, but
we are compatible to all SAX parsers).
"""
self.parser.parse(self.stream)
self.parser.close()
self.getEvent = self._emit
return self._emit()
The _emit method raises exceptions when there are no events left, so I
propose changing it to:
def _emit(self):
""" Fallback replacement for getEvent() that emits
the events that _slurp() read previously.
"""
if not self.pulldom.firstEvent[1]:
return None
rc = self.pulldom.firstEvent[1][0]
self.pulldom.firstEvent[1] = self.pulldom.firstEvent[1][1]
return rc
Hope this helps.
--
Thomas Broyer
More information about the Python-Dev
mailing list