[Python-Dev] Bug: xml.dom.pulldom never gives you END_DOCUMENT events with an Expat parser

Thomas Broyer t.broyer at gmail.com
Mon Jun 12 00:27:00 CEST 2006


Hi,

First, sorry for posting this here, I closed my SourceForge account a
few months ago and I can't get it reopened...

I'm using python 2.2.1 but a diff on SVN showed that there was no
change at this level, so the following bug should still be there in
current versions (I'll try with a 2.4 at work tomorrow). On my
machine, xml.sax.make_parser returns an
xml.sax.expatreader.ExpatParser instance.

The problem is: I'm never given END_DOCUMENT events.

Code to reproduce:

from xml.dom.pulldom import parseString
reader = parseString('<element attribute="value">text</element>')
# The following 2 lines will produce, in order:
# START_DOCUMENT, START_ELEMENT, TEXT, END_ELEMENT
# Note the lack of the END_DOCUMENT event.
for event,node in reader:
   print event
# The following line will get an END_DOCUMENT event
print reader.getEvent()[0]
# The following line will throw a SAXParseException,
# because the SAX parser's close method has been
# called twice
print reader.getEvent()[0]


Cause:

The xml.dom.pulldom.DOMEventStream.getEvent method, when it has no
more event in its internal stack, calls the SAX parser's close()
method (which is OK) then immediately returns 'None', ignoring any
event that could have been generated by the call to the close()
method. If you call getEvent later, it will send you the remaining
events until there are no more left, and then will call the SAX
parser's close() method again, causing a SAXParseException.
Because expat (an maybe other parsers too) has no way to know when the
document ends, it generates the endDocument/END_DOCUMENT event only
when explicitely told that the XML chunk is the final one (i.e. when
the close() method is called)


Proposed fix:

Add a "parser_closed" attribute to the DOMEventStream class,
initialized to "False". After having called self.parser.close() in the
xml.dom.pulldom.DOMEventStream.getEvent method, immediately set this
"parser_closed" attribute to True and proceed. Finally, at the
beginning of the "while" loop, immediately returns "None" if
"parser_closed" is "True" to prevent a second call to
self.parser.close().
With this change, any call to getEvent when there are no event left
will return None and never throw an exception, which I think is the
expected behavior.


Proposed code:

The "closed" attribute is initialized in the "__init__" method:
    def __init__(self, stream, parser, bufsize):
        self.stream = stream
        self.parser = parser
        self.parser_closed = False
        self.bufsize = bufsize
        if not hasattr(self.parser, 'feed'):
            self.getEvent = self._slurp
        self.reset()

The "getEvent" method becomes:
    def getEvent(self):
        # use IncrementalParser interface, so we get the desired
        # pull effect
        if not self.pulldom.firstEvent[1]:
            self.pulldom.lastEvent = self.pulldom.firstEvent
        while not self.pulldom.firstEvent[1]:
            if self.parser_closed:
                return None
            buf = self.stream.read(self.bufsize)
            if buf:
                self.parser.feed(buf)
            else:
                self.parser.close()
                self.parser_closed = True
        rc = self.pulldom.firstEvent[1][0]
        self.pulldom.firstEvent[1] = self.pulldom.firstEvent[1][1]
        return rc

The same problem seems to exist in the
xml.dom.pulldom.DOMEventStream._slurp method, when the SAX parser is
not an IncrementalParser, as the parser's close() method is never
called. I suggest adding a call to the close() method in there.
However, as I only have expat as an option, which implements
IncrementalParser, I can't test it...
The _slurp method would become:
    def _slurp(self):
        """ Fallback replacement for getEvent() using the
            standard SAX2 interface, which means we slurp the
            SAX events into memory (no performance gain, but
            we are compatible to all SAX parsers).
        """
        self.parser.parse(self.stream)
        self.parser.close()
        self.getEvent = self._emit
        return self._emit()
The _emit method raises exceptions when there are no events left, so I
propose changing it to:
    def _emit(self):
        """ Fallback replacement for getEvent() that emits
            the events that _slurp() read previously.
        """
        if not self.pulldom.firstEvent[1]:
            return None
        rc = self.pulldom.firstEvent[1][0]
        self.pulldom.firstEvent[1] = self.pulldom.firstEvent[1][1]
        return rc

Hope this helps.

-- 
Thomas Broyer


More information about the Python-Dev mailing list