New subject: [lxml-dev] Re: Remarks on implementing iterparse()

May 7, 2006

Hi all,

since I won't have the time to implement iterparse() any time soon, here's a
proposal on how it should be implemented, in case someone wants to take a shot
at it.

"iterparse" will be (or will return) an iterable object, let's call it
IterParse for clarity. A class is basically the only way of implementing
iterators in Pyrex. For the internal SAX part, IterParse will likely work a
lot like lxml.sax.ElementTreeContentHandler.

We'd need a custom wrapper around the default libxml2 SAX handler to intercept
the parse events (this means implementing C helper functions for the SAX
events) /after/ they were processed by libxml2. See xmlSAXVersion (SAX2.h) on
how to retrieve the default SAX2 handler structure.

IterParse should pass chunks into the parser and buffer the events it
receives. When its __next__() method is called, it returns one buffered event,
feeding further chunks into the parser until there is an event to return.
This is needed as IterParse has to convert between libxml2 push (SAX) and
Python pull (iter).
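
To make the push/pull conversion a bit more concrete, here is a rough Python
sketch of the buffering loop. It is only an illustration under assumptions:
the stdlib expat parser stands in for the libxml2 push parser, the events
carry tag names instead of real elements, and the chunk_size parameter is
made up for the example.

```python
import io
from collections import deque
from xml.parsers import expat


class IterParse:
    """Pull-style iterator over the events of a push-style parser (sketch)."""

    def __init__(self, source, chunk_size=1024):
        self._source = source          # file-like object opened in binary mode
        self._chunk_size = chunk_size  # hypothetical parameter for this sketch
        self._events = deque()         # events buffered between __next__() calls
        self._finished = False
        # Assumption: expat replaces the libxml2 push parser here.
        self._parser = expat.ParserCreate()
        self._parser.StartElementHandler = self._on_start
        self._parser.EndElementHandler = self._on_end

    # The callbacks only record events; they never return them to the caller.
    def _on_start(self, name, attrs):
        self._events.append(("start", name))

    def _on_end(self, name):
        self._events.append(("end", name))

    def __iter__(self):
        return self

    def __next__(self):
        # Feed chunks until an event is buffered or the input is exhausted.
        while not self._events and not self._finished:
            chunk = self._source.read(self._chunk_size)
            if chunk:
                self._parser.Parse(chunk, False)
            else:
                self._parser.Parse(b"", True)  # signal end of input
                self._finished = True
        if not self._events:
            raise StopIteration
        return self._events.popleft()
```

Used as "for event, name in IterParse(io.BytesIO(data)): ...", the caller
pulls one event at a time while the parser is only fed as much input as is
needed to produce the next event.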

As for the input to the libxml2 parser, there are two possible ways: one is to
pass data chunks in through xmlParseChunk, and the other is to use
xmlCreateIOParserCtxt and implement an xmlInputReadCallback (xmlIO.h) to have
libxml2 request data by itself.
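
The second variant inverts the control flow: the parser asks for data instead
of being handed chunks. A small Python model of that callback contract might
look as follows. This is a sketch, not libxml2 code: make_read_callback and
pull_all are hypothetical names, and pull_all merely plays the role of the
parser requesting data. The C callback additionally receives a context
pointer and may return -1 on error, which is left out here.

```python
import io


def make_read_callback(source):
    """Return a reader that mimics the shape of an xmlInputReadCallback."""
    def read_callback(buffer, length):
        # Copy at most `length` bytes into the caller's buffer and report
        # how many bytes were written; 0 means end of input.
        data = source.read(length)
        buffer[:len(data)] = data
        return len(data)
    return read_callback


def pull_all(read_callback, buffer_size=4096):
    """Stand-in for the parser side: keep requesting data until EOF."""
    buffer = bytearray(buffer_size)
    received = bytearray()
    while True:
        count = read_callback(buffer, buffer_size)
        if count == 0:
            break
        received += buffer[:count]
    return bytes(received)
```

With xmlParseChunk, IterParse decides when to read from the file; with the
read callback, that decision moves into the parser and IterParse only
supplies the bytes on request.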

Python events (start, end, start-ns, end-ns) are created as follows:

* "*-ns" events must be extracted from the libxml2 xmlSAX2StartElementNs call
(passed in the arguments "prefix"/"URI" and the xmlChar** array "namespaces",
which holds the declared prefix/URI pairs). They must be stored on a stack to
build the respective "end-ns" events.

* "start" is somewhat tricky, as it would be a bad idea to allow modifications
of the XML structure during that iterator cycle. Maybe it's enough to document
that, but there may be ways to crash lxml with certain tree operations. Note
also that care has to be taken to prevent Python from garbage collecting the
element before the "end" event. The best way to do that is to store a Python
reference to that element on a stack.

* "end" is simple then: pop the element from the stack and return it.
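
The stack handling described in the list above could be sketched like this in
plain Python. It is a sketch under assumptions: EventBuilder is a hypothetical
helper, start_element()/end_element() stand in for the intercepted SAX
callbacks, plain strings stand in for elements, and the event values follow
ElementTree's iterparse() conventions ((prefix, uri) for "start-ns", None for
"end-ns").

```python
class EventBuilder:
    """Builds iterparse-style events from start/end element callbacks."""

    def __init__(self):
        self.events = []
        self._elements = []    # holds references so elements survive until "end"
        self._ns_counts = []   # number of namespace declarations per open element

    def start_element(self, element, namespaces=()):
        # `namespaces` is a sequence of (prefix, uri) pairs declared on this tag.
        for prefix, uri in namespaces:
            self.events.append(("start-ns", (prefix, uri)))
        self._ns_counts.append(len(namespaces))
        self._elements.append(element)
        self.events.append(("start", element))

    def end_element(self):
        # Pop the element that was kept alive since its "start" event.
        element = self._elements.pop()
        self.events.append(("end", element))
        # Close every namespace declaration opened by the matching start tag.
        for _ in range(self._ns_counts.pop()):
            self.events.append(("end-ns", None))
```

Keeping the element on the stack is what prevents it from being collected
between "start" and "end"; the parallel counter stack is one possible way to
know how many "end-ns" events belong to each closing tag.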

That's all I can come up with so far. So, if anyone is interested in taking a
look at it, I'd be glad to hear about it. :)

Stefan
