[lxml-dev] Remarks on implementing iterparse()

Hi all,

since I won't have the time to implement iterparse() any time soon, here's a proposal on how it should be implemented, in case someone wants to take a shot at it.

"iterparse" will be (or will return) an iterable object, let's call it IterParse for clarity. A class is basically the only way of implementing iterators in Pyrex.

For the internal SAX part, IterParse will likely work a lot like lxml.sax.ElementTreeContentHandler. We'd need a custom wrapper to the default libxml2 SAX handler to intercept the parse events (this means implementing C helper functions for the SAX events) /after/ they were processed by libxml2. See xmlSAXVersion (SAX2) on how to retrieve the SAX2 default parser structure.

IterParse should pass chunks into the parser and buffer the events it receives. When its __next__() method is called, it returns one event or passes new chunks until there is an event to return. This is needed as IterParse has to convert between libxml2 push (SAX) and Python pull (iter).

As for the input to the libxml2 parser, there are two possible ways: one is to pass data chunks in through xmlParseChunk and the other is to use xmlCreateIOParserCtxt and implement xmlInputReadCallback (xmlio.h) to have libxml2 request data by itself.

Python events (start, end, start-ns, end-ns) are created as follows:

* "*-ns" events must be extracted from the libxml2 xmlSAX2StartElementNs call (passed in arguments "prefix"/"URI" and the char* array "namespaces"). They must be stored on a stack to build the respective "end-ns" events.
* "start" is somewhat tricky, as it would be a bad idea to allow modifications of the XML structure during that iterator cycle. Maybe it's enough to document that, but there may be ways to crash lxml with certain tree operations. Note also that care has to be taken to prevent Python from garbage collecting the element before the "end" event. The best way to do that is to store a Python reference to that element on a stack.
* "end" is simple then: pop the element from the stack and return it.

That's all I can come up with so far. So, if anyone is interested in taking a look at it, I'd be glad to hear about it. :)

Stefan
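The push-to-pull conversion described in this proposal can be sketched in plain Python. This is only an illustration under loud assumptions: it uses the standard library's expat bindings as a stand-in for the libxml2 SAX handler, builds the tree itself with xml.etree.ElementTree (libxml2 would do that on its own), ignores text content and namespaces, and the class name IterParseSketch and its chunk_size parameter are hypothetical.

```python
import io
from collections import deque
from xml.parsers import expat
import xml.etree.ElementTree as ET


class IterParseSketch:
    """Pull-style iterator on top of a push-style (SAX-like) parser."""

    def __init__(self, source, chunk_size=16):
        self._source = source          # file-like object opened in binary mode
        self._chunk_size = chunk_size
        self._events = deque()         # events buffered by the callbacks
        self._stack = []               # keeps open elements referenced until "end"
        self._done = False
        self._parser = expat.ParserCreate()
        self._parser.StartElementHandler = self._start
        self._parser.EndElementHandler = self._end

    def _start(self, tag, attrib):
        # Stand-in for intercepting the start-element SAX event.
        if self._stack:
            elem = ET.SubElement(self._stack[-1], tag, attrib)
        else:
            elem = ET.Element(tag, attrib)
        self._stack.append(elem)
        self._events.append(("start", elem))

    def _end(self, tag):
        # "end" is simple: pop the element from the stack and report it.
        self._events.append(("end", self._stack.pop()))

    def __iter__(self):
        return self

    def __next__(self):
        # Feed chunks until at least one event is buffered, then return it.
        while not self._events:
            if self._done:
                raise StopIteration
            chunk = self._source.read(self._chunk_size)
            if chunk:
                self._parser.Parse(chunk, False)
            else:
                self._parser.Parse(b"", True)   # signal end of input
                self._done = True
        return self._events.popleft()


if __name__ == "__main__":
    data = io.BytesIO(b"<a><b/><c>x</c></a>")
    for event, elem in IterParseSketch(data, chunk_size=4):
        print(event, elem.tag)
```

Feeding chunks from __next__() corresponds to the xmlParseChunk variant of the input handling; the read-callback variant would instead let the parser pull data from the source itself.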

Hi Stefan, Haven't read your whole proposal yet, but I believe that libxml2 also offers a newer 'reader' interface besides the SAX interface that we may want to consider for implementing iterparse. It's based on the C# xmlReader interface and uses an iterator based approach already. It might therefore be a better match for iterparse() implementation than SAX. Unfortunately xmlsoft.org looks unreachable at the moment, but I found a slide on it: http://veillard.com/Talks/2003Guadec/slide5-1.html Regards, Martijn

Hi Martijn, Martijn Faassen wrote:
Haven't read your whole proposal yet, but I believe that libxml2 also offers a newer 'reader' interface besides the SAX interface that we may want to consider for implementing iterparse. It's based on the C# xmlReader interface and uses an iterator based approach already. It might therefore be a better match for iterparse() implementation than SAX.
Yup, I considered that after I had checked that libxml2's SAX parser builds a tree step-by-step exactly the way iterparse wants it. What I did not like about XmlTextReader in this context:

* the interface forces us to do everything on our own: build node instances, add attributes, etc.
* "Note, however that the node instance returned by the Expand() call is only valid until the next Read() operation." (xmlreader.html) - segfault included!
* readers have an "expand" command that expands the entire subtree of the current node to retrieve a node reference. iterparse neither wants nor needs this.

So, I'm pretty convinced it's easier to use SAX the way I proposed. iterparse is so SAX-like that implementing it on top of a tree-building SAX parser should be easiest.
Unfortunately xmlsoft.org looks unreachable at the moment,
I usually go for file:///usr/share/doc/packages/libxml2-devel/html/ in these cases. :) There's a file "xmlreader.html" in there, which describes the interface to a certain extent. Regards, Stefan

Stefan Behnel wrote:
* "*-ns" events must be extracted from the libxml2 xmlSAX2StartElementNs call (passed in arguments "prefix"/"URI" and the char* array "namespaces"). They must be stored on a stack to build the respective "end-ns" events.
footnote: ET guarantees that start-ns and end-ns events nest properly. I don't know how libxml2 handles this, but the SAX specification explicitly says that end events may appear out of order: For elements with multiple namespace declarations, the startPrefixMapping() calls won't necessarily nest with the endPrefixMapping() because those endPrefixMapping() calls may be made in any order. assuming that libxml2 isn't doing something really strange here, using a stack should take care of this. </F>
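For reference, the event order that ElementTree itself produces can be observed with the standard library's xml.etree.ElementTree.iterparse. The snippet below is a small illustration only; the sample document and the variable names are made up for this purpose.

```python
import io
import xml.etree.ElementTree as ET

xml_data = b'<root xmlns:a="urn:a"><a:child/></root>'

seen = []
for event, payload in ET.iterparse(
        io.BytesIO(xml_data),
        events=("start", "end", "start-ns", "end-ns")):
    if event in ("start", "end"):
        # payload is the element
        seen.append((event, payload.tag))
    else:
        # payload is a (prefix, uri) tuple for "start-ns", None for "end-ns"
        seen.append((event, payload))

for item in seen:
    print(item)
```

The "start-ns" event arrives before the "start" of the declaring element and the matching "end-ns" after its "end", so the namespace events wrap the element events.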

Hi Fredrik, Fredrik Lundh wrote:
Stefan Behnel wrote:
* "*-ns" events must be extracted from the libxml2 xmlSAX2StartElementNs call (passed in arguments "prefix"/"URI" and the char* array "namespaces"). They must be stored on a stack to build the respective "end-ns" events.
footnote: ET guarantees that start-ns and end-ns events nest properly.
Hmm, does it? As far as I can see, ET's end-ns events always return None for the element, so there is no visible nesting between start-ns and end-ns. end-ns events are sufficiently semantics-free to allow generation in arbitrary order. Maybe you just meant that all *-ns events should be outside the related start/end events, not (partially) inside?

My footnote: The current implementation in lxml doesn't do the above anyway. It simply traverses the namespace declarations of the new element and builds an event for each of them. So it only has to intercept the startElementNs and endElementNs SAX events.

Stefan
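The approach of building one event per namespace declaration of the new element can be sketched as follows. This is not lxml's code; the class NsEventStack and its two methods are hypothetical names, and the (prefix, uri) pairs are assumed to have been extracted from the startElementNs callback already. Since "end-ns" events carry None, the stack only needs to remember how many declarations each open element had.

```python
class NsEventStack:
    """Derives start-ns/end-ns events from per-element namespace declarations."""

    def __init__(self):
        # One entry per open element: number of namespaces it declared.
        self._counts = []

    def start_element(self, declarations):
        """declarations: list of (prefix, uri) pairs declared on this element."""
        self._counts.append(len(declarations))
        return [("start-ns", (prefix, uri)) for prefix, uri in declarations]

    def end_element(self):
        """Return one end-ns event per declaration of the element being closed."""
        count = self._counts.pop()
        return [("end-ns", None)] * count


if __name__ == "__main__":
    tracker = NsEventStack()
    print(tracker.start_element([("a", "urn:a"), ("b", "urn:b")]))
    print(tracker.start_element([]))   # child without declarations
    print(tracker.end_element())       # closes the child
    print(tracker.end_element())       # closes the outer element
```

The caller would emit the "start-ns" events before the element's "start" event and the "end-ns" events after its "end" event.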
participants (3): Fredrik Lundh, Martijn Faassen, Stefan Behnel