Hi,

please avoid top-posting.

Alon Horev, 13.06.2012 23:10:
On Wed, Jun 13, 2012 at 11:20 PM, Stefan Behnel wrote:
Alon Horev, 13.06.2012 20:49:
    for event, element in etree.iterparse(StringIO(xml)):
        # ... do something with the element
        element.clear()                 # clean up children
        while element.getprevious() is not None:
            del element.getparent()[0]  # clean up preceding siblings
Ah, yes, right. Thanks for catching that. Calling .clear() not only cleans up the children but also deletes the text content and the *tail* text. That is the actual problem with this code, because it touches tree state after the current (or latest) element.
how come the child's tail affects the parent? does the tail attribute reach up to the parent?
During incremental parsing, the parser needs to hold on to the last node that it parsed in order to be able to continue from it. That node may be a text node in the internal tree, and it can happen that it's the tail text of the last element that it parsed. Clearing that element will then delete that tail text and leave the parser in an illegal state.

I've been considering for a while now setting a flag on the document while it is being iterparsed. The tree modification methods could then take that into account. I think it makes sense to eventually do that. It would make iterparse() much safer and shouldn't have much of a performance impact. It's not entirely thought through yet, though...
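To make the tail issue concrete, here is a small sketch of how tail text hangs off an element and what .clear() does to it. It uses the standard library's xml.etree.ElementTree, whose text/tail model lxml.etree follows, so it runs without lxml installed; the sample document is made up for illustration. It only shows the data model, not the parser state problem itself.

```python
import xml.etree.ElementTree as ET

# "after a" sits between </a> and <b/> in the document, but in the
# ElementTree model it is stored on <a> as its tail, not on <root>.
root = ET.fromstring("<root><a>inner</a>after a<b/></root>")
a = root[0]

text_before = a.text   # text inside <a>
tail_before = a.tail   # text that follows </a>

# clear() drops children and attributes, and also resets text and tail.
a.clear()

serialized_after = ET.tostring(root, encoding="unicode")
print(repr(text_before), repr(tail_before), repr(a.tail))
print(serialized_after)
```

So clearing an element also removes text that, in document order, comes after its end tag, which is exactly the part of the tree the parser may still be holding on to.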
if so, will this be better?:
    for event, element in iterparse(f, tag="bla"):
        yield element
        for child in element:
            child.clear()  # this might reach to its parent, which is bla, which is ok because it's an 'end' event.
You can safely do this:
    for event, element in etree.iterparse(f, tag="bla"):
        yield element
        del element[:]                  # discard children
        while element.getprevious() is not None:
            del element.getparent()[0]  # clean up preceding siblings
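For reference, a self-contained sketch of that loop wrapped in a generator, assuming lxml is installed. The function name iter_elements and the sample document are hypothetical and only there for illustration. Note that the consumer has to finish with each element before asking for the next one, because the cleanup runs as soon as the generator resumes.

```python
from io import BytesIO

from lxml import etree


def iter_elements(source, tag):
    """Yield each finished <tag> element, then free what is no longer needed.

    Hypothetical helper: 'source' is a file name or a binary file-like object.
    """
    for event, element in etree.iterparse(source, tag=tag):
        yield element
        del element[:]                  # discard children, keep the tail
        while element.getprevious() is not None:
            del element.getparent()[0]  # clean up preceding siblings


# Made-up sample input; a real use case would pass a large file instead.
xml = b"<root><bla><x>1</x></bla><bla><x>2</x></bla><bla><x>3</x></bla></root>"

# Each element is fully processed before the next one is requested.
values = [[child.text for child in el] for el in iter_elements(BytesIO(xml), "bla")]
print(values)
```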
as you see, I'm looking at ways to process a file that doesn't fit into memory.
That's the obvious use case.

Stefan