How come the child's tail affects the parent? Does the tail attribute reach up into the parent?
If so, would this be better?
    for event, element in iterparse(f, tag="bla"):
        yield element
        for child in element:
            # This might reach into its parent, which is bla; that is OK because it's an 'end' event.
            child.clear()
I can probably also clear the children of previously parsed siblings of different tags (nephews).
As you can see, I'm looking for ways to process a file that doesn't fit into memory.
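For reference, here is a self-contained sketch of that idea. The name iter_records and the sample document are made up for illustration, and the tag filter is done by hand (instead of lxml's tag= argument) so that the same code also runs on the standard library's ElementTree if lxml is not installed. It only shrinks the children of each matched element; the emptied child elements and the matched element itself stay in the tree.

```python
import io

try:
    from lxml.etree import iterparse  # the library discussed in this thread
except ImportError:
    from xml.etree.ElementTree import iterparse  # stdlib fallback


def iter_records(source, tag):
    """Hypothetical helper: yield each finished <tag> element, then shrink it."""
    for event, element in iterparse(source, events=("end",)):
        if element.tag != tag:
            continue
        yield element
        # This runs when the consumer asks for the next item. The element has
        # already received its 'end' event, so its children (including their
        # tails) are completely parsed and lie inside the element.
        for child in element:
            child.clear()


sample = io.BytesIO(
    b"<root><bla><x>1</x>t1<y>2</y></bla><other/><bla><x>3</x></bla></root>"
)
values = [record.findtext("x") for record in iter_records(sample, "bla")]
```

The consumer has to be done with an element before it requests the next one, because the children are emptied at that point.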
Thanks for the help,
Alon Horev
On Wed, Jun 13, 2012 at 11:20 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
[fixed top-posting and code formatting]
Note that it's better to send plain text messages when posting to public
mailing lists than HTML formatted messages.
Alon Horev, 13.06.2012 20:49:
> On Wed, Jun 13, 2012 at 9:30 PM, Stefan Behnel wrote:
>> Alon Horev, 13.06.2012 20:16:
>>> from lxml.etree import iterparse
>>>
>>> def safe_iterparse(*args, **kwargs):
>>>     for event, element in iterparse(*args, **kwargs):
>>>         try:
>>>             yield (event, element)
>>>         finally:
>>>             element.clear()
>>
>> This is a known limitation of the current implementation:
>>
>> http://lxml.de/parsing.html#modifying-the-tree
>
> the doc does warn: 'You should also avoid moving or discarding the element
> itself.'
> but the example does exactly what I do, which is to clear the element after
> the 'end' event. isn't the example contradicting the warning?
>
> >>> for event, element in etree.iterparse(StringIO(xml)):
> ...     # ... do something with the element
> ...     element.clear()                 # clean up children
> ...     while element.getprevious() is not None:
> ...         del element.getparent()[0]  # clean up preceding siblings
Ah, yes, right. Thanks for catching that. Calling .clear() not only cleans
up the children but also deletes the text content and the *tail* text. That
is the actual problem with this code, because it touches tree state after
the current (or latest) element.
I think it would be helpful to add a "with_tail" option to clear() for now.
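Until something like that exists, the effect can be approximated by hand. The following is only a sketch: clear_keep_tail is a hypothetical helper, not an lxml function, and the sample document is invented. It also shows why the tail matters to the parent: the tail is stored on the child element, but it is the text that follows the child's end tag inside the parent.

```python
try:
    from lxml import etree as ET  # the library discussed in this thread
except ImportError:
    import xml.etree.ElementTree as ET  # stdlib fallback, same tail semantics


def clear_keep_tail(element):
    """Hypothetical helper: clear an element but put its tail text back."""
    tail = element.tail
    element.clear()  # drops children, attributes, text and tail
    element.tail = tail


root = ET.fromstring("<root><a>text</a>tail of a<b>more</b>tail of b</root>")
a, b = root[0], root[1]

# The tail hangs off the child, but it is content of the parent:
# here it is the text between </a> and <b>.
tail_before = a.tail

a.clear()           # plain clear(): the parent loses "tail of a"
clear_keep_tail(b)  # helper: the parent keeps "tail of b"

# Collect all text that is still left anywhere under <root>.
remaining = "".join(root.itertext())
```

Saving and restoring the tail leaves the text that follows the element untouched in terms of content, although the helper still assigns to the tail attribute.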