segfault - potential double free when using iterparse

Hi All, First of all, I'm not opening a bug yet as I'm not certain whether this is a CPython bug or lxml bug. I'm getting a segfault within python's GC (garbage collector) module. here's the stack trace: #0 0x00007fc7e9f6b76e in gc_list_remove (op=0x7fc79cef3d98) at Modules/gcmodule.c:211 #1 PyObject_GC_Del (op=0x7fc79cef3d98) at Modules/gcmodule.c:1503 #2 0x00007fc7e9f2ac0f in PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2894 #3 0x00007fc7e9ea5b79 in gen_send_ex (arg=None, exc=<value optimized out>, gen=<value optimized out>) at Objects/genobject.c:84 #4 0x00007fc7e9ea6185 in gen_close (self=<generator at remote 0x7fc7c1ba73c0>) at Objects/genobject.c:130 #5 gen_del (self=<generator at remote 0x7fc7c1ba73c0>) at Objects/genobject.c:165 #6 0x00007fc7e9ea5a1b in gen_dealloc (gen=0x7fc7c1ba73c0) at Objects/genobject.c:32 In order to see what object the gc is freeing i tried casting it to a PyObject (we're freeing a lxml object): (gdb) p (PyObject*) op $17 = <ml.etree.DocInfo.standalone.__get__ at remote 0x7fc79cef3d98> Similar bugs (http://osdir.com/ml/python.bugs/2000-12/msg00214.html) blame the extension module for calling dealloc explicitly more than once or doing forbidden things in __del__. this is how i use lxml: from lxml.etree import iterparse def safe_iterparse(*args, **kwargs): for event, element in iterparse(*args, **kwargs): try: yield (event, element) finally: element.clear() I tried parsing the same data but It didn't reproduce.. It happens after a few hours running in production while generating etree using the builder and parsing using the attached code. anyone familiar with these kind of bugs in c extensions/cpython/lxml? could you give pointers to what I should be looking for? should I open a bug? how can I debug lxml? some version info: CPython version: 2.7.2 on linux. lxml: 2.3.3 thanks, Alon Horev

Alon Horev, 13.06.2012 20:16:
from lxml.etree import iterparse
def safe_iterparse(*args, **kwargs): for event, element in iterparse(*args, **kwargs): try: yield (event, element) finally: element.clear()
This is a known limitation of the current implementation: http://lxml.de/parsing.html#modifying-the-tree Stefan

the doc does warn: 'You should also avoid moving or discarding the element itself.' but the example does exactly what I do, which is to clear the element after the 'end' event. isn't the example contradicting the warning?
for event, element in etree.iterparse(StringIO(xml)):... # ... do something with the element... element.clear() # clean up children... while element.getprevious() is not None:... del element.getparent()[0] # clean up preceding siblings
On Wed, Jun 13, 2012 at 9:30 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Alon Horev, 13.06.2012 20:16:
from lxml.etree import iterparse
def safe_iterparse(*args, **kwargs): for event, element in iterparse(*args, **kwargs): try: yield (event, element) finally: element.clear()
This is a known limitation of the current implementation:
http://lxml.de/parsing.html#modifying-the-tree
Stefan _________________________________________________________________ Mailing list for the lxml Python XML toolkit - http://lxml.de/ lxml@lxml.de https://mailman-mail5.webfaction.com/listinfo/lxml

[fixed top-posting and code formatting] Note that it's better to send plain text messages when posting to public mailing lists than HTML formatted messages. Alon Horev, 13.06.2012 20:49:
On Wed, Jun 13, 2012 at 9:30 PM, Stefan Behnel wrote:
Alon Horev, 13.06.2012 20:16:
from lxml.etree import iterparse
def safe_iterparse(*args, **kwargs): for event, element in iterparse(*args, **kwargs): try: yield (event, element) finally: element.clear()
This is a known limitation of the current implementation:
the doc does warn: 'You should also avoid moving or discarding the element itself.' but the example does exactly what I do, which is to clear the element after the 'end' event. isn't the example contradicting the warning?
for event, element in etree.iterparse(StringIO(xml)): ... # ... do something with the element ... element.clear() # clean up children ... while element.getprevious() is not None: ... del element.getparent()[0] # clean up preceding siblings
Ah, yes, right. Thanks for catching that. Calling .clean() not only cleans up the children but also deletes the text content and the *tail* text. That is the actual problem with this code, because it touches tree state after the current (or latest) element. I think it would be helpful to add a "with_tail" option to clear() for now. Stefan

how come the child's tail affects the parent? does the tail attribute reach up to the parent? if so, will this be better?: for event, element in iterparse(f, tag="bla"): yield element for child in element: child.clear() # this might reach to it's parent, which is bla, which is ok because it's an 'end' event. I can probably also clear the children of previously parsed siblings of different tags (nephews).. as you see, I'm looking at ways to process a file that doesn't fit into memory. thanks for the help, Alon Horev On Wed, Jun 13, 2012 at 11:20 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
[fixed top-posting and code formatting]
Note that it's better to send plain text messages when posting to public mailing lists than HTML formatted messages.
Alon Horev, 13.06.2012 20:49:
On Wed, Jun 13, 2012 at 9:30 PM, Stefan Behnel wrote:
Alon Horev, 13.06.2012 20:16:
from lxml.etree import iterparse
def safe_iterparse(*args, **kwargs): for event, element in iterparse(*args, **kwargs): try: yield (event, element) finally: element.clear()
This is a known limitation of the current implementation:
the doc does warn: 'You should also avoid moving or discarding the element itself.' but the example does exactly what I do, which is to clear the element after the 'end' event. isn't the example contradicting the warning?
for event, element in etree.iterparse(StringIO(xml)): ... # ... do something with the element ... element.clear() # clean up children ... while element.getprevious() is not None: ... del element.getparent()[0] # clean up preceding siblings
Ah, yes, right. Thanks for catching that. Calling .clean() not only cleans up the children but also deletes the text content and the *tail* text. That is the actual problem with this code, because it touches tree state after the current (or latest) element.
I think it would be helpful to add a "with_tail" option to clear() for now.
Stefan _________________________________________________________________ Mailing list for the lxml Python XML toolkit - http://lxml.de/ lxml@lxml.de https://mailman-mail5.webfaction.com/listinfo/lxml

Hi, please avoid top-posting. Alon Horev, 13.06.2012 23:10:
On Wed, Jun 13, 2012 at 11:20 PM, Stefan Behnel wrote:
Alon Horev, 13.06.2012 20:49:
for event, element in etree.iterparse(StringIO(xml)): ... # ... do something with the element ... element.clear() # clean up children ... while element.getprevious() is not None: ... del element.getparent()[0] # clean up preceding siblings
Ah, yes, right. Thanks for catching that. Calling .clean() not only cleans up the children but also deletes the text content and the *tail* text. That is the actual problem with this code, because it touches tree state after the current (or latest) element.
how come the child's tail affects the parent? does the tail attribute reach up to the parent?
During incremental parsing, the parser needs to hold on to the last node that it parsed in order to be able to continue from it. That node may be a text node in the internal tree, and it can happen that it's the tail text of the last element that it parsed. Clearing that element will then delete that tail text and leave the parser in an illegal state. I've been considering for a while now to set a flag on the document while it is being iterparsed. The tree modification methods could then take that into account. I think it makes sense to eventually do that. It would make iterparse() much safer and shouldn't imply much of a performance impact. It's not entirely thought through yet, though...
if so, will this be better?:
for event, element in iterparse(f, tag="bla"): yield element for child in element: child.clear() # this might reach to it's parent, which is bla, which is ok because it's an 'end' event.
You can safely do this:
for event, element in etree.iterparse(f, tag="bla"): ... yield element ... del element[:] # discard children ... while element.getprevious() is not None: ... del element.getparent()[0] # clean up preceding siblings
as you see, I'm looking at ways to process a file that doesn't fit into memory.
That's the obvious use case. Stefan
participants (2)
-
Alon Horev
-
Stefan Behnel