
Maarten van Gompel (proycon), 03.11.2011 11:36:
Christian Heimes, 30.10.2011 14:58:
I did some additional debugging with valgrind and found the code segment that causes the memory leak. Well, it's not a real memory leak but a feature. ;)
It's in libxml2's SAX2.c in the function xmlSAX2StartElementNs():

        /*
         * when validating, the ID registration is done at the attribute
         * validation level. Otherwise we have to do specific handling here.
         */
        if (xmlStrEqual(fullname, BAD_CAST "xml:id")) {
            /*
             * Add the xml:id value
             *
             * Open issue: normalization of the value.
             */
            if (xmlValidateNCName(value, 1) != 0) {
                xmlErrValid(ctxt, XML_DTD_XMLID_VALUE,
                            "xml:id : attribute value %s is not an NCName\n",
                            (const char *) value, NULL);
            }
            xmlAddID(&ctxt->vctxt, ctxt->myDoc, value, ret);
        } else if (xmlIsID(ctxt->myDoc, ctxt->node, ret))
            xmlAddID(&ctxt->vctxt, ctxt->myDoc, value, ret);
        else if (xmlIsRef(ctxt->myDoc, ctxt->node, ret))
            xmlAddRef(&ctxt->vctxt, ctxt->myDoc, value, ret);
    }

libxml2 keeps a reference when it finds an xml:id attribute. I don't see a way to remove the reference from lxml. The Python wrapper doesn't expose xmlRemoveID (http://www.xmlsoft.org/html/libxml-valid.html#xmlRemoveID). For now you can work around the issue by removing the xml:id attribute from your document when you unload it.
Thanks for investigating! Isn't it perhaps an idea to explicitly expose xmlRemoveID in lxml?
I'm not sure that would help (and you certainly wouldn't want that). No, the cleanup should be done at document deallocation time. I haven't figured out yet if it's a bug in libxml2 that it doesn't do it itself, or if lxml should do it... (well, lxml should *obviously* do it if possible, in order to properly support the currently released libxml2 versions...) This kind of problem requires some experimenting with libxml2's code and API, but I don't currently have much time to look into this, so if someone could dig into this deeply enough to come up with a solution, I'd be happy to apply it. The document cleanup happens in _Document.__dealloc__() in lxml.etree.pyx, and I already hinted at the relevant code in libxml2 (quoted further down). I think it's worth throwing gdb into the game.
I tried to unload the xml:id attributes as a workaround, but it seems the damage is already done and this doesn't free the memory either:

    for element in d.xpath('//@xml:id/..'):
        del element.attrib['{http://www.w3.org/XML/1998/namespace}id']

The only workaround I see now is to actively strip xml:id prior to calling lxml, which is a bit undesirable, as I first have to load the file into memory myself, do a string replace, and then pass it to lxml.
Yes, that's definitely too ugly to consider a viable work-around.
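For reference, the pre-parse stripping described above could look roughly like the sketch below. It is only an illustration of the string-replace idea, not something from lxml itself: the function name strip_xml_ids is made up, and the regular expression is a blunt assumption that would also hit text that merely looks like an xml:id attribute inside comments, CDATA sections or character data.

```python
import re

# Matches ' xml:id="..."' or " xml:id='...'" including the leading whitespace.
# Assumption: the attribute is always written with the literal "xml:" prefix.
_XML_ID_ATTR = re.compile(rb"""\s+xml:id\s*=\s*("[^"]*"|'[^']*')""")


def strip_xml_ids(data: bytes) -> bytes:
    """Return the serialized document with every xml:id attribute removed."""
    return _XML_ID_ATTR.sub(b"", data)


if __name__ == "__main__":
    sample = b'<root xml:id="a"><w xml:id=\'b\' n="1">text</w></root>'
    print(strip_xml_ids(sample).decode("ascii"))
```

The cleaned bytes would then be handed to the parser (for example lxml.etree.fromstring) instead of the original file contents, which is exactly the extra load-and-replace step that makes this approach unattractive.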
On 10/31/2011 01:29 PM, Stefan Behnel wrote:
Interesting. Thanks for investigating this.
I found this code in xmlFreeDoc():

    if (cur->ids != NULL)
        xmlFreeIDTable((xmlIDTablePtr) cur->ids);
    cur->ids = NULL;

So there is code to free the ID table on document deallocation. But that doesn't seem to be enough to free all of the memory. Maybe there's a bug in libxml2 that leaks some additional memory here, or maybe there's something that lxml can do to free the rest as well. I don't know. I think the code in xmlAddID() is worth another look or two.
OK, so the issue may be within libxml2 itself and even manifest if I were to rewrite my test in C++?
I would expect that, yes. It might also be worth asking on the libxml2 mailing list, although responses over there aren't guaranteed to come in a timely fashion.
This seems like an important issue worth fixing.
Absolutely.
I don't know if this is possibly relevant (from http://xmlsoft.org/xmlmem.html ), you're probably already aware of it:
*** You may encounter that your process using libxml2 does not have a reduced memory usage although you freed the trees. This is because libxml2 allocates memory in a number of small chunks. When freeing one of those chunks, the OS may decide that giving this little memory back to the kernel will cause too much overhead and delay the operation. As all chunks are this small, they get actually freed but not returned to the kernel. On systems using glibc, there is a function call "malloc_trim" from malloc.h which does this missing operation (note that it is allowed to fail). Thus, after freeing your tree you may simply try "malloc_trim(0);" to really get the memory back. If your OS does not provide malloc_trim, try searching for a similar function. ***
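To rule this explanation in or out from a Python test script, the malloc_trim(0) call mentioned in the quoted text can be tried through ctypes. This is a minimal sketch that assumes a glibc-based system; the helper name trim_heap is made up, and on platforms without malloc_trim it simply returns None.

```python
import ctypes
import ctypes.util


def trim_heap():
    """Ask glibc to hand free heap memory back to the OS.

    Returns the result of malloc_trim(0) (1 if memory was released,
    0 if not), or None when the C library does not provide malloc_trim.
    """
    try:
        # Fall back to the usual glibc soname if the lookup finds nothing.
        libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
        malloc_trim = libc.malloc_trim
    except (OSError, AttributeError):
        return None
    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
    return malloc_trim(0)


if __name__ == "__main__":
    print("malloc_trim(0) returned:", trim_heap())
```

If the process size stays high after deleting the tree and calling this, the memory is most likely still referenced somewhere rather than merely held back by the allocator.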
No, I don't think that's related. I trust that Linux is pretty good at memory management. This looks like a *real* memory leak, especially since valgrind considers the memory blocks still reachable (so there must still be a pointer to them *somewhere*).

Stefan