Hi,
Christian Heimes, 30.10.2011 14:58:
I did some additional debugging with valgrind and found the code segment that causes the memory leak. Well, it's not a real memory leak but a feature. ;)
It's in libxml2's SAX2.c in the function xmlSAX2StartElementNs():

    /*
     * when validating, the ID registration is done at the attribute
     * validation level. Otherwise we have to do specific handling here.
     */
    if (xmlStrEqual(fullname, BAD_CAST "xml:id")) {
        /*
         * Add the xml:id value
         *
         * Open issue: normalization of the value.
         */
        if (xmlValidateNCName(value, 1) != 0) {
            xmlErrValid(ctxt, XML_DTD_XMLID_VALUE,
                        "xml:id : attribute value %s is not an NCName\n",
                        (const char *) value, NULL);
        }
        xmlAddID(&ctxt->vctxt, ctxt->myDoc, value, ret);
    } else if (xmlIsID(ctxt->myDoc, ctxt->node, ret))
        xmlAddID(&ctxt->vctxt, ctxt->myDoc, value, ret);
    else if (xmlIsRef(ctxt->myDoc, ctxt->node, ret))
        xmlAddRef(&ctxt->vctxt, ctxt->myDoc, value, ret);
    }
libxml2 keeps a reference when it finds an xml:id attribute. I don't see a way to remove the reference from lxml. The Python wrapper doesn't expose xmlRemoveID (http://www.xmlsoft.org/html/libxml-valid.html#xmlRemoveID). For now you can work around the issue by removing the xml:id attribute from your document when you unload it.
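A variant of this workaround is to remove the xml:id attributes from the raw bytes before the parser sees them, so that libxml2 never registers them in its ID table. The following is only a minimal sketch under stated assumptions: strip_xml_ids is a hypothetical helper name, the input is assumed to be in an ASCII-compatible encoding such as UTF-8, and the regular expression is a plain textual match, so it would also remove anything that looks like an xml:id attribute inside comments, CDATA sections or text content.

```python
import re

# Matches one xml:id attribute together with its leading whitespace.
# Handles both double-quoted and single-quoted values.
# Assumption: a textual match is good enough for the documents at hand.
_XML_ID_ATTR = re.compile(rb"""\s+xml:id\s*=\s*(?:"[^"]*"|'[^']*')""")


def strip_xml_ids(data: bytes) -> bytes:
    """Return a copy of the raw XML bytes with all xml:id attributes removed."""
    return _XML_ID_ATTR.sub(b"", data)


sample = b'<doc><w xml:id="w1" class="x">a</w><w xml:id=\'w2\'>b</w></doc>'
cleaned = strip_xml_ids(sample)
```

The cleaned bytes could then be handed to the parser, for example with lxml.etree.fromstring(cleaned). The cost is that the whole file has to be held in memory as a byte string first.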
Thanks for investigating! Would it perhaps be an idea to expose xmlRemoveID explicitly in lxml?

I tried removing the xml:id attributes as a workaround, but it seems the damage is already done, and this doesn't free the memory either:

    for element in d.xpath('//@xml:id/..'):
        del element.attrib['{http://www.w3.org/XML/1998/namespace}id']

The only workaround I see now is to strip xml:id before calling lxml, which is a bit undesirable, as I first have to load the file into memory myself, do a string replace, and then pass the result to lxml.

On 10/31/2011 01:29 PM, Stefan Behnel wrote:
Interesting. Thanks for investigating this.
I found this code in xmlFreeDoc():
    if (cur->ids != NULL) xmlFreeIDTable((xmlIDTablePtr) cur->ids);
    cur->ids = NULL;
So there is code to free the ID table on document deallocation. But that doesn't seem to be enough to free all of the memory. Maybe there's a bug in libxml2 that leaks some additional memory here, or maybe there's something that lxml can do to free the rest as well. I don't know. I think the code in xmlAddID() is worth another look or two.
OK, so the issue may be within libxml2 itself and even manifest if I were to rewrite my test in C++? This seems like an important issue worth fixing.

I don't know if this is possibly relevant (from http://xmlsoft.org/xmlmem.html); you're probably already aware of it:

***
You may encounter that your process using libxml2 does not have a reduced memory usage although you freed the trees. This is because libxml2 allocates memory in a number of small chunks. When freeing one of those chunks, the OS may decide that giving this little memory back to the kernel will cause too much overhead and delay the operation. As all chunks are this small, they get actually freed but not returned to the kernel. On systems using glibc, there is a function call "malloc_trim" from malloc.h which does this missing operation (note that it is allowed to fail). Thus, after freeing your tree you may simply try "malloc_trim(0);" to really get the memory back. If your OS does not provide malloc_trim, try searching for a similar function.
***

Regards,

--
Maarten van Gompel (Proycon)
E-mail: proycon@anaproy.nl
Homepage: http://proycon.anaproy.nl
Google+: https://plus.google.com/105334152965507305708
Facebook: http://facebook.com/proycon
Twitter: http://twitter.com/proycon
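P.S. For reference, the malloc_trim(0) call described in the libxml2 memory notes can also be tried from Python through ctypes after the tree has been freed. This is a rough sketch, not something lxml provides: trim_heap is a hypothetical helper name, and it assumes the process is linked against glibc. On other C libraries the symbol is missing and the function simply returns False.

```python
import ctypes
import ctypes.util


def trim_heap() -> bool:
    """Ask glibc to hand free heap memory back to the OS.

    Returns True only if malloc_trim reports that memory was released.
    Returns False if nothing was released or malloc_trim is unavailable.
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return False
    try:
        libc = ctypes.CDLL(libc_name)
        malloc_trim = libc.malloc_trim  # glibc-specific; absent elsewhere
    except (OSError, AttributeError):
        return False
    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
    # malloc_trim(0) keeps no padding at the top of the heap.
    return malloc_trim(0) == 1


released = trim_heap()
```

If memory usage drops after such a call, the memory was freed but still held by the allocator rather than leaked. If it stays high, the ID table handling is the more likely suspect.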