On 27.10.2011 09:53, Maarten van Gompel (proycon) wrote:
Hi,
As my memory leak problem persisted, I conducted further experiments to try to determine the cause. XML files in almost all other formats were processed without leaking, so there had to be something in my format that triggered the leak. I have now found the cause!
In my format I use the xml:id attribute (in the XML namespace) to assign unique identifiers to a large number of elements. It turns out that this triggers the leak! Something related to these identifiers is being kept in memory by lxml (or libxml2?) and never freed! When I rename xml:id to id (a plain attribute in no namespace), the memory leak problem is gone! This explanation is also consistent with my observation that whenever I load ANY document that was already loaded before, the leak does not occur.
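The two attribute forms compared above can be generated with a few lines of Python for testing. This is a minimal sketch, not the original test case: make_doc and its parameters are hypothetical names, and the fallback to the standard library's ElementTree is only there so that the snippet runs where lxml is not installed (the leak itself was observed with lxml).

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

# Clark-notation name of xml:id, i.e. "id" in the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def make_doc(n, use_xml_id=True):
    """Return a serialized document with n uniquely identified elements.

    make_doc is a hypothetical helper for building test input.
    """
    key = XML_ID if use_xml_id else "id"
    root = etree.Element("doc")
    for i in range(n):
        child = etree.SubElement(root, "w")
        child.set(key, "e%d" % i)  # unique identifier per element
    return etree.tostring(root)
```

Parsing the output of make_doc(100000, True) and of make_doc(100000, False) in a loop while watching the process memory should show whether the difference described above also appears in a given setup.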
I did some additional debugging with valgrind and found the code segment that causes the memory leak. Well, it's not a real memory leak but a feature. ;)

    ==10726== 3,052,864 bytes in 95,402 blocks are still reachable in loss record 1,541 of 1,541
    ==10726==    at 0x4C28F9F: malloc (vg_replace_malloc.c:236)
    ==10726==    by 0x89541CD: xmlDictLookup (in /usr/lib/libxml2.so.2.7.8)
    ==10726==    by 0x88B667B: xmlHashAddEntry3 (in /usr/lib/libxml2.so.2.7.8)
    ==10726==    by 0x88C4F13: xmlAddID (in /usr/lib/libxml2.so.2.7.8)
    ==10726==    by 0x895617D: xmlSAX2StartElementNs (in /usr/lib/libxml2.so.2.7.8)
    ==10726==    by 0x889B40F: ??? (in /usr/lib/libxml2.so.2.7.8)
    ==10726==    by 0x88A68CB: xmlParseElement (in /usr/lib/libxml2.so.2.7.8)
    ==10726==    by 0x88A7969: xmlParseDocument (in /usr/lib/libxml2.so.2.7.8)
    ==10726==    by 0x88A7CA4: ??? (in /usr/lib/libxml2.so.2.7.8)
    ==10726==    by 0x815E9E9: ??? (in /usr/lib/python2.7/dist-packages/lxml/etree.so)
    ==10726==    by 0x813C0F3: ??? (in /usr/lib/python2.7/dist-packages/lxml/etree.so)
    ==10726==    by 0x813D749: ??? (in /usr/lib/python2.7/dist-packages/lxml/etree.so)

It's in libxml2's SAX2.c in the function xmlSAX2StartElementNs():

    /*
     * when validating, the ID registration is done at the attribute
     * validation level. Otherwise we have to do specific handling here.
     */
    if (xmlStrEqual(fullname, BAD_CAST "xml:id")) {
        /*
         * Add the xml:id value
         *
         * Open issue: normalization of the value.
         */
        if (xmlValidateNCName(value, 1) != 0) {
            xmlErrValid(ctxt, XML_DTD_XMLID_VALUE,
                        "xml:id : attribute value %s is not an NCName\n",
                        (const char *) value, NULL);
        }
        xmlAddID(&ctxt->vctxt, ctxt->myDoc, value, ret);
    } else if (xmlIsID(ctxt->myDoc, ctxt->node, ret))
        xmlAddID(&ctxt->vctxt, ctxt->myDoc, value, ret);
    else if (xmlIsRef(ctxt->myDoc, ctxt->node, ret))
        xmlAddRef(&ctxt->vctxt, ctxt->myDoc, value, ret);
    }

libxml2 keeps a reference when it finds a xml:id attribute. I don't see a way to remove the reference from lxml.
The Python wrapper doesn't expose xmlRemoveID (http://www.xmlsoft.org/html/libxml-valid.html#xmlRemoveID). For now you can work around the issue by removing the xml:id attributes from your document when you unload it.

Christian
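The workaround mentioned above, removing the xml:id attributes before the document is unloaded, could look roughly like this. It is only a sketch under assumptions: strip_xml_ids is a hypothetical helper name, the fallback import exists solely so the snippet runs without lxml, and whether deleting the attributes actually releases the memory may depend on the lxml and libxml2 versions in use.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

# Clark-notation name of xml:id, i.e. "id" in the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def strip_xml_ids(root):
    """Delete every xml:id attribute at or below root; return the count.

    strip_xml_ids is a hypothetical helper, not part of lxml.
    """
    removed = 0
    for element in root.iter("*"):  # "*" restricts the walk to elements
        if XML_ID in element.attrib:
            del element.attrib[XML_ID]
            removed += 1
    return removed


# Example: two xml:id attributes are removed, the plain id is left alone.
root = etree.fromstring('<doc><w xml:id="a"/><w xml:id="b"/><w id="c"/></doc>')
count = strip_xml_ids(root)
```

The call would go right before the last reference to the tree is dropped, for example strip_xml_ids(root) followed by del root.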