Memory leak when parsing XML files in sequence?
Hi,

I stumbled across what I think is a memory leak in the lxml module. I am parsing literally millions of mostly small XML files, in sequence, in the following simplified fashion:

index = glob.glob('/path/to/dir/with/huge/number/of/xml/files/*xml')
for f in index:
    d = lxml.etree.parse(f)

The problem is that on (almost) every iteration, memory usage increases. Note that d gets overwritten every time, so the reference to the previous document should be lost (I don't reference it anywhere else). Even an explicit 'del d' and gc.collect() within the loop doesn't help to clear up the extra memory. I used objgraph to debug a bit and the Python reference counts remain unchanged, as I would expect, leading me to conclude that this is a memory leak in the lxml module. This becomes problematic quickly when dealing with millions of XML files.

I attach a short log excerpt in which I extracted resident memory usage from ps after each iteration and measured the increase. Note that I only parse the documents, to be overwritten each time; I don't do anything else with them in this test case.

Is this a known problem? Is there anything else I explicitly need to do to free the memory used?

The problem does not reproduce if I reload the same document over and over again; memory usage remains constant then. It only happens when new documents are loaded, and even then in some rare cases the problem does not occur for some or several iterations, most notably at the start of the log.

I also attach an example of an XML file.

Python 2.7.2 (Ubuntu 11.10, x86_64)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Regards,

--
Maarten van Gompel (Proycon)
E-mail: proycon@anaproy.nl
Homepage: http://proycon.anaproy.nl
Google+: https://plus.google.com/105334152965507305708
Facebook: http://facebook.com/proycon
Twitter: http://twitter.com/proycon
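(For reference, a minimal sketch of the kind of per-iteration measurement described above; it assumes psutil is installed and uses the same get_memory_info() call that appears later in this thread. The original log was taken from ps, and the glob pattern is a placeholder.)

import glob
import os

import lxml.etree
import psutil

proc = psutil.Process(os.getpid())
index = glob.glob('/path/to/dir/with/huge/number/of/xml/files/*xml')

last_rss = proc.get_memory_info().rss  # resident set size in bytes
for f in index:
    d = lxml.etree.parse(f)            # parse and immediately discard
    rss = proc.get_memory_info().rss
    print '%s: rss=%d (+%d)' % (f, rss, rss - last_rss)
    last_rss = rss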
Hi,

A short follow-up on my earlier message: when I use cElementTree instead of lxml, the problem disappears and everything behaves as expected. This seems to confirm that it is a memory leak in lxml itself.

Regards,
Maarten van Gompel (Proycon)
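(A minimal sketch of that comparison, assuming the cElementTree module shipped with Python 2.x as xml.etree.cElementTree; the glob pattern is a placeholder.)

import glob
import xml.etree.cElementTree as cElementTree

index = glob.glob('/path/to/dir/with/huge/number/of/xml/files/*xml')
for f in index:
    # The same parse-and-discard loop, but with the stdlib C parser;
    # memory usage reportedly stays flat here.
    d = cElementTree.parse(f)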
Hi, thanks for the report. Maarten van Gompel (proycon), 20.10.2011 17:26:
I stumbled across what I think is a memory leak within the lxml module. I am parsing literally millions of mostly small XML files, in sequence. In the following, simplified, fashion:
index = glob.glob('/path/to/dir/with/huge/number/of/xml/files/*xml')
for f in index:
    d = lxml.etree.parse(f)
The problem is that on (almost) every iteration, memory usage increases.
I can't reproduce this, not by repeatedly parsing the file you sent in and not with different files either. I assume that all files use the same XML formats? (i.e. the same tag names etc.) Are you using the official lxml release? Did you build it yourself or did you use the one in the distro? Could you try with the 2.3.1 release?
This becomes problematic quickly when dealing with millions of XML files.
Does it really keep increasing all the way up to the last file? (or at least up to the point where you run out of memory?)
I attach a short log excerpt in which I extracted resident memory usage from ps after each iteration and measure the increase. Note that I only parse the documents, to be overwritten each time, I don't do anything else with them in this test case.
From your log, it seems like it does allocate more memory for large files (as expected), but then doesn't give it back. That looks unusual.
Is this a known problem?
We had one similar report this year that wasn't reproducible either. It's in the archives.
Is there anything else I explicitly need to do to free the memory used?
Definitely not.
The problem does not reproduce if I reload the same document over and over again. Memory usage remains constant then. It only happens when new documents are loaded, and even then in some rare cases the problem does not occur for some or several iterations, most notably at the start of the log.
That may simply be because it already has enough memory at the start to keep the first few documents in memory, so it just doesn't show yet. It seems to be quite visibly recurrent on your side after a few iterations.

I ran the test script you sent me through valgrind (a memory analyser, amongst other things) and it came out clean:

==10062== LEAK SUMMARY:
==10062==    definitely lost: 0 bytes in 0 blocks
==10062==    indirectly lost: 0 bytes in 0 blocks
==10062==      possibly lost: 498,566 bytes in 265 blocks
==10062==    still reachable: 2,645,015 bytes in 1,709 blocks
==10062==         suppressed: 0 bytes in 0 blocks

I looked through the "possibly lost" blocks and they all look reasonable, none of them seems to be related to parsing. Basically, they are initialisation time global memory allocations that valgrind isn't completely sure about.

If you want to try it on your side, here's my command line:

valgrind --tool=memcheck --leak-check=full --num-callers=30 \
    --suppressions=lxmldir/valgrind-python.supp python lxml_leak.py '*.pos'

You can find the valgrind support file in the lxml source distribution. Valgrind is in Debian/Ubuntu.
I also attach an example of an XML file.
It's better to put this kind of files on a web server somewhere and just provide the URL when posting to a public mailing list. Not every reader is interested.
Python 2.7.2 (Ubuntu 11.10, x86_64)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
Apart from the lxml release, these are all current. I wouldn't know any particular problem with them. Stefan
Hi, Thanks for the quick response! On 10/20/2011 08:46 PM, Stefan Behnel wrote:
I can't reproduce this, not by repeatedly parsing the file you sent in and not with different files either. I assume that all files use the same XML formats? (i.e. the same tag names etc.)
Repeatedly parsing the same file indeed does not cause a problem. The problem seems to occur only when new files are parsed. If a file is loaded that was loaded before, all seems to go well, that is, you should see something like "done - good - no memory increase" for almost all files. Is that the case on the input data I gave?

I have done some more experiments and found that when I parse different XML files (I tried Wikileaks, for example, a sizable collection), it all goes well. So it seems the bug is triggered in some way by something in my input format, and other XML input is unaffected.

I have a tiny collection of files here that exhibit the problem: http://download.anaproy.nl/lxml_input.tar.gz

Using these files you should be able to reproduce the problem using my script http://download.anaproy.nl/lxml_leak.py .

I reproduced this also on another machine with slightly older versions:

Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 2, 8, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 7)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

A good comparison is to try cElementTree instead of lxml; things seem to go well then.
Are you using the official lxml release? Did you build it yourself or did you use the one in the distro? Could you try with the 2.3.1 release?
I'm using the one in Ubuntu 11.10 yes (2.3.0). I just now tried the latest 2.3.1 release and the bug persists in that version as well.
This becomes problematic quickly when dealing with millions of XML files.
Does it really keep increasing all the way up to the last file? (or at least up to the point where you run out of memory?)
I've processed (parsed and discarded) about 30000 files now and am at around 550 MB of RAM and rising. With millions of XML files this becomes a problem, while with anything less than a good couple of thousand the problem is unlikely to really affect anyone or even be noticeable. Yeah, I'll run out of memory eventually; I tend to break off the experiment before that happens, though.
I attach a short log excerpt in which I extracted resident memory usage from ps after each iteration and measure the increase. Note that I only parse the documents, to be overwritten each time, I don't do anything else with them in this test case.
From your log, it seems like it does allocate more memory for large files (as expected), but then doesn't give it back. That looks unusual.
Yes, and also note that it allocates far less memory than if I were to simply maintain all files in memory.
Is this a known problem?
We had one similar report this year that wasn't reproducible either. It's in the archives.
Hmm.. might be interesting. I'll see if I can find it.
Is there anything else I explicitly need to do to free the memory used?
Definitely not.
Good, as I thought.
The problem does not reproduce if I reload the same document over and over again. Memory usage remains constant then. It only happens when new documents are loaded, and even then in some rare cases the problem does not occur for some or several iterations, most notably at the start of the log.
That may simply be because it already has enough memory at the start to keep the first few documents in memory, so it just doesn't show yet. It seems to be quite visibly recurrent on your side after a few iterations.
Ok, that makes sense yes.
I ran the test script you sent me through valgrind (a memory analyser, amongst other things) and it came out clean:
==10062== LEAK SUMMARY:
==10062==    definitely lost: 0 bytes in 0 blocks
==10062==    indirectly lost: 0 bytes in 0 blocks
==10062==      possibly lost: 498,566 bytes in 265 blocks
==10062==    still reachable: 2,645,015 bytes in 1,709 blocks
==10062==         suppressed: 0 bytes in 0 blocks
I looked through the "possibly lost" blocks and they all look reasonable, none of them seems to be related to parsing. Basically, they are initialisation time global memory allocations that valgrind isn't completely sure about.
If you want to try it on your side, here's my command line:
valgrind --tool=memcheck --leak-check=full --num-callers=30 \
    --suppressions=lxmldir/valgrind-python.supp python lxml_leak.py '*.pos'
You can find the valgrind support file in the lxml source distribution. Valgrind is in Debian/Ubuntu.
I tried valgrind as per your instructions, but I get no leak summary for some reason, only this and stdout. I'm not too experienced with valgrind yet. No debug symbols present, perhaps?

==21963== Memcheck, a memory error detector
==21963== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==21963== Using Valgrind-3.6.1-Debian and LibVEX; rerun with -h for copyright info
==21963== Command: ./lxml_leak.py /home/proycon/exp/minisonar/dcoi/WR-P-E-J_wikipedia/*pos
==21963==

I can also supply a huge amount of input data if necessary to make debugging easier.

Regards,
Maarten van Gompel (Proycon)
Hi,

A small follow-up again: I did get valgrind to work after all. Here is my leak summary on the tiny test collection:

==22340== LEAK SUMMARY:
==22340==    definitely lost: 0 bytes in 0 blocks
==22340==    indirectly lost: 0 bytes in 0 blocks
==22340==      possibly lost: 276,408 bytes in 79 blocks
==22340==    still reachable: 2,864,995 bytes in 1,917 blocks
==22340==         suppressed: 0 bytes in 0 blocks
==22340== Reachable blocks (those to which a pointer was found) are not shown.
==22340== To see them, rerun with: --leak-check=full --show-reachable=yes

And here it is on 300 files:

==22399== LEAK SUMMARY:
==22399==    definitely lost: 0 bytes in 0 blocks
==22399==    indirectly lost: 0 bytes in 0 blocks
==22399==      possibly lost: 513,916 bytes in 344 blocks
==22399==    still reachable: 31,140,660 bytes in 175,023 blocks
==22399==         suppressed: 0 bytes in 0 blocks

I'd bet this 'still reachable' category keeps increasing indefinitely. My other experiment is still running, by the way, having processed 55000 of 249780 files now and taking almost 1 GB of RAM.

Regards,
Maarten van Gompel (Proycon)
I ran the test script you sent me through valgrind (a memory analyser, amongst other things) and it came out clean:
==10062== LEAK SUMMARY:
==10062==    definitely lost: 0 bytes in 0 blocks
==10062==    indirectly lost: 0 bytes in 0 blocks
==10062==      possibly lost: 498,566 bytes in 265 blocks
==10062==    still reachable: 2,645,015 bytes in 1,709 blocks
==10062==         suppressed: 0 bytes in 0 blocks
I looked through the "possibly lost" blocks and they all look reasonable, none of them seems to be related to parsing. Basically, they are initialisation time global memory allocations that valgrind isn't completely sure about.
If you want to try it on your side, here's my command line:
valgrind --tool=memcheck --leak-check=full --num-callers=30 \
    --suppressions=lxmldir/valgrind-python.supp python lxml_leak.py '*.pos'
Hi,

As my memory leak problem persisted, I conducted further experiments to try to determine the cause. XML files in almost all other formats processed fine without leaking, so there had to be something in my format that triggered the leak. I now found the cause!

In my format I use the xml:id attribute (in the XML namespace) to assign unique identifiers to a large number of elements. It turns out that this triggers the leak! Something related to these identifiers is being kept in memory by lxml (or libxml2?) and never freed! When I rename xml:id to id (default namespace), the memory leak problem is gone! This explanation is also consistent with my observation that whenever I load ANY document that was previously loaded already, the leak does not occur.

Some help in fixing this bug would be greatly appreciated. It seems either lxml or the underlying libxml2 somewhere keeps a list or map of xml IDs that is not freed when the document is destroyed?

Btw, with cElementTree (also built upon libxml2 if I'm not mistaken), this problem does not occur.

Regards,
Maarten van Gompel (Proycon)
On 10/27/2011 03:53 AM, Maarten van Gompel (proycon) wrote:
Btw, with cElementTree (also built upon libxml2 if I'm not mistaken), this problem does not occur.
Nope:

$ /opt/Python-2.7.2/bin/virtualenv --no-site-packages /tmp/cet
...
$ cd /tmp/cet
$ bin/easy_install -Z cElementTree
...
$ cd lib/python2.7/site-packages/cElement*
$ ldd cElementTree.so | grep xml | wc -l
0

Tres.

Tres Seaver           +1 540-429-0999          tseaver@palladion.com
Palladion Software    "Excellence by Design"   http://palladion.com
Am 27.10.2011 09:53, schrieb Maarten van Gompel (proycon):
Hi,
As my memory leak problem persisted, I conducted further experiments to try to determine the cause. XML files in almost all other formats processed fine without leaking, so there had to be something in my format that triggered the leak. I now found the cause!
In my format I use the xml:id attribute (in the XML namespace) to assign unique identifiers to a large number of elements. It turns out that this triggers the leak! Something related to these identifiers is being kept in memory by lxml (or libxml2?) and never freed! When I rename xml:id to id (default namespace), the memory leak problem is gone! This explanation is also consistent with my observation that whenever I load ANY document that was previously loaded already, the leak does not occur.
I was able to verify your diagnosis.

--- test script ---
import psutil, os
from lxml import etree

xml = """<?xml version="1.0" encoding="UTF-8"?>
<document xml:id="xmlid%i">
content
</document>
"""

etree.fromstring(xml % 0)
print psutil.Process(os.getpid()).get_memory_info()
for i in xrange(1000000):
    etree.fromstring(xml % i)
print psutil.Process(os.getpid()).get_memory_info()
---

Output with xml:id:

meminfo(rss=10280960, vms=63336448)
meminfo(rss=70262784, vms=133058560)

Output with just id:

meminfo(rss=10280960, vms=63336448)
meminfo(rss=10498048, vms=63340544)

Christian
Am 27.10.2011 09:53, schrieb Maarten van Gompel (proycon):
Hi,
As my memory leak problem persisted, I conducted further experiments to try to determine the cause. XML files in almost all other formats processed fine without leaking, so there had to be something in my format that triggered the leak. I now found the cause!
In my format I use the xml:id attribute (in the XML namespace) to assign unique identifiers to a large number of elements. It turns out that this triggers the leak! Something related to these identifiers is being kept in memory by lxml (or libxml2?) and never freed! When I rename xml:id to id (default namespace), the memory leak problem is gone! This explanation is also consistent with my observation that whenever I load ANY document that was previously loaded already, the leak does not occur.
I did some additional debugging with valgrind and found the code segment that causes the memory leak. Well, it's not a real memory leak but a feature. ;)

==10726== 3,052,864 bytes in 95,402 blocks are still reachable in loss record 1,541 of 1,541
==10726==    at 0x4C28F9F: malloc (vg_replace_malloc.c:236)
==10726==    by 0x89541CD: xmlDictLookup (in /usr/lib/libxml2.so.2.7.8)
==10726==    by 0x88B667B: xmlHashAddEntry3 (in /usr/lib/libxml2.so.2.7.8)
==10726==    by 0x88C4F13: xmlAddID (in /usr/lib/libxml2.so.2.7.8)
==10726==    by 0x895617D: xmlSAX2StartElementNs (in /usr/lib/libxml2.so.2.7.8)
==10726==    by 0x889B40F: ??? (in /usr/lib/libxml2.so.2.7.8)
==10726==    by 0x88A68CB: xmlParseElement (in /usr/lib/libxml2.so.2.7.8)
==10726==    by 0x88A7969: xmlParseDocument (in /usr/lib/libxml2.so.2.7.8)
==10726==    by 0x88A7CA4: ??? (in /usr/lib/libxml2.so.2.7.8)
==10726==    by 0x815E9E9: ??? (in /usr/lib/python2.7/dist-packages/lxml/etree.so)
==10726==    by 0x813C0F3: ??? (in /usr/lib/python2.7/dist-packages/lxml/etree.so)
==10726==    by 0x813D749: ??? (in /usr/lib/python2.7/dist-packages/lxml/etree.so)

It's in libxml2's SAX2.c, in the function xmlSAX2StartElementNs():

    /*
     * when validating, the ID registration is done at the attribute
     * validation level. Otherwise we have to do specific handling here.
     */
    if (xmlStrEqual(fullname, BAD_CAST "xml:id")) {
        /*
         * Add the xml:id value
         *
         * Open issue: normalization of the value.
         */
        if (xmlValidateNCName(value, 1) != 0) {
            xmlErrValid(ctxt, XML_DTD_XMLID_VALUE,
                        "xml:id : attribute value %s is not an NCName\n",
                        (const char *) value, NULL);
        }
        xmlAddID(&ctxt->vctxt, ctxt->myDoc, value, ret);
    } else if (xmlIsID(ctxt->myDoc, ctxt->node, ret))
        xmlAddID(&ctxt->vctxt, ctxt->myDoc, value, ret);
    else if (xmlIsRef(ctxt->myDoc, ctxt->node, ret))
        xmlAddRef(&ctxt->vctxt, ctxt->myDoc, value, ret);
    }

libxml2 keeps a reference when it finds an xml:id attribute. I don't see a way to remove the reference from lxml; the Python wrapper doesn't expose http://www.xmlsoft.org/html/libxml-valid.html#xmlRemoveID . For now you can work around the issue by removing the xml:id attribute from your document when you unload it.

Christian
Christian Heimes, 30.10.2011 14:58:
Am 27.10.2011 09:53, schrieb Maarten van Gompel (proycon):
As my memory leak problem persisted, I conducted further experiments to try to determine the cause. XML files in almost all other formats processed fine without leaking, so there had to be something in my format that triggered the leak. I now found the cause!
In my format I use the xml:id attribute (in the XML namespace) to assign unique identifiers to a large number of elements. It turns out that this triggers the leak! Something related to these identifiers is being kept in memory by lxml (or libxml2?) and never freed! When I rename xml:id to id (default namespace), the memory leak problem is gone! This explanation is also consistent with my observation that whenever I load ANY document that was previously loaded already, the leak does not occur.
I did some additional debugging with valgrind and found the code segment that causes the memory leak. Well, it's not a real memory leak but a feature. ;)
==10726== 3,052,864 bytes in 95,402 blocks are still reachable in loss record 1,541 of 1,541
==10726==    at 0x4C28F9F: malloc (vg_replace_malloc.c:236)
==10726==    by 0x89541CD: xmlDictLookup (in /usr/lib/libxml2.so.2.7.8)
==10726==    by 0x88B667B: xmlHashAddEntry3 (in /usr/lib/libxml2.so.2.7.8)
==10726==    by 0x88C4F13: xmlAddID (in /usr/lib/libxml2.so.2.7.8)
==10726==    by 0x895617D: xmlSAX2StartElementNs (in /usr/lib/libxml2.so.2.7.8)
==10726==    by 0x889B40F: ??? (in /usr/lib/libxml2.so.2.7.8)
==10726==    by 0x88A68CB: xmlParseElement (in /usr/lib/libxml2.so.2.7.8)
==10726==    by 0x88A7969: xmlParseDocument (in /usr/lib/libxml2.so.2.7.8)
==10726==    by 0x88A7CA4: ??? (in /usr/lib/libxml2.so.2.7.8)
==10726==    by 0x815E9E9: ??? (in /usr/lib/python2.7/dist-packages/lxml/etree.so)
==10726==    by 0x813C0F3: ??? (in /usr/lib/python2.7/dist-packages/lxml/etree.so)
==10726==    by 0x813D749: ??? (in /usr/lib/python2.7/dist-packages/lxml/etree.so)
It's in libxml2's SAX2.c in the function xmlSAX2StartElementNs():
    /*
     * when validating, the ID registration is done at the attribute
     * validation level. Otherwise we have to do specific handling here.
     */
    if (xmlStrEqual(fullname, BAD_CAST "xml:id")) {
        /*
         * Add the xml:id value
         *
         * Open issue: normalization of the value.
         */
        if (xmlValidateNCName(value, 1) != 0) {
            xmlErrValid(ctxt, XML_DTD_XMLID_VALUE,
                        "xml:id : attribute value %s is not an NCName\n",
                        (const char *) value, NULL);
        }
        xmlAddID(&ctxt->vctxt, ctxt->myDoc, value, ret);
    } else if (xmlIsID(ctxt->myDoc, ctxt->node, ret))
        xmlAddID(&ctxt->vctxt, ctxt->myDoc, value, ret);
    else if (xmlIsRef(ctxt->myDoc, ctxt->node, ret))
        xmlAddRef(&ctxt->vctxt, ctxt->myDoc, value, ret);
    }
libxml2 keeps a reference when it finds a xml:id attribute. I don't see a way to remove the reference from lxml. The Python wrapper doesn't expose http://www.xmlsoft.org/html/libxml-valid.html#xmlRemoveID . For now you can work around the issue by removing the xml:id attribute from your document when you unload it.
Interesting. Thanks for investigating this.

I found this code in xmlFreeDoc():

    if (cur->ids != NULL) xmlFreeIDTable((xmlIDTablePtr) cur->ids);
    cur->ids = NULL;

So there is code to free the ID table on document deallocation. But that doesn't seem to be enough to free all of the memory. Maybe there's a bug in libxml2 that leaks some additional memory here, or maybe there's something that lxml can do to free the rest as well. I don't know. I think the code in xmlAddID() is worth another look or two.

Stefan
Hi,
Christian Heimes, 30.10.2011 14:58:
I did some additional debugging with valgrind and found the code segment that causes the memory leak. Well, it's not a real memory leak but a feature. ;)
It's in libxml2's SAX2.c, in the function xmlSAX2StartElementNs():

    /*
     * when validating, the ID registration is done at the attribute
     * validation level. Otherwise we have to do specific handling here.
     */
    if (xmlStrEqual(fullname, BAD_CAST "xml:id")) {
        /*
         * Add the xml:id value
         *
         * Open issue: normalization of the value.
         */
        if (xmlValidateNCName(value, 1) != 0) {
            xmlErrValid(ctxt, XML_DTD_XMLID_VALUE,
                        "xml:id : attribute value %s is not an NCName\n",
                        (const char *) value, NULL);
        }
        xmlAddID(&ctxt->vctxt, ctxt->myDoc, value, ret);
    } else if (xmlIsID(ctxt->myDoc, ctxt->node, ret))
        xmlAddID(&ctxt->vctxt, ctxt->myDoc, value, ret);
    else if (xmlIsRef(ctxt->myDoc, ctxt->node, ret))
        xmlAddRef(&ctxt->vctxt, ctxt->myDoc, value, ret);
    }
libxml2 keeps a reference when it finds a xml:id attribute. I don't see a way to remove the reference from lxml. The Python wrapper doesn't expose http://www.xmlsoft.org/html/libxml-valid.html#xmlRemoveID . For now you can work around the issue by removing the xml:id attribute from your document when you unload it.
Thanks for investigating! Wouldn't it perhaps be an idea to explicitly expose xmlRemoveID in lxml?

I tried to unload the xml:id attributes as a workaround, but it seems the damage is already done and this doesn't free the memory either:

for element in d.xpath('//@xml:id/..'):
    del element.attrib['{http://www.w3.org/XML/1998/namespace}id']

The only workaround I see now is to actively strip xml:id prior to calling lxml, which is a bit undesirable, as I first have to load the file into memory myself, do a string replace, and then pass it to lxml.

On 10/31/2011 01:29 PM, Stefan Behnel wrote:
Interesting. Thanks for investigating this.
I found this code in xmlFreeDoc():
    if (cur->ids != NULL) xmlFreeIDTable((xmlIDTablePtr) cur->ids);
    cur->ids = NULL;
So there is code to free the ID table on document deallocation. But that doesn't seem to be enough to free all of the memory. Maybe there's a bug in libxml2 that leaks some additional memory here, or maybe there's something that lxml can do to free the rest as well. I don't know. I think the code in xmlAddID() is worth another look or two.
OK, so the issue may be within libxml2 itself and even manifest if I were to rewrite my test in C++? This seems like an important issue worth fixing.

I don't know if this is possibly relevant (from http://xmlsoft.org/xmlmem.html ), you're probably already aware of it:

***
You may encounter that your process using libxml2 does not have a reduced memory usage although you freed the trees. This is because libxml2 allocates memory in a number of small chunks. When freeing one of those chunks, the OS may decide that giving this little memory back to the kernel will cause too much overhead and delay the operation. As all chunks are this small, they get actually freed but not returned to the kernel. On systems using glibc, there is a function call "malloc_trim" from malloc.h which does this missing operation (note that it is allowed to fail). Thus, after freeing your tree you may simply try "malloc_trim(0);" to really get the memory back. If your OS does not provide malloc_trim, try searching for a similar function.
***

Regards,
Maarten van Gompel (Proycon)
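(A minimal sketch of the pre-parse stripping workaround described in the message above: read each file, crudely rewrite xml:id to a plain id attribute with a string replace so that libxml2 never registers the IDs, then hand the data to lxml. The glob pattern is a placeholder, and the blunt replace assumes the attribute is always spelled exactly "xml:id=".)

import glob
import lxml.etree

index = glob.glob('/path/to/dir/with/huge/number/of/xml/files/*xml')
for f in index:
    with open(f, 'rb') as fh:
        data = fh.read()
    # Demote xml:id to a plain id attribute so that libxml2's ID
    # registration (xmlAddID) is never triggered for these documents.
    data = data.replace(b'xml:id=', b'id=')
    d = lxml.etree.fromstring(data)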
Maarten van Gompel (proycon), 03.11.2011 11:36:
Christian Heimes, 30.10.2011 14:58:
I did some additional debugging with valgrind and found the code segment that causes the memory leak. Well, it's not a real memory leak but a feature. ;)
It's in libxml2's SAX2.c, in the function xmlSAX2StartElementNs():

    /*
     * when validating, the ID registration is done at the attribute
     * validation level. Otherwise we have to do specific handling here.
     */
    if (xmlStrEqual(fullname, BAD_CAST "xml:id")) {
        /*
         * Add the xml:id value
         *
         * Open issue: normalization of the value.
         */
        if (xmlValidateNCName(value, 1) != 0) {
            xmlErrValid(ctxt, XML_DTD_XMLID_VALUE,
                        "xml:id : attribute value %s is not an NCName\n",
                        (const char *) value, NULL);
        }
        xmlAddID(&ctxt->vctxt, ctxt->myDoc, value, ret);
    } else if (xmlIsID(ctxt->myDoc, ctxt->node, ret))
        xmlAddID(&ctxt->vctxt, ctxt->myDoc, value, ret);
    else if (xmlIsRef(ctxt->myDoc, ctxt->node, ret))
        xmlAddRef(&ctxt->vctxt, ctxt->myDoc, value, ret);
    }
libxml2 keeps a reference when it finds a xml:id attribute. I don't see a way to remove the reference from lxml. The Python wrapper doesn't expose http://www.xmlsoft.org/html/libxml-valid.html#xmlRemoveID . For now you can work around the issue by removing the xml:id attribute from your document when you unload it.
Thanks for investigating! Isn't it perhaps an idea to explicitly expose xmlRemoveID in lxml?
I'm not sure that would help (and you certainly wouldn't want that). No, the cleanup should be done at document deallocation time. I haven't figured out yet if it's a bug in libxml2 that it doesn't do it itself, or if lxml should do it... (well, lxml should *obviously* do it if possible, in order to properly support the currently released libxml2 versions...) This kind of problem requires some experimenting with libxml2's code and API, but I don't currently have much time to look into this, so if someone could dig into this deeply enough to come up with a solution, I'd be happy to apply it. The document cleanup happens in _Document.__dealloc__() in lxml.etree.pyx, and I already hinted at the relevant code in libxml2 (quoted further down). I think it's worth throwing gdb into the game.
I tried to unload the xml:id attributes as a workaround, but it seems the damage is already done and this doesn't free the memory either:
for element in d.xpath('//@xml:id/..'):
    del element.attrib['{http://www.w3.org/XML/1998/namespace}id']
The only workaround I see now is to actively strip xml:id prior to calling lxml, which is a bit undesirable, as I first have to load the file into memory myself, do a string replace, and then pass it to lxml.
Yes, that's definitely too ugly to consider a viable work-around.
On 10/31/2011 01:29 PM, Stefan Behnel wrote:
Interesting. Thanks for investigating this.
I found this code in xmlFreeDoc():
    if (cur->ids != NULL) xmlFreeIDTable((xmlIDTablePtr) cur->ids);
    cur->ids = NULL;
So there is code to free the ID table on document deallocation. But that doesn't seem to be enough to free all of the memory. Maybe there's a bug in libxml2 that leaks some additional memory here, or maybe there's something that lxml can do to free the rest as well. I don't know. I think the code in xmlAddID() is worth another look or two.
OK, so the issue may be within libxml2 itself and even manifest if I were to rewrite my test in C++?
I would expect that, yes. It might also be worth asking on the libxml2 mailing list, although responses over there aren't guaranteed to come in a timely fashion.
This seems like an important issue worth fixing.
Absolutely.
I don't know if this is possibly relevant (from http://xmlsoft.org/xmlmem.html ), you're probably already aware of it:
*** You may encounter that your process using libxml2 does not have a reduced memory usage although you freed the trees. This is because libxml2 allocates memory in a number of small chunks. When freeing one of those chunks, the OS may decide that giving this little memory back to the kernel will cause too much overhead and delay the operation. As all chunks are this small, they get actually freed but not returned to the kernel. On systems using glibc, there is a function call "malloc_trim" from malloc.h which does this missing operation (note that it is allowed to fail). Thus, after freeing your tree you may simply try "malloc_trim(0);" to really get the memory back. If your OS does not provide malloc_trim, try searching for a similar function. ***
No, I don't think that's related. I trust that Linux is pretty good at memory management. This looks like a *real* memory leak, especially since valgrind considers the memory blocks still reachable (so there must still be a pointer to them *somewhere*). Stefan
Maarten van Gompel (proycon), 27.10.2011 09:53:
As my memory leak problem persisted, I conducted further experiments to try to determine the cause. XML files in almost all other formats processed fine without leaking, so there had to be something in my format that triggered the leak. I now found the cause!
In my format I use the xml:id attribute (in the XML namespace) to assign unique identifiers to a large number of elements. It turns out that this triggers the leak! Something related to these identifiers is being kept in memory by lxml (or libxml2?) and never freed! When I rename xml:id to id (default namespace), the memory leak problem is gone! This explanation is also consistent with my observation that whenever I load ANY document that was previously loaded already, the leak does not occur.
Some help in fixing this bug would be greatly appreciated. It seems either lxml or the underlying libxml2 somewhere keeps a list or map of xml IDs that is not freed when the document is destroyed?
Ok, I debugged into this and read the sources in libxml2 a bit more. I'm pretty sure that what is happening here is the following.

1) lxml.etree uses a global hash table that stores names. This is done for performance reasons and to reduce the memory footprint of the tree, by keeping a unique version of the names of tags and attributes in the hash table. This works really well in most cases, because the number of tag/attribute/etc. names used during the lifetime of a system is almost always very limited. Most systems only process one, or maybe a couple of, XML formats.

2) when libxml2's parser parses your document, it additionally stores all ID names in the global dict. Within lxml's setting, this makes them persistent over the lifetime of the operating thread (usually the main thread). Freeing the document does properly clean up the internal ID->element references, but the global dict still keeps the ID names.

3) your documents use IDs that include their file name. This makes them globally unique, and that means that each file adds new IDs to the global dict. This adds up. Not much, given that the names are still rather short, but a large number of files adds a large number of IDs.

So, it's not a bug, it's a feature - just not in your specific case. I see two ways for you to work around this.

a) make the IDs in the documents locally unique inside of each document instead of globally unique. If you can make most IDs reoccur in multiple documents, you can take advantage of the global dictionary.

b) do the parsing in a separate thread. Separate threads have their own dictionary, as the global dictionary is not thread-safe. A separate dictionary means that the names it stores are bound to the lifetime of the thread. So, if you fire up a new parser thread every so many documents, the termination of the previous one will free the memory you see leaking.

Does this help?

Stefan
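(A minimal sketch of workaround b) above: parse batches of files in short-lived worker threads so that each thread's own name dictionary is released when the thread terminates. The batch size and the glob pattern are arbitrary placeholders.)

import glob
import threading

import lxml.etree

def parse_batch(filenames):
    # Runs in a worker thread: names interned while parsing go into this
    # thread's own dictionary and are freed when the thread terminates.
    for f in filenames:
        d = lxml.etree.parse(f)
        # ... process d here ...

index = glob.glob('/path/to/dir/with/huge/number/of/xml/files/*xml')
batch_size = 1000
for start in range(0, len(index), batch_size):
    worker = threading.Thread(target=parse_batch,
                              args=(index[start:start + batch_size],))
    worker.start()
    worker.join()  # one parser thread at a time; its dict dies with it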
Hi,

A solution I found for this problem was:

import mymodule.parser_file

for f in index:
    d = parser_file.MyParser.parse(f)
    # Now do everything you want to do with the parsed information
    reload(mymodule.parser_file)

The Python built-in reload() for modules seems to clean up the references held by the parser.

Greetings,
Hoka
Kai Hoppert, 03.05.2013 11:06:
a solution i found for solving this problem was
import mymodule.parser_file
for f in index: d = parser_file.MyParser.parse(f)
#Now do everything you want to do with the parsed information
reload(mymodule.parser_file)
The Python built-in reload() for modules seems to clean up the references held by the parser.
I can't see why it should. Could you make sure you are talking about the same issue as discussed in this thread? Stefan
On 03/05/2013 12:32, Stefan Behnel wrote:
Kai Hoppert, 03.05.2013 11:06:
a solution i found for solving this problem was
import mymodule.parser_file
for f in index: d = parser_file.MyParser.parse(f)
#Now do everything you want to do with the parsed information
reload(mymodule.parser_file)
The Python built-in reload() for modules seems to clean up the references held by the parser.

I can't see why it should. Could you make sure you are talking about the same issue as discussed in this thread?
I don’t have the beginning of the thread, but it looks like the module reloading is only a convoluted way to release the last reference to the parser object. So the question is: does parsing many files with the same parser object leak memory, compared to creating a parser object every time and releasing it? (Of course, having the rest of the code would help.) Cheers, -- Simon Sapin
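(A hedged sketch of the two variants Simon contrasts, assuming plain lxml.etree usage; the thread doesn't show the original code, so the glob pattern is a placeholder.)

import glob
import lxml.etree as etree

index = glob.glob('/path/to/xml/files/*.xml')

# Variant 1: repeated etree.parse() calls in the same thread reuse the
# default parser behind the scenes.
for f in index:
    d = etree.parse(f)

# Variant 2: a fresh parser object per file, dropped after each iteration.
for f in index:
    parser = etree.XMLParser()
    d = etree.parse(f, parser)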
Simon Sapin, 03.05.2013 13:16:
On 03/05/2013 12:32, Stefan Behnel wrote:
Kai Hoppert, 03.05.2013 11:06:
a solution i found for solving this problem was
import mymodule.parser_file
for f in index: d = parser_file.MyParser.parse(f)
#Now do everything you want to do with the parsed information
reload(mymodule.parser_file)
The Python built-in reload() for modules seems to clean up the references held by the parser.

I can't see why it should. Could you make sure you are talking about the same issue as discussed in this thread?
I don’t have the beginning of the thread
It started in October 2011. And as I suggested, it looks unrelated.
but it looks like the module reloading is only a convoluted way to release the last reference to the parser object.
So the question is: does parsing many files with the same parser object leak memory, compared to creating a parser object every time and releasing it? (Of course, having the rest of the code would help.)
If you repeatedly call parse() in the main thread, you'll always get the same parser. So, no, there's nothing special about reusing parsers. Stefan
Participants (6):

- Christian Heimes
- Kai Hoppert
- Maarten van Gompel (proycon)
- Simon Sapin
- Stefan Behnel
- Tres Seaver