Hi, thanks for the report.

Maarten van Gompel (proycon), 20.10.2011 17:26:
I stumbled across what I think is a memory leak in the lxml module. I am parsing literally millions of mostly small XML files, in sequence, in the following simplified fashion:
import glob
import lxml.etree

index = glob.glob('/path/to/dir/with/huge/number/of/xml/files/*xml')
for f in index:
    d = lxml.etree.parse(f)
The problem is that memory usage increases with (almost) every iteration.
I can't reproduce this, neither by repeatedly parsing the file you sent in nor with different files. I assume that all files use the same XML format? (i.e. the same tag names etc.) Are you using the official lxml release? Did you build it yourself or did you use the one in your distro? Could you try with the 2.3.1 release?
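If you do upgrade, you can re-check which versions you are actually running with something along these lines (the constants are attributes of lxml.etree):

import sys
from lxml import etree

print('Python           : %s.%s.%s' % sys.version_info[:3])
print('lxml.etree       : %s' % (etree.LXML_VERSION,))
print('libxml used      : %s' % (etree.LIBXML_VERSION,))
print('libxml compiled  : %s' % (etree.LIBXML_COMPILED_VERSION,))
print('libxslt used     : %s' % (etree.LIBXSLT_VERSION,))
print('libxslt compiled : %s' % (etree.LIBXSLT_COMPILED_VERSION,))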
This becomes problematic quickly when dealing with millions of XML files.
Does it really keep increasing all the way up to the last file? (or at least up to the point where you run out of memory?)
I attach a short log excerpt in which I extracted the resident memory usage from ps after each iteration and measured the increase. Note that in this test case I only parse the documents, overwriting the reference each time; I don't do anything else with them.
From your log, it seems like it does allocate more memory for large files (as expected), but then doesn't give it back. That looks unusual.
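If you want to cross-check the numbers independently of ps, here is a minimal sketch of a per-iteration measurement; note that /proc/self/statm is Linux-specific and the directory pattern is a placeholder:

import glob
import os

import lxml.etree

PAGE_SIZE = os.sysconf('SC_PAGE_SIZE')  # bytes per memory page (Linux)

def rss_bytes():
    # The second field of /proc/self/statm is the resident set size in pages.
    with open('/proc/self/statm') as f:
        return int(f.read().split()[1]) * PAGE_SIZE

previous = rss_bytes()
for name in glob.glob('/path/to/xml/files/*.xml'):  # placeholder path
    doc = lxml.etree.parse(name)
    current = rss_bytes()
    print('%s: %.1f MiB (%+.1f KiB)' % (
        name, current / (1024.0 * 1024.0), (current - previous) / 1024.0))
    previous = current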
Is this a known problem?
We had one similar report this year that wasn't reproducible either. It's in the archives.
Is there anything else I explicitly need to do to free the memory used?
Definitely not.
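Just to make that concrete, the following sketch (with a hypothetical file name) is all there is to it; once the last reference to the parsed tree goes away, and no element proxies from it are still alive, the underlying libxml2 document is freed as well:

import lxml.etree

doc = lxml.etree.parse('example.xml')  # hypothetical file name
# ... work with doc ...
doc = None  # dropping the last reference releases the libxml2 document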
The problem does not reproduce if I reload the same document over and over again. Memory usage remains constant then. It only happens when new documents are loaded, and even then, in some rare cases, the problem does not occur for one or several iterations, most notably at the start of the log.
That may simply be because it already has enough memory at the start to keep the first few documents in memory, so it just doesn't show yet. It seems to be quite visibly recurrent on your side after a few iterations.

I ran the test script you sent me through valgrind (a memory analyser, amongst other things) and it came out clean:

==10062== LEAK SUMMARY:
==10062==    definitely lost: 0 bytes in 0 blocks
==10062==    indirectly lost: 0 bytes in 0 blocks
==10062==      possibly lost: 498,566 bytes in 265 blocks
==10062==    still reachable: 2,645,015 bytes in 1,709 blocks
==10062==         suppressed: 0 bytes in 0 blocks

I looked through the "possibly lost" blocks and they all look reasonable; none of them seems to be related to parsing. Basically, they are initialisation-time global memory allocations that valgrind isn't completely sure about.

If you want to try it on your side, here's my command line:

valgrind --tool=memcheck --leak-check=full --num-callers=30 \
    --suppressions=lxmldir/valgrind-python.supp python lxml_leak.py '*.pos'

You can find the valgrind suppressions file in the lxml source distribution. Valgrind is in Debian/Ubuntu.
I also attach an example of an XML file.
It's better to put this kind of file on a web server somewhere and just provide the URL when posting to a public mailing list. Not every reader is interested.
Python 2.7.2 (ubuntu 11.10, x86_64)
lxml.etree       : (2, 3, 0, 0)
libxml used      : (2, 7, 8)
libxml compiled  : (2, 7, 8)
libxslt used     : (1, 1, 26)
libxslt compiled : (1, 1, 26)
Apart from the lxml release, these are all current. I wouldn't know of any particular problem with them.

Stefan