Thanks for the quick response!
On 10/20/2011 08:46 PM, Stefan Behnel wrote:
I can't reproduce this, neither by repeatedly parsing the file you sent nor with different files. I assume that all files use the same XML format? (i.e. the same tag names etc.)
Repeatedly parsing the same file indeed does not cause a problem. The problem only seems to occur when new files are parsed. If a file is loaded that was loaded before, all seems to go well; that is, you should see something like "done - good - no memory increase" for almost all files. Is that the case with the input data I gave?
I have done some more experiments and found that when I parse different XML files (I tried wikileaks for example, a sizable collection), it all goes well. So it seems the bug is triggered in some way by something in my input format and other XML input is unaffected. I have a tiny collection of files here that exhibit the problem:
Using these files you should be able to reproduce the problem using my script http://download.anaproy.nl/lxml_leak.py . I reproduced this also on another machine with slightly older versions:
Python           : (2, 6, 6, 'final', 0)
lxml.etree       : (2, 2, 8, 0)
libxml used      : (2, 7, 8)
libxml compiled  : (2, 7, 7)
libxslt used     : (1, 1, 26)
libxslt compiled : (1, 1, 26)
A good comparison is to try cElementTree instead of lxml; things seem to go well then.
Are you using the official lxml release? Did you build it yourself or did you use the one in the distro? Could you try with the 2.3.1 release?
I'm using the one in Ubuntu 11.10 yes (2.3.0). I just now tried the latest 2.3.1 release and the bug persists in that version as well.
This becomes problematic quickly when dealing with millions of XML files.
Does it really keep increasing all the way up to the last file? (or at least up to the point where you run out of memory?)
I've processed (parsed and discarded) about 30000 files now and am at around 550 RAM and rising. With millions of XML files this becomes a problem; with anything less than a good couple of thousand, it is unlikely to affect anyone or even be noticeable.
Yeah, I'll run out of memory eventually, I tend to break off the experiment before that happens though.
I attach a short log excerpt in which I extracted resident memory usage from ps after each iteration and measured the increase. Note that I only parse the documents, overwriting the previous one each time; I don't do anything else with them in this test case.
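For anyone who wants to produce a comparable log, here is a minimal sketch of that kind of measurement loop. It is not the actual lxml_leak.py: the names rss_kb and track_growth are made up for illustration, it reads /proc/self/status instead of calling ps (so Linux only), and it is written for a current Python 3 with the standard-library ElementTree as the default parser.

```python
import xml.etree.ElementTree as ET


def rss_kb():
    """Resident set size of this process in kB, read from /proc (Linux only)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise RuntimeError("VmRSS not found in /proc/self/status")


def track_growth(paths, parse=ET.parse, read_rss=rss_kb):
    """Parse each file, discard the tree, and record the RSS change per file.

    `parse` and `read_rss` are injectable so the same loop can be run with
    another parser or another way of measuring memory.
    """
    deltas = []
    before = read_rss()
    for path in paths:
        tree = parse(path)  # the parsed document is the only thing created...
        del tree            # ...and it is dropped again straight away
        after = read_rss()
        deltas.append((path, after - before))
        before = after
    return deltas
```

Passing parse=lxml.etree.parse (with lxml installed) versus leaving the default gives the lxml/ElementTree comparison on the same list of files, e.g. track_growth(sorted(glob.glob('*.pos')), parse=lxml.etree.parse).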
From your log, it seems like it does allocate more memory for large files (as expected), but then doesn't give it back. That looks unusual.
Yes, and also note that it allocates far less memory than if I were to simply maintain all files in memory.
Is this a known problem?
We had one similar report this year that wasn't reproducible either. It's in the archives.
Hmm.. might be interesting. I'll see if I can find it.
Is there anything else I explicitly need to do to free the memory used?
Good, as I thought.
The problem does not reproduce if I reload the same document over and over again. Memory usage remains constant then. It only happens when new documents are loaded, and even then in some rare cases the problem does not occur for one or several iterations, most notably at the start of the log.
That may simply be because it already has enough memory at the start to keep the first few documents in memory, so it just doesn't show yet. It seems to be quite visibly recurrent on your side after a few iterations.
Ok, that makes sense yes.
I ran the test script you sent me through valgrind (a memory analyser, amongst other things) and it came out clean:
==10062== LEAK SUMMARY:
==10062==    definitely lost: 0 bytes in 0 blocks
==10062==    indirectly lost: 0 bytes in 0 blocks
==10062==      possibly lost: 498,566 bytes in 265 blocks
==10062==    still reachable: 2,645,015 bytes in 1,709 blocks
==10062==         suppressed: 0 bytes in 0 blocks
I looked through the "possibly lost" blocks and they all look reasonable, none of them seems to be related to parsing. Basically, they are initialisation time global memory allocations that valgrind isn't completely sure about.
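To compare such summaries between runs (for instance with a different number of input files), the figures can be pulled out of the valgrind output with a few lines of Python. This is only a sketch: parse_leak_summary is a hypothetical helper, and the regular expression assumes the line layout shown in the summary above.

```python
import re

# Matches lines such as "==10062==   possibly lost: 498,566 bytes in 265 blocks".
# The "LEAK SUMMARY:" header itself is skipped because it is upper case.
_LINE = re.compile(r"==\d+==\s+([a-z ]+?):\s+([\d,]+) bytes in ([\d,]+) blocks")


def parse_leak_summary(text):
    """Return {category: (bytes, blocks)} for each summary line found in text."""
    summary = {}
    for category, nbytes, nblocks in _LINE.findall(text):
        summary[category] = (
            int(nbytes.replace(",", "")),
            int(nblocks.replace(",", "")),
        )
    return summary
```

If "definitely lost" stays at zero while "still reachable" grows with the number of files parsed, that would point at memory that is still referenced somewhere rather than leaked in the strict sense.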
If you want to try it on your side, here's my command line:
valgrind --tool=memcheck --leak-check=full --num-callers=30 \
    --suppressions=lxmldir/valgrind-python.supp python lxml_leak.py '*.pos'
You can find the valgrind support file in the lxml source distribution. Valgrind is in Debian/Ubuntu.
I tried valgrind as per your instructions, but I get no leak summary for some reason, only this and stdout. I'm not too experienced with valgrind yet. No debug markers present perhaps?
==21963== Memcheck, a memory error detector
==21963== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==21963== Using Valgrind-3.6.1-Debian and LibVEX; rerun with -h for copyright info
==21963== Command: ./lxml_leak.py /home/proycon/exp/minisonar/dcoi/WR-P-E-J_wikipedia/*pos
==21963==
I can also supply a huge amount of input data if necessary to more easily debug.