Re: [lxml] Memory leak when parsing XML files in sequence?

20 Oct 2011

Hi,

Thanks for the quick response!

On 10/20/2011 08:46 PM, Stefan Behnel wrote:
...
I can't reproduce this, not by repeatedly parsing the file you sent in and
not with different files either. I assume that all files use the same XML
formats? (i.e. the same tag names etc.)
Repeatedly parsing the same file indeed does not cause a problem. The 
problem seems to occur only when new files are parsed. If a file is 
loaded that was loaded before, all seems to go well; that is, you should 
see something like "done - good - no memory increase" for almost all 
files. Is that the case on the input data I gave?

I have done some more experiments and found that when I parse different 
XML files (I tried wikileaks for example, a sizable collection), it all 
goes well. So it seems the bug is triggered in some way by something in 
my input format and other XML input is unaffected. I have a tiny 
collection of files here that exhibit the problem:

http://download.anaproy.nl/lxml_input.tar.gz

Using these files you should be able to reproduce the problem using my 
script http://download.anaproy.nl/lxml_leak.py . I reproduced this also 
on another machine with slightly older versions:

Python              : (2, 6, 6, 'final', 0)
lxml.etree          : (2, 2, 8, 0)
libxml used         : (2, 7, 8)
libxml compiled     : (2, 7, 7)
libxslt used        : (1, 1, 26)
libxslt compiled    : (1, 1, 26)
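
For anyone who wants to report their own versions in the same layout, here is a minimal sketch. The helper names collect_versions and format_versions are made up for illustration; the version attributes are the ones lxml.etree exposes, and the lxml lines are simply skipped if lxml is not installed.

```python
import sys


def collect_versions():
    """Return (label, version tuple) pairs for Python and, if present, lxml."""
    versions = [("Python", tuple(sys.version_info))]
    try:
        from lxml import etree
    except ImportError:
        # Assumption: without lxml we only report the interpreter version.
        return versions
    versions.extend([
        ("lxml.etree", etree.LXML_VERSION),
        ("libxml used", etree.LIBXML_VERSION),
        ("libxml compiled", etree.LIBXML_COMPILED_VERSION),
        ("libxslt used", etree.LIBXSLT_VERSION),
        ("libxslt compiled", etree.LIBXSLT_COMPILED_VERSION),
    ])
    return versions


def format_versions(versions):
    """Format each pair as a left-aligned label followed by the tuple."""
    return ["{:<20}: {}".format(label, value) for label, value in versions]


for line in format_versions(collect_versions()):
    print(line)
```

A difference between the "used" and "compiled" lines, as in the listing above, means lxml was built against one libxml2 version and is running against another.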

A good comparison is to try cElementTree instead of lxml, things seem to 
go well then.
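
To make that comparison easy to repeat, here is a rough sketch of a parse-and-discard loop with a switchable parser. It is not the lxml_leak.py script itself; load_parser and parse_all are hypothetical names, and the sketch assumes both modules offer a compatible parse() function. On current Python versions the C implementation that used to be cElementTree is used by xml.etree.ElementTree automatically.

```python
import glob


def load_parser(name="lxml"):
    """Return a module that provides an ElementTree-style parse()."""
    if name == "lxml":
        try:
            from lxml import etree
            return etree
        except ImportError:
            pass  # lxml not installed: fall back to the standard library
    # Any other name selects the standard library implementation.
    import xml.etree.ElementTree as etree
    return etree


def parse_all(pattern, parser_name="lxml"):
    """Parse every file matching pattern, keeping no tree around."""
    etree = load_parser(parser_name)
    count = 0
    for filename in sorted(glob.glob(pattern)):
        # The previous tree is overwritten here, so it becomes garbage.
        doc = etree.parse(filename)
        count += 1
    return count
```

Running parse_all("*.pos", "lxml") and parse_all("*.pos", "stdlib") in separate processes and watching the resident memory of each gives the comparison described above.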
...
Are you using the official lxml release? Did you build it yourself or did
you use the one in the distro? Could you try with the 2.3.1 release?
Yes, I'm using the one in Ubuntu 11.10 (2.3.0). I just now tried the 
latest 2.3.1 release and the bug persists in that version as well.
...
...
This becomes problematic quickly when dealing with millions of XML 
files.
Does it really keep increasing all the way up to the last file? (or at
least up to the point where you run out of memory?)
I've processed (parsed and discarded) about 30000 files now and am 
around 550 RAM and rising. With millions of XML files, this becomes a 
problem; with anything fewer than a couple of thousand, it is unlikely 
to affect anyone or even be noticeable.

Yeah, I'll run out of memory eventually; I tend to break off the 
experiment before that happens, though.
...
...
I attach a short log excerpt in which I extracted resident memory usage
from ps after each iteration and measured the increase. Note that I only
parse the documents, overwriting the previous one each time; I don't do
anything else with them in this test case.
From your log, it seems like it does allocate more memory for large files
(as expected), but then doesn't give it back. That looks unusual.
Yes, and also note that it allocates far less memory than if I were to 
simply maintain all files in memory.
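
The kind of measurement discussed here (resident memory read after each iteration, with the increase logged) can be sketched roughly as follows. This is an illustration under assumptions, not the code that produced the attached log: resident_kb and track are made-up names, and the sketch reads /proc/self/status where available and otherwise falls back to asking ps.

```python
import os
import subprocess


def resident_kb():
    """Resident set size of the current process in kilobytes."""
    try:
        # Linux: the kernel reports VmRSS in kB in /proc/self/status.
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except IOError:
        pass
    # Fallback: ask ps for the RSS column of our own process.
    out = subprocess.check_output(
        ["ps", "-o", "rss=", "-p", str(os.getpid())])
    return int(out.decode().strip())


def track(items, work):
    """Call work(item) per item and yield (item, rss_kb, increase_kb)."""
    previous = resident_kb()
    for item in items:
        work(item)
        current = resident_kb()
        yield item, current, current - previous
        previous = current
```

With work set to a function that parses one file and throws the tree away, a steadily positive increase over many different files is the pattern reported in this thread. Note that allocators often keep freed memory for reuse, so a few increases on their own do not prove a leak.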
...
...
Is this a known problem?
We had one similar report this year that wasn't reproducible either. It's
in the archives.
Hmm.. might be interesting. I'll see if I can find it.
...
...
Is there anything else I explicitly need to do to free the memory used?
Definitely not.
Good, as I thought.
...
...
The problem does not reproduce if I reload the same document over and over
again. Memory usage remains constant then. It only happens when new
documents are loaded, and even then in some rare cases the problem does not
occur for some or several iterations, most notably at the start of the log.
That may simply be because it already has enough memory at the start to
keep the first few documents in memory, so it just doesn't show yet. It
seems to be quite visibly recurrent on your side after a few iterations.
Ok, that makes sense yes.
...
I ran the test script you sent me through valgrind (a memory analyser,
amongst other things) and it came out clean:
==10062== LEAK SUMMARY:
==10062==    definitely lost: 0 bytes in 0 blocks
==10062==    indirectly lost: 0 bytes in 0 blocks
==10062==      possibly lost: 498,566 bytes in 265 blocks
==10062==    still reachable: 2,645,015 bytes in 1,709 blocks
==10062==         suppressed: 0 bytes in 0 blocks
I looked through the "possibly lost" blocks and they all look reasonable,
none of them seems to be related to parsing. Basically, they are
initialisation time global memory allocations that valgrind isn't
completely sure about.
If you want to try it on your side, here's my command line:
valgrind --tool=memcheck --leak-check=full --num-callers=30 \
    --suppressions=lxmldir/valgrind-python.supp python lxml_leak.py '*.pos'
You can find the valgrind support file in the lxml source distribution.
Valgrind is in Debian/Ubuntu.
I tried valgrind as per your instructions, but I get no leak summary for 
some reason, only this and stdout. I'm not too experienced with valgrind 
yet. No debug markers present perhaps?

==21963== Memcheck, a memory error detector
==21963== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==21963== Using Valgrind-3.6.1-Debian and LibVEX; rerun with -h for copyright info
==21963== Command: ./lxml_leak.py /home/proycon/exp/minisonar/dcoi/WR-P-E-J_wikipedia/*pos
==21963==

I can also supply a huge amount of input data if that would make 
debugging easier.

Regards,

-- 

Maarten van Gompel (Proycon)

E-mail:         proycon@anaproy.nl
Homepage:       http://proycon.anaproy.nl

Google+:        https://plus.google.com/105334152965507305708
Facebook:       http://facebook.com/proycon
Twitter:        http://twitter.com/proycon
