On 27.10.2011 09:53, Maarten van Gompel (proycon) wrote:
As my memory leak problem persisted, I conducted further experiments to try to determine the cause. XML files in almost all other formats were processed fine without leaking, so there had to be something in my format that triggered the leak. I have now found the cause!
In my format I use the xml:id attribute (in the XML namespace) to assign unique identifiers to a large number of elements. It turns out that this triggers the leak! Something related to these identifiers is being kept in memory by lxml (or libxml2?) and never freed! When I rename xml:id to id (default namespace), the memory leak problem is gone! This explanation is also consistent with my observation that whenever I load ANY document that was previously loaded already, the leak does not occur.
I was able to verify your diagnosis.
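--- test script ---
import psutil, os
from lxml import etree

xml = """<?xml version="1.0" encoding="UTF-8"?>
<document xml:id="xmlid%i">
content
</document>
"""

# warm-up parse, then the baseline reading (first meminfo line in the output)
etree.fromstring(xml % 0)
print psutil.Process(os.getpid()).get_memory_info()

# parse one million documents, each with a different ID value
for i in xrange(1000000):
    etree.fromstring(xml % i)

# final reading (second meminfo line in the output)
print psutil.Process(os.getpid()).get_memory_info()
---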
Output with xml:id:
meminfo(rss=10280960, vms=63336448)
meminfo(rss=70262784, vms=133058560)

Output with just id:
meminfo(rss=10280960, vms=63336448)
meminfo(rss=10498048, vms=63340544)
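
To put the two results in proportion, here is a rough back-of-the-envelope calculation using only the rss figures quoted above. It is a sketch, not a measurement: the helper name rss_growth_per_parse is made up for illustration, and RSS is a coarse, page-granular number, so the per-parse averages are only indicative.

```python
# rss figures (in bytes) copied from the two runs above: (before, after)
RUNS = {
    "xml:id": (10280960, 70262784),
    "id": (10280960, 10498048),
}
PARSES = 1000000  # loop iterations in the test script


def rss_growth_per_parse(before, after, parses=PARSES):
    """Average growth of resident memory, in bytes, per parsed document."""
    return (after - before) / float(parses)


for attr, (before, after) in sorted(RUNS.items()):
    growth = rss_growth_per_parse(before, after)
    print("%-7s %.2f bytes per parse" % (attr, growth))
```

With xml:id this comes to roughly 60 bytes per parsed document, against well under one byte per document with a plain id attribute. That would be consistent with a small amount of data being retained for each distinct identifier, although these numbers alone cannot show where it is held.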