
Maarten van Gompel (proycon), 27.10.2011 09:53:
As my memory leak problem persisted, I conducted further experiments to try to pin down the cause. XML files in almost all other formats processed fine without leaking, so there had to be something in my format that triggered the leak. I have now found the cause!
In my format I use the xml:id attribute (in the XML namespace) to assign unique identifiers to a large number of elements. It turns out that this is what triggers the leak! Something related to these identifiers is kept in memory by lxml (or libxml2?) and never freed. When I rename xml:id to a plain id attribute (no namespace), the memory leak is gone! This also fits my earlier observation that reloading ANY document that was previously loaded does not leak.
Any help in fixing this bug would be greatly appreciated. It seems that either lxml or the underlying libxml2 keeps some list or map of XML IDs that is not freed when the document is destroyed?
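For reference, here is a minimal sketch of the pattern that triggers the growth for me (the element and ID names here are made up; in my real data the IDs are derived from the file name):

    from lxml import etree

    # Parse many small documents, each carrying globally unique
    # xml:id values. Resident memory keeps growing; with plain id
    # attributes instead of xml:id it stays flat.
    for docnum in range(100000):
        xml = ('<doc><w xml:id="doc%d.w1"/><w xml:id="doc%d.w2"/></doc>'
               % (docnum, docnum))
        root = etree.fromstring(xml)
        del root  # freeing the tree does not release the ID names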
Ok, I debugged into this and read the libxml2 sources a bit more. I'm pretty sure that what is happening here is the following.
1) lxml.etree uses a global hash table that stores names. This is done for performance reasons and to reduce the memory footprint of the tree: only one copy of each tag and attribute name is kept, in the hash table. This works really well in most cases, because the number of tag/attribute/etc. names used during the lifetime of a system is almost always very limited. Most systems only process one, or maybe a couple of, XML formats.
2) when libxml2's parser parses your document, it additionally stores all ID names in that global dict. In lxml's setup, this makes them persistent over the lifetime of the thread doing the work (usually the main thread). Freeing the document does properly clean up the internal ID->element references, but the global dict still keeps the ID names.
3) your documents use IDs that include their file name. This makes them globally unique and that means that each file adds new IDs to the global dict. This adds up. Not much, given that the names are still rather short, but a large number of files adds a large number of IDs.
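As a rough analogy in plain Python (this is not the actual libxml2 code, which uses a C hash table), the dict behaves like a process-wide intern table that deduplicates names but never evicts them:

    # Rough analogy only, not libxml2's implementation.
    _name_dict = {}

    def intern_name(name):
        # return the single shared copy of this name, adding it
        # on first sight; entries are never removed
        return _name_dict.setdefault(name, name)

    intern_name("sentence")
    intern_name("sentence")           # no new entry
    # globally unique IDs add one entry per document:
    for docnum in range(1000):
        intern_name("doc%d.s1" % docnum)
    print(len(_name_dict))            # grows with distinct names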
So, it's not a bug, it's a feature - just not in your specific case.
I see two ways for you to work around this.
a) make the IDs in the documents locally unique within each document instead of globally unique, e.g. reuse plain IDs like "s1" or "w23" in every document rather than prefixing them with the file name. If you can make most IDs reoccur in multiple documents, the global dictionary works in your favour.
b) do the parsing in a separate thread. Separate threads have their own dictionary, as the global dictionary is not thread-safe. A separate dictionary means that the names it stores are bound to the lifetime of the thread. So, if you fire up a new parser thread every so many documents, the termination of the previous one will free the memory you see leaking (see the sketch below).
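Untested, but something along these lines should do; the batch size and the processing step are placeholders for you to fill in:

    import threading
    from lxml import etree

    BATCH_SIZE = 1000  # placeholder; tune to your workload

    def parse_batch(paths):
        # this thread gets its own libxml2 name dictionary,
        # which is released when the thread terminates
        for path in paths:
            tree = etree.parse(path)
            # ... process tree here ...

    def process_all(all_paths):
        for start in range(0, len(all_paths), BATCH_SIZE):
            t = threading.Thread(
                target=parse_batch,
                args=(all_paths[start:start + BATCH_SIZE],))
            t.start()
            t.join()  # thread ends here, taking its dict with it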
Does this help?
Stefan