Possible memory leak in xpath()
I think I've found a memory leak in lxml, and I'm wondering if anyone can confirm whether this is a problem. I've created a simple script to demonstrate it below.

Calling the xpath() method on a document seems to leave something in memory if (and only if) the method is called within a thread. Even when the thread completes, the memory doesn't appear to be freed (it's not much, but it accumulates over time). You can see this using "ps" and looking at the memory usage for the Python script.

I'm using Ubuntu 12.04, and this happens with the default Python 2.7.3, lxml 2.3.2, libxml2 2.7.8 and libxslt 1.1.26. I tried upgrading to lxml 3.2.3 but the same thing seems to happen.

I'm about to attempt upgrading the libxml2 package too, but I'd appreciate any suggestions/help - even if someone could just confirm whether the same problem happens on their system, that would be good.

Thanks!

    import lxml.etree
    import thread
    import time

    def test():
        doc = lxml.etree.fromstring("<Root></Root>")
        doc.xpath("/Root")  # This line seems to cause memory usage to go up only when used in a thread
        # doc.getchildren()  # If this line is used instead the memory usage stays constant

    for i in range(100000):
        thread.start_new_thread(test, ())
        # test()  # Using this line instead of starting a thread causes the memory usage to stay constant

    time.sleep(1)  # Give plenty of time for the thread to complete
    raw_input("finished")
I've tried the same thing using Python 3.2, since that uses newer versions of lxml and libxml2, and the same problem exists. In fact, I think I've traced the problem to lxml.etree.XPath when it's compiling an XPath expression. Even disabling smart_strings has no effect.

I've also read http://lxml.de/FAQ.html#id1 but I don't think any of that applies, as I'm not running the threads concurrently; I'm just using a thread (in this example anyway) and it's keeping something in memory.

I've also been looking at the error log that lxml seems to store, but I don't think that applies since there should be no errors in this case (and it only seems to store ~100 messages, whereas in my example the memory just keeps on growing).

Could anyone confirm if this is really a bug?

Thanks,

    from lxml import etree
    import threading
    import time

    def test():
        etree.XPath("/Root", smart_strings=False)

    for i in range(100000):
        threading.Thread(target=test).start()
        # test()  # Using this line instead of starting a thread causes the memory usage to stay constant

    time.sleep(1)  # Give plenty of time for the thread to complete
    raw_input("finished")  # on Python 3, use input() instead
2013/11/6 Brian Bird <Brian.Bird@securetrading.com>
Could anyone confirm if this is really a bug?
There is a memory leak indeed; I think I identified it as coming from the initXPathParserDict function (in parser.pxi). The (libxml) reference count of pctxt.dict is a bit difficult to follow, but when I added a call to xmlparser.xmlDictFree(pctxt.dict), the memory leak disappeared.

--
Amaury Forgeot d'Arc
Thanks - at least it confirms it. I was playing with Heapy to find a memory leak and it suggested something in parser.pxi, but that's as far as I got.

Are you a maintainer of lxml, and can you look at this? Or should I file a bug report?

Thanks
Brian Bird, 06.11.2013 16:01:
Could anyone confirm if this is really a bug?
Definitely. Thanks for investigating it, and sorry for not responding earlier. Your short test code makes this very easy to reproduce, and Amaury's hint at the XPath parser dict should make it easy to track down the problem in the code.

Stefan
Stefan Behnel, 06.11.2013 18:36:
Definitely. Thanks for investigating it, and sorry for not responding earlier. Your short test code makes this very easy to reproduce, and Amaury's hint at the XPath parser dict should make it easy to track down the problem in the code.
https://github.com/lxml/lxml/commit/f7d2682a511253445c128137f205bfb4d6973cbb

It turned out that the parser dict setup risked using the wrong dict anyway, so the safest fix was to not use a dictionary at all.

Thanks again for analysing this bug.

Stefan
On 6 Nov 2013, at 18:25, Stefan Behnel <stefan_ml@behnel.de> wrote:
https://github.com/lxml/lxml/commit/f7d2682a511253445c128137f205bfb4d6973cbb
It turned out that the parser dict setup risked using the wrong dict anyway, so the safest fix was to not use a dictionary at all.
Would this be likely to cause a memory leak in non-threaded code? I've been investigating a memory leak in an application that makes very heavy use of lxml, and I would be delighted if this was the cause.

Thanks
Ed
Ed Singleton, 06.11.2013 19:49:
Would this be likely to cause a memory leak in non-threaded code?
I don't think so. Single-threaded code should always use the same (global) dictionary, so it can't leak memory more than once. The leak here seemed to be a couple of bytes per thread; that's so tiny that you wouldn't even notice it with only one thread.
I've been investigating a memory leak in an application that makes very heavy use of lxml, and I would be delighted if this was the cause.
As you've seen, the best way to get it fixed is to invest the time to strip it down to just a couple of operations. That's work, sure, but someone has to invest it.

You should also make sure that you are using the latest libxml2/libxslt.

And you could run your program through valgrind, which can detect memory that doesn't get freed. The Makefile in lxml's sources has an example command line.

Stefan
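The point above about a single global dictionary versus one dictionary per thread can be illustrated with a toy model. This is only a sketch and not lxml or libxml2 code: `get_name_dict`, `compile_expression` and `count_leaks` are hypothetical names, and the `leaked` list merely stands in for native memory that nobody frees once a thread has exited.

```python
import threading

# Toy model only: none of this is lxml or libxml2 code. `leaked` stands in
# for native memory that is allocated once per thread and never freed.
leaked = []
_local = threading.local()


def get_name_dict():
    """Return the calling thread's dictionary, creating it on first use."""
    names = getattr(_local, "names", None)
    if names is None:
        names = {}
        _local.names = names
        leaked.append(names)  # models the allocation that nobody releases
    return names


def compile_expression(expr):
    """Hypothetical stand-in for compiling an XPath expression."""
    get_name_dict()[expr] = True


def count_leaks(calls, threaded):
    """Run `calls` compilations and report how many dictionaries leaked."""
    leaked.clear()
    if hasattr(_local, "names"):
        del _local.names  # forget any dictionary the calling thread already has
    for _ in range(calls):
        if threaded:
            worker = threading.Thread(target=compile_expression, args=("/Root",))
            worker.start()
            worker.join()  # one thread at a time, nothing runs concurrently
        else:
            compile_expression("/Root")
    return len(leaked)


print(count_leaks(1000, threaded=False))  # 1
print(count_leaks(1000, threaded=True))   # 1000
```

In this model the main thread creates its dictionary once and reuses it, so at most one allocation is ever lost, whereas every new thread creates (and abandons) its own. That matches the observation in the thread that the growth only shows up when the call runs inside threads.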
I've been investigating a memory leak in an application that makes very heavy use of lxml, and I would be delighted if this was the cause.
In my case it took ages to track down because the memory leak didn't seem to appear when the code was single-threaded. If it's any use to you, I found the easiest way was to have a little test.py script that called the offending code several thousand times, and a separate window checking the memory usage (on Linux you can use something like this):

    while true; do ps auxw | grep test.py | grep -v "grep"; sleep 1; done

Then when you run test.py you can easily see if the memory consumption is going up or staying fairly static. It's a right pain, but it does allow you to narrow down the cause fairly quickly.
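The same kind of check can also be done from inside the test script. The following is a minimal sketch, assuming a Unix system where the standard `resource` module is available; `peak_rss`, `measure_growth` and `candidate` are hypothetical names, and the placeholder workload would need to be replaced by the suspect call.

```python
import resource
import threading


def peak_rss():
    """Peak resident set size so far (kilobytes on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def measure_growth(func, iterations=1000, in_thread=True):
    """Call func repeatedly and return the increase in peak RSS."""
    before = peak_rss()
    for _ in range(iterations):
        if in_thread:
            worker = threading.Thread(target=func)
            worker.start()
            worker.join()  # one thread at a time, nothing runs concurrently
        else:
            func()
    return peak_rss() - before


def candidate():
    # Placeholder workload; swap in the suspect call, e.g. etree.XPath("/Root").
    sum(range(100))


print("in threads:", measure_growth(candidate, 500, in_thread=True))
print("direct:    ", measure_growth(candidate, 500, in_thread=False))
```

Because `ru_maxrss` is a high-water mark, it can show that memory keeps growing but not that it was released again, so watching "ps" in a second window remains the more direct view. Comparing the threaded and direct figures for the same workload is what narrows the cause down.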
On 6 Nov 2013, at 19:35, Stefan Behnel <stefan_ml@behnel.de> wrote:
Ed Singleton, 06.11.2013 19:49:
On 6 Nov 2013, at 18:25, Stefan Behnel wrote: Would this be likely to cause a memory leak in non-threaded code?
I don't think so. Single-threaded code should always use the same (global) dictionary, so it can't leak memory more than once. The leak here seemed to be a couple of bytes per thread, that's so tiny that you wouldn't even notice it with only one thread.
No, I'm looking at an app that uses up 1GB of memory over 20,000 iterations, with about 100 xpaths per iteration, so it would need to be something bigger than that.
I've been investigating a memory leak in an application that makes very heavy use of lxml, and I would be delighted if this was the cause.
As you've seen, the best way to get it fixed is to invest the time to strip it down to just a couple of operations. That's work, sure, but someone has to invest it.
You are, of course, right. Breaking it into smaller and smaller chunks seems the way to go, until I can find out where it is.
You should also make sure that you are using the latest libxml2/libxslt.
I hadn't thought of that. Good idea.
And you could run your program through valgrind, which can detect memory that doesn't get freed. The Makefile in lxml's sources has an example command line.
Thanks. I'll try this out. Ed
Ed Singleton, 08.11.2013 12:35:
No, I'm looking at an app that uses up 1GB of memory over 20,000 iterations, with about 100 xpaths per iteration, so it would need to be something bigger than that.
That sounds more like it's leaking entire subtrees. Try stripping the XPath queries from it completely to see if it's related to them at all. You could replace them by .find(), for example.

Stefan
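As a rough illustration of swapping an XPath query for .find(), here is a small sketch. The sample document and the `Item` element name are made up for the example, and the fallback import is only there so the snippet also runs where lxml is not installed.

```python
try:
    from lxml import etree  # preferred, if installed
except ImportError:  # fall back so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

# Made-up sample document for illustration.
root = etree.fromstring("<Root><Item id='1'/><Item id='2'/></Root>")

# With lxml, the XPath version would be: items = root.xpath("Item")
items = root.findall("Item")  # ElementPath lookup, returns a list
first = root.find("Item")     # first match, or None if there is none

print(len(items), first.get("id"))
```

Note that .find() returns a single element (or None) while xpath() returns a list, so .findall() is usually the closer drop-in replacement. If the memory growth persists with the XPath calls gone, they are probably not the cause.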
That's great - it's always good to know it was worth putting the effort in to come up with a simple reproducible test case! :)

Now for a slightly annoying question - do you have any idea if/when this fix would be compiled into a proper release of lxml? I only ask because any software updates I use have to be approved before they can go live, and it's a lot easier if they're an 'official' update than a patch that needs compiling manually.

Thanks!
participants (4)
- Amaury Forgeot d'Arc
- Brian Bird
- Ed Singleton
- Stefan Behnel