Re: [lxml-dev] lxml-dev Digest, Vol 75, Issue 1

Hello guys, I am sorry that I am sending this as a response, but there are two issues I'd like to point out:

1. There is a memory leak when using lxml.html.parse (or etree) repeatedly in a loop. In particular, creating etrees in a loop leaves the old trees in memory: they are not freed properly when you reuse the same Python variable to store the results. For now I haven't tried to resolve it, because the re module (regular expressions) is just fine for URL extraction; however, I would prefer to use XPath for extracting a variety of links more easily, from a coding point of view. Plus, I think the overhead of tree building is not that high (I don't know for sure, though).

2. Speaking of XPath for URL extraction, I think lxml.html has some issues with it (this is my impression from reading the module's code). The question is: why not use XPath to make the code half the size and twice as neat (i.e. cleaner and better formed; I hope my vocabulary is correct)? Maybe faster, too.

Best Regards, Dimitrios

On 12/02/2010 11:35 AM, lxml-dev-request@codespeak.net wrote:

Dimitrios Pritsos, 02.12.2010 12:17:
I am sorry that I am sending this as a response
No need to do so if you want to start a new topic. Just send a message directly to the list address. Replies are for replying.
I can reproduce this. I'll take a look ASAP. Stefan

Stefan Behnel, 02.12.2010 13:48:
It's easily reproducible. I can parse a document repeatedly in a loop using lxml.html.parse() and see the memory consumption of the Python process grow. I reproduced it with 2.3-pre; I don't know if 2.2 suffers from the same problem. I'll see about that when I've figured out what happens. It's only a problem with the HTML parser, and it's not related to lxml.html. This is enough to reproduce it:

from lxml import etree

p = etree.HTMLParser()
while True:
    etree.parse("somefile.html", p)

Stefan
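For anyone wanting to watch the effect themselves, here is a self-contained, bounded variant of the loop above that samples the process's peak RSS via the stdlib resource module (Unix-only). The temporary file stands in for "somefile.html", which is just a placeholder name in the original snippet; whether the RSS actually grows depends on the libxml2 build, as discussed below.

```python
import os
import resource
import tempfile

from lxml import etree

# Write a small HTML file so the sketch is self-contained; the file
# name "somefile.html" in the original snippet is only a placeholder.
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as f:
    f.write("<html><body><p>hello</p></body></html>")
    path = f.name

parser = etree.HTMLParser()
rss_before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
for _ in range(1000):  # bounded stand-in for the original `while True`
    tree = etree.parse(path, parser)
rss_after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

os.unlink(path)
print(tree.getroot().tag)      # html
print(rss_after - rss_before)  # grows per iteration on affected libxml2 builds
```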

Stefan Behnel, 02.12.2010 20:11:
I think it may be an issue with libxml2. The memory consumption seems to be stable with 2.7.7 and 2.7.8 but not with my system's 2.7.6. What's the version you use? Could you try the latest one? http://codespeak.net/lxml/dev/FAQ.html#i-think-i-have-found-a-bug-in-lxml-wh... Stefan

Hi,
FWIW, memory looks stable here (vintage version):

python2.4 -i -c 'from lxml import etree; print etree.__version__; print "%s (%s) - %s (%s)" % (etree.LIBXML_VERSION, etree.LIBXML_COMPILED_VERSION, etree.LIBXSLT_VERSION, etree.LIBXSLT_COMPILED_VERSION)'
2.2.6
(2, 6, 32) ((2, 6, 32)) - (1, 1, 23) ((1, 1, 23))
Holger

On 02/12/10 23:13, Stefan Behnel wrote:
Hello All, I am sorry for the late response. I've tried it with 2.7.7 and 2.7.8; the memory leak persists, even if you do this:

xhtml_tree = lxml.html.parse(open('myhtmlfile.html', 'r'))
del xhtml_tree

HAPPY NEW YEAR

Regards, Dimitrios

Dimitrios Pritsos, 02.12.2010 12:17:
Sure, lxml.html has specific support for extracting URLs from parsed documents.
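As a concrete illustration of that built-in support, a minimal sketch using lxml.html's iterlinks(), which yields (element, attribute, link, pos) tuples for href/src-style attributes in document order (the HTML snippet here is invented for the example):

```python
import lxml.html

doc = lxml.html.fromstring(
    '<html><body>'
    '<a href="http://example.com/a">a</a>'
    '<img src="logo.png">'
    '</body></html>'
)

# iterlinks() yields (element, attribute, link, pos) for every
# link-carrying attribute it knows about.
links = [link for element, attribute, link, pos in doc.iterlinks()]
print(links)  # ['http://example.com/a', 'logo.png']
```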
Plus I think that the overhead of tree building is not that high (I don't know for sure, though).
Likely slower than re, but also likely fast enough.
Such as ... ?
Maybe. If you want to provide a patch that simplifies the code and back it with sufficient evidence that it's at least as fast as before and doesn't degrade functionality, I'll be happy to accept it. Stefan

On 12/02/2010 02:53 PM, Stefan Behnel wrote:
But what about the memory leak? I am sorry if there is a solution already. However, I believe this is not intuitive at all (I mean the whole tree staying in memory like garbage instead of being replaced). I don't think I am experienced enough to fix this.
I just think it is harder to keep all the definitions of HTML 4.0 (XHTML 1.0, 1.1, etc.) in the code and keep it up to date. XPath, I think, would be more general. Just that :)
As for lxml.html, I think I can send something XPath-based that is optionally multiprocessing/multi-threaded too. But it still needs some work, and I don't have enough time to finish it right now, because this is a critical phase for my PhD and job.
Stefan
Thank you very much for your fast response! Dimitrios

Dimitrios Pritsos, 02.12.2010 16:35:
But what about the Memory Leakage
See my other mail.
How would that be more general? The expressions would simply select what the code currently selects as well. Could you provide an example of what you have in mind?
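For the record, one possible shape of the XPath-based selection being discussed: picking up href/src attributes directly with a single expression. This is a sketch of the idea only, not the patch that was asked for, and the HTML snippet is invented for the example.

```python
from lxml import etree

doc = etree.HTML(
    '<html><body>'
    '<a href="/page">page</a>'
    '<script src="app.js"></script>'
    '</body></html>'
)

# Select all href and src attribute values in one XPath expression.
urls = doc.xpath('//@href | //@src')
print(urls)  # ['/page', 'app.js']
```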
I don't think multi-threading (and especially not multiprocessing) makes any sense here. It should be applied at the document level, not within a single document. Stefan

participants (3)
- Dimitrios Pritsos
- jholg@gmx.de
- Stefan Behnel