[lxml-dev] Question about etree vs html

Hello, I am Dimitrios Pritsos and I am working on a web crawler. I use lxml to analyse the pages I fetch while crawling. However, I cannot tell the difference between lxml.html and lxml.etree when it comes to XHTML parsing. In particular, I am confused about which of the many options lxml provides I should use. Moreover, the documentation is a bit misleading. Let me be more specific.
First, I have seen that lxml.html is written in Python and is in fact a shortcut for extracting several kinds of common information from an HTML page, instead of building your own paths and XPath expressions, similar to the XML() and HTML() shortcuts. In addition, all of these shortcuts use HTML() (i.e. the HTMLParser()). Unfortunately it took me a few days to realize this, and I only found the answer at http://zdar.trinet.as/doc/python-lxml-2.0.11/doc/html/api/lxml-module.html, because no documentation clarifies it. Not even the one by John W. Shipman, which is the best for newbies like me.
However, the documentation (found at http://codespeak.net/lxml/lxmldoc-2.2.7.pdf) states: "Note that XHTML is best parsed as XML, parsing it with the HTML parser can lead to unexpected results". Considering that, lxml.etree looks like the best choice for the web, because a great many web pages are in XHTML and not HTML markup. On the other hand, lxml.html has all the good stuff. So what exactly is going on here? Which library should I use, or how can I combine them so that I do not lose any information from the pages?
After several days of tests, I found that different "parsing" functions give different results, and that different tostring() calls (from html or etree) again give different results, even for the same ElementTree. Why is that? I found no documentation for this either.
In general lxml seems really great to me. However, because of the limited documentation, sometimes you cannot tell what is what, and everything just looks like a different path to the same result. That is not the case, as far as I can tell from my tests: in practice the results are totally different. For example, try this:
/Why do the above give different results when, based on the documentation, they are supposed to give the same result?/
/Again, why is there this different result when the documentation does not report it?/ So could you please advise me what I should do? One more question: when I am using XMLParser(), which DTD is used for building the ElementTree? In the case of HTMLParser() I can tell it is HTML 4.0, because this is what I get when I do this:
PLEASE I NEED SOME HELP HERE! Best Regards, Dimitrios
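The examples from the original message did not survive in the archive. As a stand-in, here is a minimal sketch of the kind of difference described above, using a made-up XHTML snippet (an assumption, not one of the crawled pages). It parses the same markup once with the default XML parser and once through lxml.html, then serializes the same element with etree.tostring() and lxml.html.tostring().

```python
from lxml import etree
import lxml.html

# Hypothetical XHTML input, invented for illustration only.
XHTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Test</title></head>'
    '<body><p>Hello<br/>world</p></body>'
    '</html>'
)
XHTML_NS = 'http://www.w3.org/1999/xhtml'

# Default XML parser: the XHTML namespace becomes part of every tag name.
xml_root = etree.fromstring(XHTML)
print(xml_root.tag)

# lxml.html (HTML parser): tags come back as plain, un-namespaced names.
html_root = lxml.html.fromstring(XHTML)
print(html_root.tag)

# The same search therefore behaves differently on the two trees.
plain_hits_in_xml = xml_root.findall('.//p')                  # no match
ns_hits_in_xml = xml_root.findall('.//{%s}p' % XHTML_NS)      # one match
plain_hits_in_html = html_root.findall('.//p')                # one match

# One element, two serializers: XML syntax versus HTML syntax.
br = html_root.find('.//br')
as_xml = etree.tostring(br, with_tail=False)
as_html = lxml.html.tostring(br, with_tail=False)
print(as_xml, as_html)
```

So part of the "different results" may simply come from namespaced versus plain tag names, and from XML versus HTML serialization rules, rather than from lost information.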

Hi,
You are missing the distinction between Elements and ElementTrees: http://codespeak.net/lxml/tutorial.html#parsing-from-strings-and-files
ElementTree and Element have a different API. I can't really comment on your other questions as I've never used lxml.html. Holger
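A small sketch of the Element versus ElementTree distinction mentioned here, assuming a made-up document with a hypothetical DOCTYPE (the DTD file name is invented and is never loaded). etree.parse() returns an ElementTree, etree.fromstring() returns the root Element, and the two serialize differently.

```python
from io import BytesIO
from lxml import etree

# Hypothetical document; "example.dtd" is a placeholder and is not fetched,
# because the default XMLParser does not load DTDs.
DOC = (
    b'<!DOCTYPE root SYSTEM "example.dtd">\n'
    b'<root><child/></root>'
)

# parse() gives an ElementTree: a wrapper around the whole document.
tree = etree.parse(BytesIO(DOC))

# fromstring() gives the root Element only.
root = etree.fromstring(DOC)

# Moving between the two views.
root_from_tree = tree.getroot()       # ElementTree -> Element
tree_from_root = root.getroottree()   # Element -> ElementTree

# Document-level information lives on the ElementTree.
print(tree.docinfo.system_url)

# Serializing the tree includes the DOCTYPE; serializing the element does not.
tree_bytes = etree.tostring(tree)
root_bytes = etree.tostring(root)
print(tree_bytes)
print(root_bytes)
```

This is one way the same parsed document can yield two different tostring() results: it depends on whether an Element or an ElementTree is passed in.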

On 31/08/10 10:21, jholg@gmx.de wrote:
Thank you for this tutorial!
As I have seen in the lxml.html internals, lxml.etree is based on libxml2, while lxml.html is a shortcut (written in Python) for the common functions that one would otherwise have to build oneself when passing the HTMLParser() to etree.parse() or etree.fromstring() as an "external" parser (i.e. not the default XMLParser). All of my questions above boil down to one: which parser should I use to get an ElementTree from the XHTML files I am downloading for further analysis? The XMLParser (with load_dtd=True to use the DTD for parsing, recover=True, no_network=False) or the HTMLParser? Which will give me a proper ElementTree for XHTML files that are not exactly HTML 4.0 (but really close)? The reason I am looking into this in detail is that, as I said, the documentation states that an HTML parser might return an ElementTree which is not the proper one when it has to deal with XHTML, and that in that case it is better to use an XML parser.
Holger
Thank you very much, Holger, for your instant response! Dimitrios
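One possible way to approach the "which parser" question, sketched under assumptions: try the strict XML parser first and fall back to the HTML parser only when the page is not well-formed. The helper name parse_page and the sample inputs are hypothetical, and this is not a recommendation from the lxml documentation. Note that the sketch deliberately leaves out recover=True, since a recovering XML parser would usually not raise and the fallback would never be reached.

```python
from lxml import etree


def parse_page(data):
    """Hypothetical helper: return (root_element, parser_kind).

    Tries strict XML parsing first (suitable for well-formed XHTML) and
    falls back to the lenient HTML parser when the input is not
    well-formed XML.
    """
    try:
        # no_network=True keeps the parser from fetching external resources.
        xml_parser = etree.XMLParser(no_network=True)
        return etree.fromstring(data, xml_parser), 'xml'
    except etree.XMLSyntaxError:
        return etree.fromstring(data, etree.HTMLParser()), 'html'


# Made-up inputs for illustration.
well_formed = (
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<body><p>ok</p></body></html>'
)
tag_soup = b'<html><body><p>unclosed<br></body></html>'

good_root, good_kind = parse_page(well_formed)
soup_root, soup_kind = parse_page(tag_soup)
print(good_kind, good_root.tag)
print(soup_kind, soup_root.tag)
```

The trade-off is that the two branches produce differently shaped trees (namespaced tags in the XML case, plain tags in the HTML case), so downstream code has to handle both or normalize one into the other.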

On Mon, 2010-08-30 at 17:34 +0300, Dimitrios Pritsos wrote:
Hi, I think lxml.html and lxml.etree do the same thing, but lxml.html has some HTML-specific methods such as .head, and lxml.html only uses etree.HTMLParser(), while etree has more parsers. I'm developing a kind of web crawler too, but problems with parsing bad HTML come from libxml2, not from lxml. lxml is just a Python wrapper around libxml2 and libxslt (which are written in C). Cheers, -- Sérgio M. B.
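To make the point about HTML-specific methods concrete, here is a brief sketch with an invented page. It shows a few conveniences that elements from lxml.html carry (.head, .body, text_content(), iterlinks()) and that a plain element from etree.HTML() lacks, even though both go through an HTML parser.

```python
from lxml import etree
import lxml.html

# Hypothetical page, invented for illustration.
PAGE = (
    '<html><head><title>T</title></head>'
    '<body><p>Hi <a href="/x">link</a></p></body></html>'
)

# lxml.html returns HtmlElement objects with HTML-specific helpers.
doc = lxml.html.fromstring(PAGE)
title = doc.head.find('title').text
body_text = doc.body.text_content()
links = [link for (element, attribute, link, pos) in doc.iterlinks()]
print(title, body_text, links)

# etree.HTML() also uses an HTML parser but returns plain elements,
# so the same information has to be reached through paths.
plain = etree.HTML(PAGE)
plain_title = plain.findtext('head/title')
plain_has_head = hasattr(plain, 'head')
print(plain_title, plain_has_head)
```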

participants (3)
- Dimitrios Pritsos
- jholg@gmx.de
- Sergio Monteiro Basto