Mailman 3 [lxml-dev] Question about etree vs html - lxml - The Python XML Toolkit

Aug. 30, 2010

Hello,

I am Dimitrios Pritsos and I am working on a web crawler. To analyse 
the pages I fetch while crawling I am using lxml. However, I cannot tell 
the difference between lxml.html and lxml.etree when it comes to parsing 
XHTML. In particular, I am confused about which of the many options lxml 
provides I should use. Moreover, the documentation is a bit misleading.

Let me be more specific. First, I have seen that lxml.html is written in 
Python and is in fact a shortcut for extracting common pieces of 
information from an HTML page, so that you do not have to build your own 
paths and XPath expressions, similar to the XML() and HTML() shortcuts. 
In addition, all of these shortcuts use HTML() (i.e. the HTMLParser()). 
Unfortunately, it took me a few days to realize this, and I found the 
answer here: 
http://zdar.trinet.as/doc/python-lxml-2.0.11/doc/html/api/lxml-module.html. 
No documentation I have read makes this clear, not even the one by 
John W. Shipman, which is the best for newbies like me.

However, the documentation (found at 
http://codespeak.net/lxml/lxmldoc-2.2.7.pdf) contains a statement that 
says: "Note that XHTML is best parsed as XML, parsing it with the HTML 
parser can lead to unexpected results". Given that, lxml.etree should be 
the best choice for the WWW, right? After all, a great many web pages 
are written in XHTML rather than HTML markup. On the other hand, 
lxml.html has all the good stuff. So what exactly is going on here? 
Which library should I use, or how can I combine them so that I do not 
lose any information from the pages?
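
One possible way to combine the two, given only as a rough sketch: try 
the strict XML parser first and fall back to the lenient HTML parser 
when the page is not well-formed. The helper name parse_page and the 
returned mode label are hypothetical, not part of lxml; the lxml calls 
used are etree.fromstring(), etree.XMLSyntaxError and 
lxml.html.document_fromstring().

```python
from lxml import etree
import lxml.html


def parse_page(data):
    """Hypothetical helper: return (root, mode) for a page given as bytes.

    Well-formed XHTML is parsed as XML; anything the XML parser rejects
    is handed to the HTML parser instead.
    """
    try:
        return etree.fromstring(data), 'xml'
    except etree.XMLSyntaxError:
        return lxml.html.document_fromstring(data), 'html'


xhtml = (b'<html xmlns="http://www.w3.org/1999/xhtml">'
         b'<body><p>Hi</p></body></html>')
tagsoup = b'<div><br>Testing</div>'   # unclosed <br>: not well-formed XML

xml_root, xml_mode = parse_page(xhtml)
html_root, html_mode = parse_page(tagsoup)

print(xml_mode, xml_root.tag)     # XML parsing keeps the XHTML namespace
print(html_mode, html_root.tag)   # the HTML parser wraps the fragment
```

Note that elements parsed as XML carry the XHTML namespace in their tag 
names and do not have the lxml.html convenience methods, so code that 
walks the tree needs to know which mode was used.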

After several days of testing, I found that different "parsing" 
functions give different results, and that different tostring() calls 
(from html or etree) also give different results, even for the same 
ElementTree. Why is that? I found no documentation on this either.

In general lxml seems really great to me. However, because of the 
limited documentation, sometimes you cannot tell what is what, and 
everything looks like just a different path to the same result. As far 
as I can tell from my tests this is not the case: in practice the 
results are quite different.

For example, try this:
...
...
...
xhtmlsrc = ('<!DOCTYPE html PUBLIC "-//W3C/DTD XHTML 1.0 Transitional//EN" '
            '"http://www.w3c.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
            '<div><br />Testing</div>')
lxml.etree.tostring(lxml.html.soupparser.fromstring(xhtmlsrc))
'<html>DOCTYPE html PUBLIC "-//W3C/DTD XHTML 1.0 Transitional//EN" 
"http://www.w3c.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"<div><br/>Testing</div></html>'
...
...
...
lxml.etree.tostring(lxml.html.fromstring(xhtmlsrc))
'<html><body><div><br/>Testing</div></body></html>'
lxml.etree.tostring(lxml.html.parse(StringIO(xhtmlsrc)))
'<!DOCTYPE html PUBLIC "-//W3C/DTD XHTML 1.0 Transitional//EN" 
"http://www.w3c.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html><body><div><br/>Testing</div></body></html>'
/Why do the calls above give different results when, according to the 
documentation, they are supposed to give the same result?/
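
A likely part of the explanation, sketched here under assumptions: 
lxml.html.parse() returns an ElementTree (the whole document, which 
carries the DOCTYPE), while lxml.html.fromstring() returns a single 
element, so serializing the two gives different text. The sketch reuses 
the markup of xhtmlsrc as bytes and reads it through BytesIO instead of 
StringIO, which assumes Python 3.

```python
from io import BytesIO

from lxml import etree
import lxml.html

# Same markup as xhtmlsrc above, as bytes (assumption: Python 3).
src = (b'<!DOCTYPE html PUBLIC "-//W3C/DTD XHTML 1.0 Transitional//EN" '
       b'"http://www.w3c.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
       b'<div><br />Testing</div>')

tree = lxml.html.parse(BytesIO(src))   # whole document: an ElementTree
root = tree.getroot()                  # only the root <html> element
elem = lxml.html.fromstring(src)       # also only an element

print(type(tree).__name__, type(root).__name__, type(elem).__name__)
print(tree.docinfo.doctype)            # the DOCTYPE is stored on the document
print(etree.tostring(tree))            # document serialization, with DOCTYPE
print(etree.tostring(root))            # element serialization, without DOCTYPE
```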
...
...
...
xhtmltree = lxml.html.parse(StringIO(xhtmlsrc))
xhtmltree.test_content()
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
AttributeError: 'lxml.etree._ElementTree' object has no attribute 
'test_content'
xhtmltree = lxml.html.document_fromstring(xhtmlsrc)
xhtmltree.text_content()
'Testing'
/Again, why are the results different when the documentation does not 
mention it?/
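
For comparison, a minimal sketch of how text_content() can be reached 
in both cases. It assumes Python 3 (so bytes and BytesIO replace 
StringIO) and relies on text_content() being a method of lxml.html 
elements, not of the ElementTree object that parse() returns.

```python
from io import BytesIO

import lxml.html

src = b'<div><br />Testing</div>'

tree = lxml.html.parse(BytesIO(src))      # ElementTree wrapping the document
root = tree.getroot()                     # root element of that document
doc = lxml.html.document_fromstring(src)  # root element directly

print(hasattr(tree, 'text_content'))      # the tree itself lacks the method
print(root.text_content())
print(doc.text_content())
```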

So could you please advise me on what I should do? One more question: 
when I use XMLParser(), which DTD is used for building the ElementTree? 
In the case of HTMLParser() I can tell it is HTML 4.0, because this is 
what I get when I do this:
...
...
...
xhtmlsrc2 = '<div><br />Testing</div>'
xhtmltree = lxml.html.parse(StringIO(xhtmlsrc2))
lxml.html.tostring(xhtmltree)
'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
"http://www.w3.org/TR/REC-html40/loose.dtd">\n<html><body><div><br>Testing</div></body></html>'
PLEASE I NEED SOME HELP HERE!

Best Regards,

Dimitrios
