Hello,

I am Dimitrios Pritsos and I am working on a web crawler. To analyse the pages I fetch while crawling, I am using lxml. However, I cannot tell the difference between lxml.html and lxml.etree when it comes to parsing XHTML. In particular, I am confused about which of the many options lxml provides I should use. Moreover, the documentation is a bit misleading.

Let me be more specific. First, I have seen that lxml.html is written in Python and is in fact a set of shortcuts for extracting common information from an HTML page, instead of building your own paths and XPath expressions, similar to the XML() and HTML() shortcuts. In addition, all of these shortcuts use HTML() (i.e. the HTMLParser()). Unfortunately it took me a few days to realize this, and I only found the answer here: http://zdar.trinet.as/doc/python-lxml-2.0.11/doc/html/api/lxml-module.html, because no documentation clarifies it. Not even the one by John W. Shipman, which is the best for newbies like me.

However, the documentation (found at http://codespeak.net/lxml/lxmldoc-2.2.7.pdf) states: "Note that XHTML is best parsed as XML, parsing it with the HTML parser can lead to unexpected results". Considering that, lxml.etree would seem to be the best choice for the web, because a great many web pages are written in XHTML rather than HTML markup. On the other hand, lxml.html has all the good stuff. So what exactly is going on here? Which library should I use, or how can I combine them so that I do not lose any information from the pages?
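To make the question concrete, here is a minimal sketch of one difference I can isolate between the two parsers. The sample page is a hypothetical, well-formed XHTML document that I made up for this test. As far as I can tell, the XML parser keeps the XHTML namespace in every tag name, while the HTML parser drops it:

```python
from lxml import etree
import lxml.html

# Hypothetical sample page: a small, well-formed XHTML document.
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Test</title></head>'
    '<body><div><br />Testing</div></body>'
    '</html>'
)

# XML parser (lxml.etree): tag names carry the XHTML namespace.
xml_root = etree.fromstring(xhtml)
xml_tag = xml_root.tag

# HTML parser (lxml.html): namespaces are ignored, tags are plain names.
html_root = lxml.html.fromstring(xhtml)
html_tag = html_root.tag

print(xml_tag)
print(html_tag)
```

If this is right, a path such as //div written for the HTML-parsed tree would not match anything in the XML-parsed tree without a namespace prefix, which may be part of the "unexpected results" the documentation mentions.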

After several days of testing, I found that different "parsing" functions give different results, and that different tostring() calls (from html or etree) again give different results, even for the same ElementTree. So why is that? I found no documentation for this either.

In general lxml seems really great to me. However, because of the limited documentation, sometimes you cannot tell what is what, and everything just looks like a different path to the same result. As far as I can tell from my tests, that is not the case: in practice the results are quite different.

For example try this:
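>>> import lxml.etree, lxml.html, lxml.html.soupparser
>>> from StringIO import StringIO  # Python 2; on Python 3: from io import StringIO
>>> xhtmlsrc = '<!DOCTYPE html PUBLIC "-//W3C/DTD XHTML 1.0 Transitional//EN" "http://www.w3c.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><div><br />Testing</div>'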
>>> lxml.etree.tostring(lxml.html.soupparser.fromstring(xhtmlsrc))
'<html>DOCTYPE html PUBLIC "-//W3C/DTD XHTML 1.0 Transitional//EN" "http://www.w3c.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"<div><br/>Testing</div></html>'

>>> lxml.etree.tostring(lxml.html.fromstring(xhtmlsrc))
'<html><body><div><br/>Testing</div></body></html>'
>>> lxml.etree.tostring(lxml.html.parse(StringIO(xhtmlsrc)))
'<!DOCTYPE html PUBLIC "-//W3C/DTD XHTML 1.0 Transitional//EN" "http://www.w3c.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html><body><div><br/>Testing</div></body></html>'

Why do the calls above give different results when, according to the documentation, they are supposed to give the same result?

>>> xhtmltree = lxml.html.parse(StringIO(xhtmlsrc))
>>> xhtmltree.text_content()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'lxml.etree._ElementTree' object has no attribute 'text_content'
>>>
>>> xhtmltree = lxml.html.document_fromstring(xhtmlsrc)
>>> xhtmltree.text_content()
'Testing'
>>>

Again, why do I get this different result when it is not mentioned in the documentation?
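My current guess, sketched below, is that lxml.html.parse() returns a tree object (an _ElementTree) while document_fromstring() returns the root element itself, and that text_content() only exists on elements. Under that assumption, calling getroot() on the parsed tree should give an element that behaves like the one from document_fromstring(). I use BytesIO here only so that the sketch does not depend on the Python version:

```python
from io import BytesIO
import lxml.html

src = b'<div><br />Testing</div>'

# parse() returns the whole tree, not an element.
tree = lxml.html.parse(BytesIO(src))
tree_has_method = hasattr(tree, 'text_content')

# getroot() returns the root element, which does have text_content().
root = tree.getroot()
text = root.text_content()

print(type(tree).__name__, tree_has_method)
print(type(root).__name__, text)
```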

So could you please advise me on what I should do? And one more question: when I use XMLParser(), which DTD is used for building the ElementTree? In the case of HTMLParser() I can tell it is HTML 4.0, because this is what I get when I do this:

>>> xhtmlsrc2 = '<div><br />Testing</div>'
>>> xhtmltree = lxml.html.parse(StringIO(xhtmlsrc2))
>>> lxml.html.tostring(xhtmltree)
'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\n<html><body><div><br>Testing</div></body></html>'
>>>
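One way to check what the XML parser records about the DTD is to look at the docinfo of the parsed tree. This is only a sketch: the second input is a hypothetical XHTML page with a correct XHTML 1.0 Transitional DOCTYPE, and I am assuming that the default XMLParser() does not fetch or load the DTD unless options such as load_dtd or dtd_validation are set.

```python
from io import BytesIO
from lxml import etree

# No DOCTYPE in the input: the XML parser does not invent one.
plain = etree.parse(BytesIO(b'<div><br />Testing</div>'))
doctype_plain = plain.docinfo.doctype

# Hypothetical XHTML page with an explicit DOCTYPE declaration.
page = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<head><title>Test</title></head>'
    b'<body><div><br />Testing</div></body></html>'
)
with_doctype = etree.parse(BytesIO(page))

# docinfo reports what the document itself declared.
public_id = with_doctype.docinfo.public_id
system_url = with_doctype.docinfo.system_url

print(repr(doctype_plain))
print(public_id)
print(system_url)
```

If this holds, the XML parser does not assume any DTD by default; it only reports the one the page declares.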

PLEASE I NEED SOME HELP HERE!

Best Regards,

Dimitrios