Hello,
I am Dimitrios Pritsos and I am working on a web crawler. To analyse
the pages I fetch while crawling, I am using lxml. However, I cannot
tell the difference between lxml.html and lxml.etree when it comes to
parsing XHTML. In particular, I am confused about which of the many
options lxml provides I should use. Moreover, the documentation is a
bit misleading.
Let me be more specific. First, I have seen that lxml.html is written
in Python and is in fact a set of shortcuts for extracting common
information from an HTML page without building your own paths and
XPath expressions, similar to the XML() and HTML() shortcuts. In
addition, all of these shortcuts use HTML() (i.e. the HTMLParser()).
Unfortunately, it took me a few days to realize this, and I found the
answer here:
http://zdar.trinet.as/doc/python-lxml-2.0.11/doc/html/api/lxml-module.html.
No other documentation clarifies this, not even John W. Shipman's,
which is the best one for newbies like me.
However, the documentation (found at
http://codespeak.net/lxml/lxmldoc-2.2.7.pdf) states: "Note that XHTML
is best parsed as XML, parsing it with the HTML parser can lead to
unexpected results". Considering that, lxml.etree would seem to be the
best choice for the web, because a great many web pages use XHTML
rather than HTML markup. On the other hand, lxml.html has all the good
stuff. So what exactly is going on here? Which library should I use,
or how can I combine them so that I do not lose any information from
the pages?
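To make the choice concrete, here is a minimal sketch of three ways to parse the same XHTML. It assumes Python 3 and a recent lxml; the sample document and variable names are made up for illustration. lxml.html.XHTMLParser and lxml.html.xhtml_to_html() come from the lxml.html module, and whether they fit a given crawler is something I have not verified beyond this small case.

```python
from lxml import etree
import lxml.html

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Illustrative XHTML input (not taken from a real page).
xhtml_src = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Demo</title></head>'
    '<body><div><br />Testing</div></body></html>'
)

# 1. Plain XML parser: tag names carry the XHTML namespace.
xml_root = etree.fromstring(xhtml_src)
xml_divs = xml_root.findall(".//{%s}div" % XHTML_NS)

# 2. HTML parser: the namespace is ignored, tags are plain names.
html_root = lxml.html.fromstring(xhtml_src)
html_divs = html_root.findall(".//div")

# 3. XHTMLParser: an XML parser that returns lxml.html element classes,
#    so helpers such as text_content() are available on the result.
xhtml_root = etree.fromstring(xhtml_src, lxml.html.XHTMLParser())
is_html_element = isinstance(xhtml_root, lxml.html.HtmlElement)

# Optionally strip the XHTML namespace in place so plain tag names work.
lxml.html.xhtml_to_html(xhtml_root)
stripped_divs = xhtml_root.findall(".//div")

print(xml_root.tag, html_root.tag, xhtml_root.tag)
print(len(xml_divs), len(html_divs), len(stripped_divs), is_html_element)
```

With the third variant the input still has to be well-formed XML, so a crawler would probably need to fall back to the HTML parser when parsing fails.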
After several days of testing, I found that different parsing
functions give different results, and that different tostring() calls
(from html or etree) also give different results, even for the same
ElementTree. Why is that? I found no documentation on this either.
In general lxml seems really great to me. However, because of the
limited documentation, sometimes you cannot tell what is what, and
everything looks like a different path to the same result. As far as I
can tell from my tests, this is not the case: in practice the results
are quite different.
For example try this:
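>>> import lxml.etree, lxml.html, lxml.html.soupparser
>>> from StringIO import StringIO  # Python 2, as used in this session
>>> xhtmlsrc = ('<!DOCTYPE html PUBLIC "-//W3C/DTD XHTML 1.0 Transitional//EN" '
...             '"http://www.w3c.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
...             '<div><br />Testing</div>')
>>> lxml.etree.tostring(lxml.html.soupparser.fromstring(xhtmlsrc))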
'<html>DOCTYPE html PUBLIC "-//W3C/DTD XHTML 1.0
Transitional//EN"
"http://www.w3c.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"<div><br/>Testing</div></html>'
>>> lxml.etree.tostring(lxml.html.fromstring(xhtmlsrc))
'<html><body><div><br/>Testing</div></body></html>'
>>> lxml.etree.tostring(lxml.html.parse(StringIO(xhtmlsrc)))
'<!DOCTYPE html PUBLIC "-//W3C/DTD XHTML 1.0 Transitional//EN"
"http://www.w3c.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html><body><div><br/>Testing</div></body></html>'
Why do the calls above give different results when, according to the
documentation, they are supposed to give the same result?
>>> xhtmltree = lxml.html.parse(StringIO(xhtmlsrc))
>>> xhtmltree.test_content()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'lxml.etree._ElementTree' object has no attribute 'test_content'
>>>
>>> xhtmltree = lxml.html.document_fromstring(xhtmlsrc)
>>> xhtmltree.text_content()
'Testing'
>>>
Again, why are the results different when the documentation does not
mention it?
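The AttributeError above looks consistent with lxml.html.parse() returning an ElementTree wrapper rather than an element, while document_fromstring() returns the root element directly. A minimal sketch, assuming Python 3 (so io.StringIO replaces the Python 2 StringIO module) and a bare fragment as input, shows getroot() bridging the two:

```python
from io import StringIO
import lxml.html

src = '<div><br />Testing</div>'

# parse() returns an ElementTree object wrapping the whole document.
tree = lxml.html.parse(StringIO(src))

# getroot() gives the <html> element, which has the lxml.html helpers.
root = tree.getroot()

# document_fromstring() returns the <html> element directly.
doc = lxml.html.document_fromstring(src)

tree_has_method = hasattr(tree, "text_content")
root_text = root.text_content()
doc_text = doc.text_content()

print(type(tree).__name__, type(root).__name__, type(doc).__name__)
print(tree_has_method, root_text, doc_text)
```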
So could you please advise me on what I should do? One more question:
when I am using XMLParser(), which DTD is used for building the
ElementTree? In the case of HTMLParser() I can tell it is HTML 4.0,
because this is what I get when I do this:
>>> xhtmlsrc2 = '<div><br />Testing</div>'
>>> xhtmltree = lxml.html.parse(StringIO(xhtmlsrc2))
>>> lxml.html.tostring(xhtmltree)
'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">\n<html><body><div><br>Testing</div></body></html>'
>>>
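One way to look into the XMLParser() side of this is the docinfo property of the parsed tree. The sketch below is a minimal check, assuming Python 3 and a well-formed XHTML sample made up for illustration. XMLParser() has load_dtd and dtd_validation switched off by default, so here the DOCTYPE from the input is only recorded on the tree; it is not fetched or used for validation.

```python
from io import BytesIO
from lxml import etree

# Illustrative, well-formed XHTML input with a DOCTYPE declaration.
xhtml = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<body><div><br />Testing</div></body></html>'
)

# Default XMLParser: no DTD loading, no DTD validation.
parser = etree.XMLParser()
tree = etree.parse(BytesIO(xhtml), parser)

# docinfo reports what the DOCTYPE in the input declared.
info = tree.docinfo
print(info.root_name)
print(info.public_id)
print(info.system_url)
```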
PLEASE I NEED SOME HELP HERE!
Best Regards,
Dimitrios