On Mon, 2010-08-30 at 17:34 +0300, Dimitrios Pritsos wrote:
I am Dimitrios Pritsos and I am working on a web crawler. To analyse the pages I fetch while crawling, I am using lxml. However, I cannot tell the difference between lxml.html and lxml.etree when it comes to parsing XHTML. In particular, I am confused about which of the many options lxml provides I should use.
Hi, I think lxml.html and lxml.etree do much the same thing, but lxml.html adds some HTML-specific methods and properties, such as .head, and its parsing functions are built on etree.HTMLParser(), while lxml.etree offers more parsers. I'm developing a kind of web crawler too, but the problems with parsing bad HTML come from libxml2, not from lxml itself. lxml is just a Python wrapper around libxml2 and libxslt (which are written in C).
Cheers,
-- Sérgio M. B.
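
To make the difference described above concrete, here is a minimal sketch that parses the same markup three ways. The sample markup and the variable names are made up for illustration, and the snippet assumes lxml is installed; it is not meant as the recommended setup for a crawler.

```python
from lxml import etree, html

# Hypothetical sample page: valid as HTML, but not well-formed XML
# (the <p> and <br> elements are never closed).
markup = (
    "<html><head><title>Demo</title></head>"
    "<body><p>Hello <b>world</b><br>bye</body></html>"
)

# 1) lxml.html: lenient HTML parsing, elements carry HTML-specific helpers.
doc = html.fromstring(markup)
title = doc.head.findtext("title")       # .head is provided by lxml.html
body_text = doc.body.text_content()      # all text below <body>, tags stripped

# 2) lxml.etree with an explicit HTML parser: also lenient,
#    but the result is a plain etree element without the HTML helpers.
root = etree.fromstring(markup, etree.HTMLParser())
etree_title = root.findtext("head/title")
has_head_attr = hasattr(root, "head")

# 3) lxml.etree with the default XML parser: strict, so this markup is rejected.
try:
    etree.fromstring(markup)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

print(title, "|", body_text, "|", etree_title, "|", has_head_attr, "|", xml_ok)
```

For crawling pages of unknown quality, the lenient routes (1 or 2) are probably the safer choice. Well-formed XHTML can also go through the strict XML parser, in which case the elements carry the XHTML namespace.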