Parsing HTML?

Sat Apr 26 17:56:59 EDT 2008

Benjamin wrote:
> On Apr 6, 11:03 pm, Stefan Behnel <stefan... at behnel.de> wrote:
>> Benjamin wrote:
>>> I'm trying to parse an HTML file.  I want to retrieve all of the text
>>> inside a certain tag that I find with XPath.  The DOM seems to make
>>> this available with the innerHTML element, but I haven't found a way
>>> to do it in Python.
>>     import lxml.html as h
>>     tree = h.parse("somefile.html")
>>     text = tree.xpath("string( some/element[@condition] )")
>>
>> http://codespeak.net/lxml
>>
>> Stefan
> 
> I actually had trouble getting this to work.  I guess only newer versions
> of lxml have the html module, and I couldn't get it installed.  lxml
> does look pretty cool, though.

Yes, the above code requires lxml 2.x. However, older versions should allow
you to do this:

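     import lxml.etree as et
     # use the HTML parser explicitly instead of the default XML parser
     parser = et.HTMLParser()
     tree = et.parse("somefile.html", parser)
     # string() returns the text content of the first matching element
     text = tree.xpath("string( some/element[@condition] )")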

lxml.html is just a dedicated package that makes HTML handling beautiful. It's
not required for parsing HTML; lxml.etree can do that on its own, along with
general XML stuff.
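
In case it helps, here is a minimal self-contained sketch of the same idea
that parses from a string rather than a file, so it can be tried without any
input file. The sample markup, the XPath expression and the helper name
text_of are made up for illustration and are not from your document.

```python
from lxml import etree

# Hypothetical sample document, only for illustration.
SAMPLE = """
<html><body>
  <div class="post">Hello <b>bold</b> world</div>
  <div class="other">ignored</div>
</body></html>
"""

def text_of(html, xpath_expr):
    """Return the text content of the first node matching xpath_expr."""
    parser = etree.HTMLParser()
    root = etree.fromstring(html, parser)
    # string() flattens the first matching node and its descendants to text;
    # it yields an empty string when nothing matches
    return root.xpath("string(%s)" % xpath_expr)

text = text_of(SAMPLE, "//div[@class='post']")
print(text)
```

Note that string() only gives you the text, not the markup. If you want the
tags as well, which is closer to what innerHTML returns, etree.tostring() on
the matched element serializes it including its markup.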

Stefan