Parsing HTML?

Stefan Behnel stefan_ml at
Sat Apr 26 23:56:59 CEST 2008

Benjamin wrote:
> On Apr 6, 11:03 pm, Stefan Behnel <stefan... at> wrote:
>> Benjamin wrote:
>>> I'm trying to parse an HTML file.  I want to retrieve all of the text
>>> inside a certain tag that I find with XPath.  The DOM seems to make
>>> this available with the innerHTML element, but I haven't found a way
>>> to do it in Python.
>>     import lxml.html as h
>>     tree = h.parse("somefile.html")
>>     text = tree.xpath("string( some/element[@condition] )")
>> Stefan
> I actually had trouble getting this to work.  I guess only new versions
> of lxml have the html module, and I couldn't get it installed.  lxml
> does look pretty cool, though.

Yes, the above code requires lxml 2.x. However, older versions should allow
you to do this:

     import lxml.etree as et
     parser = et.HTMLParser()
     tree = et.parse("somefile.html", parser)
     text = tree.xpath("string( some/element[@condition] )")

lxml.html is just a dedicated package that makes HTML handling beautiful. It's
not required for parsing HTML or for doing general XML stuff (XPath and so on)
with the result.
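
For anyone who wants to try the plain lxml.etree route without a file on disk,
here is a minimal self-contained sketch. The sample markup, the "main" id and
the XPath expression are made up for illustration; only HTMLParser(), parse()
and xpath() come from the snippet above. It assumes a reasonably recent lxml
and Python 3.

```python
import io

import lxml.etree as et

# Hypothetical sample document, standing in for "somefile.html".
html = b"""<html><body>
<div id="main"><p>Hello <b>bold</b> world</p></div>
<div id="other">ignored</div>
</body></html>"""

# An HTML-aware parser lets lxml.etree cope with real-world HTML.
parser = et.HTMLParser()

# parse() accepts a file name or a file-like object.
tree = et.parse(io.BytesIO(html), parser)

# string() concatenates all text nodes below the first matching element,
# so nested markup such as <b> is flattened into plain text.
text = tree.xpath("string(//div[@id='main'])")
print(text)
```

Note that string() only looks at the first node the path selects, so if the
condition matches several elements you would loop over tree.xpath(...) and
collect the text of each one instead.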

