Parsing HTML?
Stefan Behnel
stefan_ml at behnel.de
Sat Apr 26 17:56:59 EDT 2008
Benjamin wrote:
> On Apr 6, 11:03 pm, Stefan Behnel <stefan... at behnel.de> wrote:
>> Benjamin wrote:
>>> I'm trying to parse an HTML file. I want to retrieve all of the text
>>> inside a certain tag that I find with XPath. The DOM seems to make
>>> this available with the innerHTML element, but I haven't found a way
>>> to do it in Python.
>> import lxml.html as h
>> tree = h.parse("somefile.html")
>> text = tree.xpath("string( some/element[@condition] )")
>>
>> http://codespeak.net/lxml
>>
>> Stefan
>
> I actually had trouble getting this to work. I guess only new versions
> of lxml have the html module, and I couldn't get it installed. lxml
> does look pretty cool, though.
Yes, the above code requires lxml 2.x. However, older versions should allow
you to do this:
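import lxml.etree as et
# use the HTML parser explicitly instead of the lxml.html module
parser = et.HTMLParser()
tree = et.parse("somefile.html", parser)
# string() returns the concatenated text of the first matching element
text = tree.xpath("string( some/element[@condition] )")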
lxml.html is just a dedicated package that makes HTML handling more
convenient. It's not required for parsing HTML: lxml.etree's HTMLParser does
that, and the general XML tools such as XPath work on the resulting tree.
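For a version that can be tried without a file on disk, here is a minimal
sketch of the same approach. The sample markup, the "main" id and the XPath
expression are made up for illustration; only HTMLParser, fromstring() and
xpath() come from lxml.etree.

```python
import lxml.etree as et

# Hypothetical sample document, only here so the example runs on its own.
html = ('<html><body>'
        '<div id="main">Hello <b>big</b> world</div>'
        '<p>other</p>'
        '</body></html>')

# Parse the string with the HTML parser; the result is the root <html> element.
parser = et.HTMLParser()
root = et.fromstring(html, parser)

# string() concatenates all text nodes below the first matching element,
# so the text inside the nested <b> tag is included.
text = root.xpath("string(//div[@id='main'])")
print(text)
```

Note that string() returns text only. If the markup itself is wanted, which
is closer to what innerHTML gives you, et.tostring() on the matched element
serializes that element together with its children.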
Stefan