[Tutor] Problem using lxml
stefan_ml at behnel.de
Sun Aug 23 10:10:53 CEST 2015
Anthony Papillion wrote on 23.08.2015 at 01:16:
> from lxml import html
> import requests
> page = requests.get("http://joplin.craigslist.org/search/w4m")
> tree = html.fromstring(page.text)
While requests has its merits, this can be simplified to
    tree = html.parse("http://joplin.craigslist.org/search/w4m")
> titles = tree.xpath('//a[@class="hdrlnk"]/text()')
> for title in titles:
>     print title
This only works as long as the link tags only contain plain text, no other
tags, because "text()" selects individual text nodes in XPath. Also, using
@class="hdrlnk" will not match link tags that use class=" hdrlnk " or
class="abc hdrlnk other".
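To make both pitfalls concrete, here is a minimal sketch. The markup in SNIPPET is made up for illustration and is not taken from the real page; the whitespace-tolerant class test is the usual plain XPath 1.0 idiom, shown here only as one possible alternative.

```python
from lxml import html

# Hypothetical markup, invented for this example.
SNIPPET = """
<div>
  <a class="hdrlnk" href="/1">Plain title</a>
  <a class="hdrlnk" href="/2">Title with <b>bold</b> part</a>
  <a class="abc hdrlnk other" href="/3">Multi-class title</a>
</div>
"""

root = html.fromstring(SNIPPET)

# text() returns one string per direct text node: the second link is split
# into two pieces, the text inside <b> is missing, and the third link is
# not matched at all because its class attribute is not exactly "hdrlnk".
exact = root.xpath('//a[@class="hdrlnk"]/text()')
print(exact)   # ['Plain title', 'Title with ', ' part']

# Class test that tolerates extra whitespace and additional class names,
# combined with string() to collect the full text content of each link.
links = root.xpath(
    '//a[contains(concat(" ", normalize-space(@class), " "), " hdrlnk ")]')
titles = [link.xpath("string()") for link in links]
print(titles)  # ['Plain title', 'Title with bold part', 'Multi-class title']
```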
If you want to be on the safe side, I'd use cssselect instead and then
serialise the complete text content of the link tag to a string, i.e.
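    from lxml.etree import tostring
    for link_element in tree.cssselect("a.hdrlnk"):
        # serialise only the text inside the link, without trailing text
        title = tostring(
            link_element,
            method="text", encoding="unicode", with_tail=False)
        print(title.strip())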
Note that the "cssselect()" feature requires the external "cssselect"
package to be installed. "pip install cssselect" should handle that.
Oh, and bare "except:" clauses are generally frowned upon because they can
easily hide bugs by also catching unexpected exceptions. Better be explicit
about the exception type(s) you want to catch.
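As a small illustration of how a bare "except:" can hide a bug, consider the following sketch. Both function names are hypothetical and not part of your code; the first one contains a deliberate typo.

```python
def count_titles_bare(titles):
    """Bare except: the misspelled name below is silently swallowed."""
    try:
        return len(titels)  # deliberate typo, raises NameError
    except:
        # Catches everything, including the NameError from the typo,
        # so the bug looks like "no titles found".
        return 0


def count_titles_explicit(titles):
    """Explicit except: only the anticipated failure is handled."""
    try:
        return len(titles)
    except TypeError:
        # Raised when titles has no length, e.g. when it is None.
        return 0


print(count_titles_bare(["a", "b"]))      # 0, the typo goes unnoticed
print(count_titles_explicit(["a", "b"]))  # 2
print(count_titles_explicit(None))        # 0
```

With the explicit version, the same typo would surface immediately as a NameError instead of producing a plausible-looking wrong result.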