[Tutor] Problem using lxml

Sun Aug 23 10:10:53 CEST 2015

Anthony Papillion wrote on 23.08.2015 at 01:16:
> from lxml import html
> import requests
> 
> page = requests.get("http://joplin.craigslist.org/search/w4m")
> tree = html.fromstring(page.text)

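While requests has its merits, this can be simplified to

    tree = html.parse("http://joplin.craigslist.org/search/w4m").getroot()

(parse() returns an ElementTree wrapper; getroot() gives you the root
element, which is what fromstring() returned in your code.)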
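To see what parse() hands back without touching the network, here is a minimal sketch that feeds it an in-memory document instead of the URL. The markup is made up for illustration; only the class name "hdrlnk" comes from the original code.

```python
import io
from lxml import html

# Made-up stand-in for the downloaded page.
source = io.BytesIO(
    b"<html><body><a class='hdrlnk' href='#'>A title</a></body></html>")

tree = html.parse(source)   # an ElementTree wrapper around the document
root = tree.getroot()       # the root <html> element

# XPath works on the tree wrapper as well as on elements.
titles = tree.xpath('//a[@class="hdrlnk"]/text()')
print(root.tag, titles)
```

Element-level helpers are found on the root element, so calling getroot() once right after parsing keeps the rest of the code the same as with fromstring().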

> titles = tree.xpath('//a[@class="hdrlnk"]/text()')
> try:
>     for title in titles:
>         print title

This works only as long as the link tags contain nothing but plain text, no
other tags, because "text()" selects individual text nodes in XPath. Also,
@class="hdrlnk" is an exact string comparison, so it will not match link
tags that use class="  hdrlnk  " or class="abc hdrlnk other".
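Both pitfalls can be reproduced on a small made-up fragment. The sketch below is only an illustration: the markup is invented, the concat()/normalize-space() expression is the usual XPath 1.0 idiom for matching one class token, and text_content() is the lxml.html shortcut for collecting all text below an element.

```python
from lxml import html

doc = html.fromstring(
    '<div>'
    '<a class="hdrlnk" href="#1">plain title</a>'
    '<a class="abc hdrlnk other" href="#2">title with <b>bold</b> part</a>'
    '</div>')

# Exact attribute comparison: the second link is silently skipped.
exact = doc.xpath('//a[@class="hdrlnk"]/text()')

# Token-wise class match in plain XPath 1.0: finds both links.
links = doc.xpath(
    '//a[contains(concat(" ", normalize-space(@class), " "), " hdrlnk ")]')

# text() returns the direct text nodes only, split around the <b> tag ...
pieces = links[1].xpath('text()')
# ... while text_content() joins all text below the element.
full = [link.text_content() for link in links]

print(exact)
print(pieces)
print(full)
```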

If you want to be on the safe side, I'd use cssselect instead and then
serialise the complete text content of the link tag to a string, i.e.

    from lxml.etree import tostring

    for link_element in tree.cssselect("a.hdrlnk"):
        title = tostring(
            link_element,
            method="text", encoding="unicode", with_tail=False)
        print(title.strip())

Note that the "cssselect()" feature requires the external "cssselect"
package to be installed. "pip install cssselect" should handle that.
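The effect of with_tail=False is easy to miss, so here is a small sketch on an invented fragment. It uses find() rather than cssselect() so that it runs without the extra package; the serialisation call is the same as above.

```python
from lxml import html
from lxml.etree import tostring

# Invented fragment: a link followed by text that is not part of the link.
fragment = html.fromstring(
    '<p><a class="hdrlnk" href="#">  Nice <b>bold</b> title </a>'
    ' trailing text</p>')
link = fragment.find('.//a')

# By default, the text following the closing </a> (the "tail") is included.
with_tail = tostring(link, method="text", encoding="unicode")
# with_tail=False restricts the output to the link's own content.
without_tail = tostring(
    link, method="text", encoding="unicode", with_tail=False)

print(repr(with_tail))
print(repr(without_tail.strip()))
```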

> except:
>     pass

Oh, and bare "except:" clauses are generally frowned upon because they can
easily hide bugs by also catching unexpected exceptions. Better be explicit
about the exception type(s) you want to catch.
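As a sketch of what "explicit" could look like here: lxml reports an unreadable file or URL as an IOError, which is the same class as OSError on Python 3 (assumed below). The helper name load_tree and the file path are hypothetical, chosen only for this example.

```python
from lxml import html

def load_tree(source):
    """Parse a file path or URL; return None if it cannot be read."""
    try:
        return html.parse(source).getroot()
    except OSError as exc:
        # Only read failures end up here; a typo or a logic error
        # elsewhere would still raise and show up as a traceback.
        print("could not read %s: %s" % (source, exc))
        return None

# Hypothetical path that does not exist.
result = load_tree("/nonexistent/path/does-not-exist.html")
print(result)
```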

Stefan