[Tutor] Problem using lxml

Stefan Behnel stefan_ml at behnel.de
Sun Aug 23 10:10:53 CEST 2015


Anthony Papillion wrote on 23.08.2015 at 01:16:
> from lxml import html
> import requests
> 
> page = requests.get("http://joplin.craigslist.org/search/w4m")
> tree = html.fromstring(page.text)

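While requests has its merits, this can be simplified to

    tree = html.parse("http://joplin.craigslist.org/search/w4m").getroot()

(parse() returns an ElementTree, so getroot() gives you the same kind of
root element that fromstring() did.)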


> titles = tree.xpath('//a[@class="hdrlnk"]/text()')
> try:
>     for title in titles:
>         print title

This only works as long as the link tags contain nothing but plain text, no
other tags, because "text()" selects individual text nodes in XPath, so a
link with nested markup yields several separate strings. Also,
@class="hdrlnk" compares the whole attribute value, so it will not match
link tags that use class="  hdrlnk  " or class="abc hdrlnk other".
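
To make both pitfalls concrete, here is a small sketch. The sample markup is
invented for illustration and is an assumption, not the actual craigslist
page.

```python
from lxml import html

# Hypothetical markup, invented for illustration only.
SAMPLE = """
<html><body>
  <a class="hdrlnk" href="/1">Plain title</a>
  <a class="hdrlnk" href="/2">Title with <b>bold</b> part</a>
  <a class="abc hdrlnk other" href="/3">Extra classes</a>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# text() returns each text node separately, so the second link is split
# around its <b> tag and the bold text itself is missing.
text_nodes = tree.xpath('//a[@class="hdrlnk"]/text()')
print(text_nodes)

# The exact string comparison skips the third link entirely.
exact_matches = tree.xpath('//a[@class="hdrlnk"]')
print(len(exact_matches), "of", len(tree.xpath('//a')), "links matched")
```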

If you want to be on the safe side, I'd use cssselect instead and then
serialise the complete text content of the link tag to a string, i.e.

    from lxml.etree import tostring

    for link_element in tree.cssselect("a.hdrlnk"):
        title = tostring(
            link_element,
            method="text", encoding="unicode", with_tail=False)
        print(title.strip())

Note that the "cssselect()" feature requires the external "cssselect"
package to be installed. "pip install cssselect" should handle that.
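
If installing cssselect is not an option, roughly the same selection can be
sketched in plain XPath. This is an alternative, not what the code above
does: it matches "hdrlnk" as a whitespace-separated class token and uses
text_content() from lxml.html to collect the text. The sample markup is
again made up.

```python
from lxml import html

# Hypothetical markup, invented for illustration only.
SAMPLE = """
<html><body>
  <a class="  hdrlnk  " href="/1">Padded class</a>
  <a class="abc hdrlnk other" href="/2">Title with <b>bold</b> part</a>
  <a class="hdrlnk2" href="/3">Different class</a>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Pad the normalised class value with spaces, then look for " hdrlnk "
# so that only a complete class token matches.
CLASS_TOKEN = (
    '//a[contains(concat(" ", normalize-space(@class), " "), " hdrlnk ")]'
)

# text_content() joins all text inside the element, nested tags included.
titles = [link.text_content().strip() for link in tree.xpath(CLASS_TOKEN)]
print(titles)
```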


> except:
>     pass

Oh, and bare "except:" clauses are generally frowned upon because they can
easily hide bugs by also catching unexpected exceptions. Better be explicit
about the exception type(s) you want to catch.
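
A minimal sketch of how a bare "except:" can hide a bug. The function names
and the deliberate typo are hypothetical, and AttributeError is only an
assumed example of an error you might expect to see.

```python
def clean(title):
    # Deliberate typo ("titel") to simulate a bug: raises NameError.
    return titel.strip()


def clean_all_bare(titles):
    try:
        return [clean(t) for t in titles]
    except:  # also swallows the NameError, so the typo goes unnoticed
        return []


def clean_all_explicit(titles):
    try:
        return [clean(t) for t in titles]
    except AttributeError:  # assumed expected failure: non-string entries
        return []


# The bare version quietly returns an empty list.
print(clean_all_bare([" some title "]))

# The explicit version lets the unexpected NameError surface.
try:
    clean_all_explicit([" some title "])
except NameError as exc:
    print("bug surfaced:", exc)
```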

Stefan



