Python Web Scraping: Within href, read only those values that have href in it

Jesse Alama jessealama at fastmail.fm
Mon Jan 16 00:39:34 EST 2017


To complement what Peter wrote: I'd approach this problem using
XPath. XPath is a query language for XML/HTML documents; it's a great
tool to have in your web scraping toolbox, and it's useful well beyond
scraping. Python's excellent lxml library supports XPath 1.0. Here's
how I might tackle this problem:

== [ scrape.py ] ======================================================

from lxml import etree

# ...somehow get HTML/XML into the variable xml

root = etree.HTML(xml)

hrefs = root.xpath("//a[@href and starts-with(@href, 'http://')]/@href")

# magic =========>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

print(hrefs) # if you want to see what this looks like

== [ end scrape.py ] ==================================================
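
If you'd like something you can run as-is, here is a minimal sketch of
the same idea. The markup and the variable name sample_html are made up
for illustration (they stand in for whatever page you are actually
scraping):

```python
from lxml import etree

# Hypothetical sample markup, standing in for the page being scraped.
sample_html = """
<html><body>
  <a href="http://example.com/one">absolute</a>
  <a href="/relative/path">relative</a>
  <a name="top">no href at all</a>
  <a href="http://example.org/two">absolute again</a>
</body></html>
"""

root = etree.HTML(sample_html)

# Same XPath expression as in scrape.py above.
hrefs = root.xpath("//a[@href and starts-with(@href, 'http://')]/@href")

for href in hrefs:
    print(href)
```

Only the two links whose href starts with 'http://' should be printed;
the relative link and the anchor without an href are skipped.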

The argument to the xpath method here is an XPath expression.  The
overall form is:

    //a[.....]/@href

The '//a' at the beginning means: starting at the root node of the
document, find all a (anchor) elements that match the condition
specified by ".....".  The '/@href' at the end means: give me the href
attribute of the nodes (if any) that remain.
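
To see the two outer pieces on their own, without any predicate, here
is a small sketch (again with made-up markup):

```python
from lxml import etree

# Made-up markup for illustration only.
root = etree.HTML(
    '<div><a href="/a">one</a>'
    '<p><a href="http://example.com/b">two</a></p>'
    '<a>three</a></div>'
)

anchors = root.xpath("//a")          # every a element, at any depth
all_hrefs = root.xpath("//a/@href")  # href values of those that have one

print(len(anchors))
print(all_hrefs)
```

Note that '//a' finds all three anchors, while '//a/@href' returns only
two values, because the third anchor has no href to give.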

Looking inside the square brackets (what's known as the predicate in the
XPath world), we find

    @href and starts-with(@href, 'http://')

The 'and' bit should be clear (there are two conditions that need to be
checked).  The first part says: the a element should have an href
attribute.  The second part says that the value of the href attribute
must start with 'http://'.
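
If it helps to think of the predicate in plain Python terms, the
following sketch spells out the same two checks in a loop and compares
the result with the XPath version. The markup is made up, and the loop
is only my way of illustrating the predicate, not how lxml evaluates it:

```python
from lxml import etree

# Made-up markup for illustration only.
root = etree.HTML(
    '<p><a href="http://example.com/">yes</a> '
    '<a href="ftp://example.com/">no</a> '
    '<a>no href</a></p>'
)

via_loop = []
for a in root.iter("a"):
    href = a.get("href")  # None when the attribute is missing
    # First condition: has an href.  Second: it starts with 'http://'.
    if href is not None and href.startswith("http://"):
        via_loop.append(href)

via_xpath = root.xpath("//a[@href and starts-with(@href, 'http://')]/@href")

print(via_loop == via_xpath)
```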

In fact, we could simplify the predicate to

    starts-with(@href, 'http://')

If an a element does not have an href attribute at all, @href selects
nothing, and starts-with treats that as the empty string, which does
not start with 'http://'. So it's not an error, and no exception will
be thrown, when the XPath evaluator applies the starts-with function to
an a element that does not have an href attribute.
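
A quick way to convince yourself is to run both forms against markup
that includes an anchor without an href. This is a sketch with made-up
markup; the names long_form and short_form are just mine:

```python
from lxml import etree

# Made-up markup: one anchor without href, one match, one non-match.
root = etree.HTML(
    '<p><a name="anchor-only">no href</a> '
    '<a href="http://example.com/">with href</a> '
    '<a href="mailto:someone@example.com">mail</a></p>'
)

long_form = root.xpath("//a[@href and starts-with(@href, 'http://')]/@href")
short_form = root.xpath("//a[starts-with(@href, 'http://')]/@href")

# The anchor without an href is silently skipped by both expressions.
print(long_form == short_form)
print(short_form)
```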

Hope this helps.

Best regards,

Jesse

--
Jesse Alama
http://xml.sh

