Python Web Scraping : Within href read only those values that have href in it
Jesse Alama
jessealama at fastmail.fm
Mon Jan 16 00:39:34 EST 2017
To complement what Peter wrote: I'd approach this problem using
XPath. XPath is a query language for XML/HTML documents; it's a great
tool to have in your toolbox for web scraping (among other tasks).
Python's excellent lxml library supports XPath. Here's how I might
tackle this problem:
== [ scrape.py ] ======================================================
from lxml import etree
# ...somehow get HTML/XML into the variable xml
root = etree.HTML(xml)
hrefs = root.xpath("//a[@href and starts-with(@href, 'http://')]/@href")
# magic =========> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print(hrefs) # if you want to see what this looks like
== [ end scrape.py ] ==================================================
The argument to the xpath method here is an XPath expression. The
overall form is:
//a[.....]/@href
The '//a' at the beginning means: starting at the root node of the
document, find all a (anchor) elements that match the condition
specified by ".....". The '/@href' at the end means: give me the href
attribute of the nodes (if any) that remain.
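To see those two pieces on their own, here is a minimal sketch. The sample markup and the variable names are made up for illustration; only etree.HTML and the xpath method come from the script above.

```python
from lxml import etree

# Hypothetical sample markup, invented for this illustration.
html = """
<html><body>
  <a href="/top">top-level link</a>
  <div><p><a href="/nested">nested link</a></p></div>
  <a name="bookmark">anchor with no href</a>
</body></html>
"""
root = etree.HTML(html)

# '//a' matches a elements at any depth and returns element objects.
anchors = root.xpath("//a")

# Appending '/@href' returns the attribute values as strings;
# anchors without an href contribute nothing to the result.
hrefs = root.xpath("//a/@href")

print(len(anchors))
print(hrefs)
```

With no predicate, every a element is matched, but only those that actually carry an href show up in the second result.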
Looking inside the square brackets (what's known as the predicate in the
XPath world), we find
@href and starts-with(@href, 'http://')
The 'and' bit should be clear (there are two conditions that need to be
checked). The first part says: the a element should have an href
attribute. The second part says that the value of the href attribute had
better start with 'http://'.
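Here is a small sketch of the predicate at work, assuming some made-up sample markup (the URLs are placeholders, not from the original question):

```python
from lxml import etree

# Hypothetical sample markup, invented for this illustration.
html = """
<html><body>
  <a href="http://example.com/one">absolute link</a>
  <a href="/relative/path">relative link</a>
  <a name="top">anchor without href</a>
  <a href="https://example.com/secure">https link</a>
  <p><a href="http://example.org/two">nested link</a></p>
</body></html>
"""
root = etree.HTML(html)

# Keep only anchors that have an href whose value starts with 'http://'.
http_hrefs = root.xpath("//a[@href and starts-with(@href, 'http://')]/@href")

print(http_hrefs)
```

Note that the 'https://' link is filtered out too, since its value does not literally start with 'http://'. If you want both schemes, one option is to test for starts-with(@href, 'http') instead, at the cost of matching anything else that happens to begin with those four letters.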
In fact, we could simplify the predicate to
starts-with(@href, 'http://')
If an a element does not even have an href attribute, @href selects
nothing, and starts-with treats that as the empty string, which does not
start with 'http://'. So it's not an error, and no exception will be
thrown, when the XPath evaluator applies the starts-with function to an
a element that does not have an href attribute.
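If you'd like to convince yourself, here is a quick check, again on made-up sample markup, that the long and short predicates agree and that a missing href simply yields false:

```python
from lxml import etree

# Hypothetical sample markup, invented for this illustration.
html = (
    '<div>'
    '<a name="top">no href at all</a>'
    '<a href="http://example.com/">http link</a>'
    '<a href="ftp://example.com/">ftp link</a>'
    '</div>'
)
root = etree.HTML(html)

long_form = root.xpath("//a[@href and starts-with(@href, 'http://')]/@href")
short_form = root.xpath("//a[starts-with(@href, 'http://')]/@href")

# Apply starts-with directly to the (nonexistent) href of the first anchor.
# This evaluates to a plain boolean rather than raising an exception.
missing = root.xpath("starts-with(//a[@name='top']/@href, 'http://')")

print(long_form == short_form)
print(missing)
```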
Hope this helps.
Best regards,
Jesse
--
Jesse Alama
http://xml.sh