Python Web Scraping : Within href read only those values that have href in it

Peter Otten __peter__ at
Sat Jan 14 03:44:33 EST 2017

shahsn11 at wrote:

> I am trying to scrape a webpage just for learning. In that webpage there
> are multiple "a" tags. consider the below code
> <a href='\abc\def\jkl'> Something </a>
> <a href ='http:\\'> Something</a>

These are probably all forward slashes.

> Now i want to read only those href in which there is http. My Current code
> is
> for link in soup.find_all("a"):
>     print link.get("href")
> i would like to change it to read only http links.

You mean href values that start with "http://"?
While you can do that with a callback

def check_scheme(href):
    return href is not None and href.startswith("http://")

for a in soup.find_all("a", href=check_scheme):
    print(a["href"])

or a regular expression

import re

for a in soup.find_all("a", href=re.compile("^http://")):
    print(a["href"])

why not keep things simple and check before printing? Like

for a in soup.find_all("a"):
    href = a.get("href", "") # empty string if href is missing
    if href.startswith("http://"):
        print(href)
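
If "http links" is meant to cover "https://" as well, the check could be widened a little. The following is only a sketch under that assumption: the name is_http_link and the sample href values are made up for illustration and do not come from the original page. It relies on str.startswith() accepting a tuple of prefixes.

```python
def is_http_link(href):
    """Return True for href values that start with http:// or https://."""
    # href is None when the <a> tag has no href attribute at all.
    if href is None:
        return False
    # str.startswith() accepts a tuple of prefixes; lower() makes the
    # comparison case-insensitive, as URL schemes are not case-sensitive.
    return href.strip().lower().startswith(("http://", "https://"))


# Hypothetical sample values standing in for what a.get("href") might return.
hrefs = [
    "/abc/def/jkl",
    "http://example.com/",
    "HTTPS://example.com/x",
    None,
    "mailto:someone@example.com",
]

http_links = [h for h in hrefs if is_http_link(h)]
print(http_links)
```

A function like this should also work as the callback, i.e. soup.find_all("a", href=is_http_link), because it already handles the None case.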
