Python Web Scraping: Within href read only those values that have href in it
Peter Otten
__peter__ at web.de
Sat Jan 14 03:44:33 EST 2017
shahsn11 at gmail.com wrote:
> I am trying to scrape a webpage just for learning. In that webpage there
> are multiple "a" tags. consider the below code
>
> <a href='\abc\def\jkl'> Something </a>
>
> <a href ='http:\\www.google.com'> Something</a>
These are probably all forward slashes.
> Now i want to read only those href in which there is http. My Current code
> is
>
> for link in soup.find_all("a"):
>     print link.get("href")
>
> i would like to change it to read only http links.
You mean href values that start with "http://"?
While you can do that with a callback
def check_scheme(href):
    return href is not None and href.startswith("http://")

for a in soup.find_all("a", href=check_scheme):
    print(a["href"])
or a regular expression
import re

for a in soup.find_all("a", href=re.compile("^http://")):
    print(a["href"])
why not keep things simple and check before printing? Like
for a in soup.find_all("a"):
    href = a.get("href", "")  # empty string if href is missing
    if href.startswith("http://"):
        print(href)
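
If "https://" links should count as well, the check could look at the URL scheme instead of a fixed prefix. The following is only a minimal sketch using the standard library; the helper name has_http_scheme and the sample href values are assumptions made up for illustration, not part of the original question.

```python
from urllib.parse import urlsplit


def has_http_scheme(href):
    """Return True if href is a string whose URL scheme is http or https."""
    if not href:
        # Covers None (attribute missing) and the empty string.
        return False
    try:
        scheme = urlsplit(href).scheme
    except ValueError:
        # urlsplit can reject malformed values, e.g. a broken IPv6 host.
        return False
    return scheme in ("http", "https")


# Hypothetical values standing in for what a.get("href") might return.
sample_hrefs = [
    "/abc/def/jkl",
    "http://www.google.com",
    "https://example.com/x",
    None,
    "mailto:someone@example.com",
]

http_links = [h for h in sample_hrefs if has_http_scheme(h)]
print(http_links)
```

A function like this could presumably be passed as the callback in place of check_scheme, i.e. soup.find_all("a", href=has_http_scheme), or called inside the plain loop before printing.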