Listing link urls
Kishore Kumar Alajangi
akishorecert at gmail.com
Sat Oct 28 21:27:53 EDT 2017
Hi,
I am having trouble listing specific URLs inside a web page,
https://economictimes.indiatimes.com/archive.cms
The page contains link URLs organised by year and month,
e.g. /archive/year-2001,month-1.cms
I am able to list all the required URLs using the code below:
from bs4 import BeautifulSoup
import re, csv
import urllib.request
import scrapy

req = urllib.request.Request('http://economictimes.indiatimes.com/archive.cms',
                             headers={'User-Agent': 'Mozilla/5.0'})
links = []
totalPosts = []
url = "http://economictimes.indiatimes.com"
data = urllib.request.urlopen(req).read()
page = BeautifulSoup(data,'html.parser')
for link in page.findAll('a', href = re.compile('^/archive/')):
    # retrieve the URLs whose href starts with "/archive/"
    l = link.get('href')
    links.append(url+l)
with open("output.txt", "a") as f:
    for post in links:
        post = post + '\n'
        f.write(post)
*sample result in text file:*
http://economictimes.indiatimes.com/archive/year-2001,month-1.cms
http://economictimes.indiatimes.com/archive/year-2001,month-2.cms
http://economictimes.indiatimes.com/archive/year-2001,month-3.cms
http://economictimes.indiatimes.com/archive/year-2001,month-4.cms
http://economictimes.indiatimes.com/archive/year-2001,month-5.cms
http://economictimes.indiatimes.com/archive/year-2001,month-6.cms
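The same filtering step can be tried offline. The following is a minimal sketch, not the code used above: it swaps BeautifulSoup for the standard library's html.parser so that it runs without any download, and it builds the absolute URLs with urljoin instead of string concatenation. The names PrefixLinkCollector, absolute_links and SAMPLE_HTML are made up for illustration, and the sample markup is only assumed to resemble the real archive page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://economictimes.indiatimes.com"


class PrefixLinkCollector(HTMLParser):
    """Collect href values of <a> tags whose path starts with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(self.prefix):
            self.links.append(href)


def absolute_links(html, prefix, base=BASE_URL):
    """Return absolute URLs for every matching link in the given HTML text."""
    collector = PrefixLinkCollector(prefix)
    collector.feed(html)
    return [urljoin(base, href) for href in collector.links]


# Hypothetical sample markup, standing in for the downloaded archive page.
SAMPLE_HTML = """
<a href="/archive/year-2001,month-1.cms">Jan</a>
<a href="/archive/year-2001,month-2.cms">Feb</a>
<a href="/about.cms">About</a>
"""

month_urls = absolute_links(SAMPLE_HTML, "/archive/")
print("\n".join(month_urls))
```

Because the prefix ends with a slash, "/archive/" does not also match "/archivelist/" links, so the month and day URLs can be collected separately.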
I store the list of URLs in a text file. From the month URLs I want
to retrieve the day URLs, which start with "/archivelist". I am using
the code below, but I am not getting any result. If I check with
Inspect Element, the URLs starting with /archivelist are there:
<a href="/archivelist/year-2001,month-3,starttime=36951.cms"></a>
Kindly help me see where I am going wrong.
from bs4 import BeautifulSoup
import re, csv
import urllib.request
import scrapy

with open("output.txt", "r") as file:
    for i in file:
        # strip the trailing newline so it is not sent as part of the URL
        urls = urllib.request.Request(i.strip(), headers={'User-Agent': 'Mozilla/5.0'})
        data1 = urllib.request.urlopen(urls).read()
        page1 = BeautifulSoup(data1, 'html.parser')
        for link1 in page1.findAll(href = re.compile('^/archivelist/')):
            l1 = link1.get('href')
            print(l1)
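One possible way to narrow this down, offered as a sketch and not as a confirmed diagnosis, is to check two things separately. First, each line read back from output.txt ends in a newline, which should be stripped before the URL is requested. Second, Inspect Element shows the page after any scripts have run, so a link visible there is not necessarily present in the HTML that urlopen receives; whether that applies to this site is an assumption. The helper names and sample strings below, including buildCalendar, are invented for illustration.

```python
import re

# Matches href="/archivelist/..." anywhere in the raw HTML text.
ARCHIVELIST_HREF = re.compile(r"""href\s*=\s*["'](/archivelist/[^"']+)["']""")


def clean_urls(lines):
    """Strip whitespace (including the trailing newline) and skip blank lines."""
    return [line.strip() for line in lines if line.strip()]


def find_day_links(html):
    """Return every /archivelist/ href that appears literally in the HTML text."""
    return ARCHIVELIST_HREF.findall(html)


# Hypothetical inputs for illustration only.
raw_lines = [
    "http://economictimes.indiatimes.com/archive/year-2001,month-1.cms\n",
    "\n",
]
static_html = '<a href="/archivelist/year-2001,month-3,starttime=36951.cms"></a>'
scripted_html = "<script>buildCalendar(2001, 3);</script>"

print(clean_urls(raw_lines))
print(find_day_links(static_html))    # the link is present in the markup
print(find_day_links(scripted_html))  # empty list: nothing for a parser to find
```

Running find_day_links on the decoded bytes returned by urlopen would show whether the /archivelist/ links are in the downloaded HTML at all. If the list is empty there, the filter passed to findAll is not the problem.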
Thanks,
Kishore.