[Tutor] Web-spider & urllib
Charlie Clark
Charlie@begeistert.org
Tue, 28 Aug 2001 14:11:26 +0200
For my sins I've been writing some web-spiders to collect some
information. One of them collects information from around 260 pages
and seems to hang every now and then without giving any errors -
interrupting it with ctrl+c shows it's waiting on a socket. I've
double-checked to make sure all URLs are closed once their contents
have been collected, and it's easy to see the sockets being
reallocated in netstat.
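One thing I could try is wrapping each urlopen in an alarm-based
timeout so that the hanging request at least identifies itself.
This is just a rough, untested sketch (Unix-only; the 30 second
limit and the fetch()/RequestTimeout names are only made up for
illustration):

import signal
import urllib

class RequestTimeout(Exception):
    pass

def _alarm_handler(signum, frame):
    # called when the alarm fires, i.e. the fetch took too long
    raise RequestTimeout

def fetch(url, seconds=30):
    # SIGALRM should interrupt the blocked socket read (Unix only)
    signal.signal(signal.SIGALRM, _alarm_handler)
    signal.alarm(seconds)
    try:
        f = urllib.urlopen(url)
        data = f.read()
        f.close()
    finally:
        signal.alarm(0)    # cancel the alarm once we're done
    return data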
The basic loops just call the parsing functions:
import urllib

# Collect the pages with the links, 25 links per page
articles = []
for i in range(0, 150, 25):
    src = base_url + index + str(i)
    print "getting ", src
    src = urllib.urlopen(src)
    articles += munig.get_articles(src)
# print articles
# Collect all pages found in the links
places = []
for article in articles:
    src = urllib.urlopen(base_url + article['link'])
    print "getting ", article['headline']
    place = munig.party(src)
    place['headline'] = article['headline']
    places.append(place)
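One variation I've been wondering about, in case holding the socket
open while munig.party is parsing is part of the problem, would be to
read each page into memory, close the handle straight away and parse
from a StringIO instead (this assumes munig.party only needs a
file-like object - untested):

import StringIO

places = []
for article in articles:
    print "getting ", article['headline']
    f = urllib.urlopen(base_url + article['link'])
    data = f.read()
    f.close()                      # give the socket back straight away
    place = munig.party(StringIO.StringIO(data))
    place['headline'] = article['headline']
    places.append(place)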
What's the best way to go about finding out what the problem is? (It
seems worse when running behind the firewall.)
What are possible workarounds? Is it possible to "flush" the
connection?
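If a timeout like the sketch further up actually works, I suppose the
obvious workaround is simply to retry a hanging fetch a few times,
roughly like this (again untested):

def fetch_with_retry(url, tries=3):
    # uses fetch() and RequestTimeout from the sketch above
    for attempt in range(tries):
        try:
            return fetch(url)
        except RequestTimeout:
            print "timed out, retrying", url
    raise RequestTimeout, "gave up on %s" % url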
Thanx
Charlie