[Tutor] Web-spider & urllib

Charlie Clark Charlie@begeistert.org
Tue, 28 Aug 2001 14:11:26 +0200


For my sins I've been writing some web-spiders to collect some 
information. One of them collects information from around 260 pages and 
seems to hang every now and then without raising any errors; 
interrupting it with Ctrl+C indicates it's waiting on a socket. I've 
double-checked that all URLs are closed once their contents have been 
collected, and it's easy to see the sockets being reallocated in 
netstat.
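
So far the only way I can see where it's stuck is interrupting it by 
hand. One thing I'm tempted to try (a rough sketch only, Unix only, and 
the names and the 60 seconds are just made up) is wrapping each fetch in 
an alarm, so a hung request raises an exception with a proper traceback 
instead of just sitting there:

import signal
import urllib

class FetchTimeout(Exception):
    pass

def _raise_timeout(signum, frame):
    raise FetchTimeout

def timed_urlopen(url, seconds=60):
    # SIGALRM interrupts the blocked socket call, so instead of a
    # silent hang we get a traceback pointing at the stuck request
    signal.signal(signal.SIGALRM, _raise_timeout)
    signal.alarm(seconds)
    try:
        return urllib.urlopen(url)
    finally:
        signal.alarm(0)    # cancel the alarm once the open returns

If the hang actually happens while reading the body inside the parsing 
functions, the read would need the same wrapping.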

The basic loops just call the parsing functions:

import urllib

# Collect the pages with the links, 25 links per page
# (base_url, index and munig, the parsing module, are defined elsewhere in the script)
articles = []
for i in range(0, 150, 25):
    src = base_url + index + str(i)
    print "getting ", src
    page = urllib.urlopen(src)
    articles += munig.get_articles(page)
# print articles

# Collect all pages found in the links
places = []
for article in articles:
    page = urllib.urlopen(base_url + article['link'])
    print "getting ", article['headline']
    # munig.party() reads the page and closes the handle once the contents are collected
    place = munig.party(page)
    place['headline'] = article['headline']
    places.append(place)
	
What's the best way to go about tracking down the problem? (It seems 
worse when running behind the firewall.) What are possible workarounds? 
Is it possible to "flush" the connection?
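
The only workaround I've come up with so far is giving the sockets a 
timeout and retrying a fetch that times out. Something along these 
lines, assuming a Python that has socket.setdefaulttimeout (the retry 
count and delay are guesses):

import socket
import time
import urllib

socket.setdefaulttimeout(60)    # don't let any request block for more than a minute

def fetch_with_retry(url, attempts=3, delay=5):
    # retry a slow page a few times rather than hanging the whole run
    for attempt in range(attempts):
        try:
            return urllib.urlopen(url)
        except (socket.timeout, IOError):
            print "timed out on", url, "- retrying"
            time.sleep(delay)
    raise IOError("gave up on " + url)

Whether that counts as "flushing" the connection I don't know, but at 
least the loop would keep moving.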

Thanx

Charlie