urllib2 and threading

robean st1999 at gmail.com
Fri May 1 01:26:44 EDT 2009


I am writing a program that involves visiting several hundred webpages
and extracting specific information from the contents. I've written a
modest 'test' example here that uses a multi-threaded approach to
reach the urls with urllib2. The actual program will involve fairly
elaborate scraping and parsing (I'm using Beautiful Soup for that) but
the example shown here is simplified and just confirms the url of the
site visited.
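
Just for context, the real get_info_from_url will do something along
these lines with Beautiful Soup (the tag names and page structure here
are made up for illustration; my actual pages are different):

from BeautifulSoup import BeautifulSoup
import urllib2

def get_info_from_url(url):
  """Sketch of the real version: grab the title and the first h1."""
  page = urllib2.urlopen(url)
  soup = BeautifulSoup(page.read())
  page.close()
  title = soup.find('title')
  headline = soup.find('h1')  # hypothetical tag; depends on the page
  return (title.string if title else None,
          headline.string if headline else None)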

Here's the problem: the script simply crashes after getting a couple
of urls and takes a long time to run (slower than a non-threaded
version that I wrote and ran). Can anyone figure out what I am doing
wrong? I am new to both threading and urllib2, so it's possible that
the SNAFU is quite obvious.

The urls are stored in a text file that I read from. The urls are all
valid, so there's no problem there.
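
In case it matters, links.txt just has one url per line, along these
lines (placeholders here; the real file has the actual urls):

http://www.example.com/page1
http://www.example.com/page2
http://www.example.com/page3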

Here's the code:

#!/usr/bin/python

import urllib2
import threading

class MyThread(threading.Thread):
  """subclass threading.Thread to create Thread instances"""
  def __init__(self, func, args):
    threading.Thread.__init__(self)
    self.func = func
    self.args = args

  def run(self):
    self.func(*self.args)  # call the target; apply() is deprecated


def get_info_from_url(url):
  """ A dummy version of the function simply visits urls and prints
the url of the page. """
  try:
    page = urllib2.urlopen(url)
  except urllib2.HTTPError, e:
    # HTTPError is a subclass of URLError, so it has to be caught first
    print "**** error ****", e.code
  except urllib2.URLError, e:
    print "**** error ****", e.reason
  else:
    ulock.acquire()
    print page.geturl()  # obviously, do something more useful here, eventually
    page.close()
    ulock.release()

ulock = threading.Lock()
num_links = 10
threads = [] # store threads here
urls = [] # store urls here

fh = open("links.txt", "r")
for line in fh:
  urls.append(line.strip())
fh.close()

# collect threads
for i in range(num_links):
  t = MyThread(get_info_from_url, (urls[i],) )
  threads.append(t)

# start the threads
for i in range(num_links):
  threads[i].start()

for i in range(num_links):
  threads[i].join()

print "all done"



