Threading / Web Page Download
David
dlashar at sprynet.com
Thu Aug 15 20:51:17 EDT 2002
Hello. In the app that I (newbie) am building, I was finding that I would
occasionally encounter lockups when downloading pages with urllib2 (and
ClientCookie, which is based on urllib2). To get past the lockups, I've
implemented threading (below). The problem is that it's an inelegant
solution: when a given download exceeds the timeout threshold, my program
effectively abandons it and starts another download attempt in another
thread (up to some max number of threads, n).

My question is: is there a way to kill the threads that have essentially
locked up? Related: what kind of "damage" (if any) do the leftover threads
cause? (This is not a heavy-use program as might be implemented in a
server; it's more for use on a local PC.) I've not really been able to
find the answers I'm looking for via docstrings, Googling, etc. Thanks in
advance for any comments provided.
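For context on the "kill the thread" question: Python's threading API provides no supported way to terminate another thread. A common workaround (a minimal sketch in present-day Python, not from the original post) is to mark workers as daemon threads, so abandoned ones at least cannot block interpreter shutdown, and to use join() with a timeout to give up waiting:

```python
import threading
import time

def slow_download():
    # Stand-in for a urllib2/ClientCookie fetch that has locked up.
    time.sleep(60)

# A daemon thread does not keep the process alive, so an abandoned
# worker cannot prevent the interpreter from exiting.
worker = threading.Thread(target=slow_download)
worker.daemon = True
worker.start()

# join() with a timeout hands control back after the deadline even if
# the thread is still running; the thread itself is never killed.
worker.join(0.2)
print(worker.is_alive())  # the hung worker is still alive, merely abandoned
```

The abandoned thread keeps holding whatever resources it acquired (a socket, memory for the partial response) until it finishes or the process exits, which is the main "damage" leftover threads do in a short-lived local program.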
import threading, time
import urllib2, ClientCookie

class PageThread(threading.Thread):
    def __init__(self, url_in, cookieobj_in = None):
        threading.Thread.__init__(self)
        self.done = 0
        self.url = url_in
        self.cookieobj = cookieobj_in
        self.requestobj = None
        self.responseobj = None
        self.pagestring = ''

    def run(self):
        if not self.cookieobj:
            self.cookieobj = ClientCookie.Cookies()
        self.requestobj = urllib2.Request(self.url)
        self.cookieobj.add_cookie_header(self.requestobj)
        # ClientCookie.urlopen() is based on urllib2.urlopen().
        self.responseobj = ClientCookie.urlopen(self.requestobj)
        # Work around a bug (june/july 2002) in ClientCookie, updating
        # cookiedict in case redirects occurred. This is a hack that works.
        # A new version of CC is supposed to be fixed.
        self.cookieobj.cookies = self.responseobj.cookies
        self.pagestring = self.responseobj.read()
        self.cookieobj.extract_cookies(self.responseobj, self.requestobj)
        self.responseobj.close()
        self.done = 1
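The lockups themselves usually come from a socket blocking forever inside the download. Python 2.3 and later (so, after the code above was written) added socket.setdefaulttimeout(), which makes urllib-style opens raise socket.timeout instead of hanging; a sketch:

```python
import socket

# Process-wide default: any socket created afterwards (including those
# opened internally by urllib-style code) raises socket.timeout if a
# connect or read stalls longer than this many seconds.
socket.setdefaulttimeout(10)
print(socket.getdefaulttimeout())
```

With a socket-level timeout in place, the download thread fails with an exception rather than locking up, so the thread-abandonment machinery becomes a fallback rather than the primary defense.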
class GetPage:
    """
    Usage (simple case):
        url = 'http://www.python.org'
        gp = GetPage(url)
        gp.setdebug(1)   # optional
        gp.getpage()
        pagestring = gp.pagestring
    """
    def __init__(self, url_in, cookieobj_in = None):
        # Note that default values for managing threads are *not* changed
        # by the reset function. The caller must manage these values as
        # desired.
        self.maxthreads = 3      # Max nbr of attempts at downloading page.
        self.nbrthreads = 0
        self.maxtimeout = 5.0    # Poll at intervals of 1, 2, 3, 4, 5 secs
        self.debug = 0           # Referenced by getpage(); see setdebug().
        self.reset(url_in, cookieobj_in)

    def setdebug(self, debug_in):
        self.debug = debug_in
    def reset(self, url_in, cookieobj_in = None):
        self.url = url_in
        self.pagestring = ''
        self.cookieobj = cookieobj_in
        self.pagethreads = []
        self.error = 0
        self.messages = []
        self.__done = 0
    def getpage(self):
        # Create a new cookieobj, necessary to ensure that the cookieobj
        # provided by the caller does not undergo unwanted change (i.e.,
        # erasure of cookiedict) that occurs with redirects in the
        # ClientCookie version as of summer 2002.
        cookieobj = ClientCookie.Cookies()
        if self.cookieobj:
            cookieobj.cookies = self.cookieobj.cookies.copy()
        self.cookieobj = cookieobj
        # Download pages in a threaded process, so as to avoid "lockups".
        # TO DO: Figure out what to do w/ (i.e., how to kill) threads
        # that don't complete w/in the timeout period.
        timeouts = range(1, int(self.maxtimeout) + 1)
        for threadnbr in range(self.maxthreads):
            if not self.__done:
                if self.debug: print 'Thread:', threadnbr
                p = PageThread(self.url, self.cookieobj)
                p.start()
                self.pagethreads.append(p)
                for timeout in timeouts:
                    time.sleep(timeout)
                    if self.debug: print 'Timeout:', timeout, 'seconds'
                    if self.pagethreads[threadnbr].done:
                        self.pagestring = \
                            self.pagethreads[threadnbr].pagestring
                        self.cookieobj = \
                            self.pagethreads[threadnbr].cookieobj
                        self.__done = 1
                        break
        # Q: Cleanup of "leftover" threads to go here?
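The escalating sleep-and-poll loop above can also be expressed with Thread.join(timeout), which blocks until the worker finishes or the deadline passes. A minimal sketch in present-day Python, where fetch is a hypothetical zero-argument callable standing in for the PageThread download logic:

```python
import threading

def fetch_with_retries(fetch, maxthreads=3, maxtimeout=5.0):
    # Start up to maxthreads attempts; abandon each after maxtimeout secs.
    for attempt in range(maxthreads):
        result = {}

        def worker(store=result):
            store['page'] = fetch()

        t = threading.Thread(target=worker)
        t.daemon = True       # an abandoned attempt cannot block exit
        t.start()
        t.join(maxtimeout)    # wait up to maxtimeout, then move on
        if 'page' in result:
            return result['page']
    return None               # every attempt timed out

print(fetch_with_retries(lambda: '<html>ok</html>', maxtimeout=1.0))
```

Each attempt gets its own result dict, so a stale thread that finishes late cannot clobber the winning attempt's page; the stale threads are still never killed, only abandoned as daemons.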