Threading / Web Page Download

Thu Aug 15 20:51:17 EDT 2002

Hello.  In the app that I (newbie) am building, I was finding that I would
occasionally encounter lockups when downloading pages with urllib2 (and
ClientCookie, which is based on urllib2).  To get past the lockups, I've
implemented threading (below).  The problem is that it's an inelegant
solution:  when a given download exceeds the timeout threshold, my program
effectively abandons it and starts another download attempt in another
thread (up to some max number of threads, n).  My question is:  Is there a
way to kill the threads that have essentially locked up?  Related:  What
kind of "damage" (if any) do the leftover threads cause?  (This is not a
heavy-use program as might be implemented in a server; it's more for use on
a local PC.)  I've not
really been able to find the answers I'm looking for via docstrings,
Googling, etc.  Thanks in advance for any comments provided.
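
One way to attack the lockups at the source, rather than abandoning a
stalled thread, is to put a timeout on the socket itself so that a stalled
download raises an error.  The following is only a minimal sketch, assuming a
modern Python 3 and its urllib.request module rather than the urllib2 and
ClientCookie setup used in this post; the function name fetch and its
parameters are hypothetical, and cookie handling is left out.

```python
import urllib.request


def fetch(url, timeout=5.0, attempts=3):
    """Return the page body as bytes, or None if every attempt fails.

    Hypothetical helper: `timeout` is in seconds and is applied to the
    blocking socket operations (connecting, reading).
    """
    for _ in range(attempts):
        try:
            # With a timeout set, a stalled server raises an exception
            # instead of blocking the caller indefinitely.
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except OSError:
            # URLError and socket timeouts are both OSError subclasses;
            # treat either one as a failed attempt and try again.
            continue
    return None
```

With this approach there are no leftover threads to clean up, because each
attempt either finishes or fails within roughly the timeout.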

import threading
import time
import urllib2

import ClientCookie

class PageThread(threading.Thread):

    def __init__(self, url_in, cookieobj_in = None):
        threading.Thread.__init__(self)
        self.done = 0
        self.url = url_in
        self.cookieobj = cookieobj_in
        self.requestobj = None
        self.responseobj = None
        self.pagestring = ''

    def run(self):
        if not self.cookieobj:
            self.cookieobj = ClientCookie.Cookies()
        self.requestobj = urllib2.Request(self.url)
        self.cookieobj.add_cookie_header(self.requestobj)
        # ClientCookie.urlopen() is based on urllib2.urlopen().
        self.responseobj = ClientCookie.urlopen(self.requestobj)
        # Need to fix bug (June/July 2002) in ClientCookie, updating
        # cookiedict in case redirects occurred.  This is a hack that
        # works.  New version of CC is supposed to be fixed.
        self.cookieobj.cookies = self.responseobj.cookies
        self.pagestring = self.responseobj.read()
        self.cookieobj.extract_cookies(self.responseobj, self.requestobj)
        self.responseobj.close()
        self.done = 1

class GetPage:
    """
    Usage (simple case):

        url = 'http://www.python.org'
        gp = GetPage(url)
        gp.setdebug(1)  # optional
        gp.getpage()
        pagestring = gp.pagestring

    """

    def __init__(self, url_in, cookieobj_in = None):

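        # Note that default values for managing threads are *not* changed by
        # the reset function.  The caller must manage these values as desired.
        self.maxthreads = 3     # Max nbr of attempts at downloading page.
        self.nbrthreads = 0
        self.maxtimeout = 5.0   # Poll after waits of 1, 2, 3, 4, and 5 secs
        self.debug = 0          # Read by getpage(); set via setdebug().
        self.reset(url_in, cookieobj_in)

    def setdebug(self, flag):
        # Assumed helper: the usage docstring calls it, but the code as
        # posted does not define it.
        self.debug = flag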

    def reset(self, url_in, cookieobj_in = None):
        self.url = url_in
        self.pagestring = ''
        self.cookieobj = cookieobj_in
        self.pagethreads = []
        self.error = 0
        self.messages = []
        self.__done = 0

    def getpage(self):

        # Create new cookieobj, necessary to ensure
        # that the cookieobj provided by caller does not undergo
        # unwanted change (i.e., erasure of cookiedict) that occurs
        # with redirects in ClientCookie version as of summer 2002.
        cookieobj = ClientCookie.Cookies()
        if self.cookieobj:
            cookieobj.cookies = self.cookieobj.cookies.copy()
        self.cookieobj = cookieobj

        # Download the page in a separate thread, so as to avoid "lockups".
        # TO DO:  Figure out what to do w/ (i.e., how to kill) threads
        # that don't complete w/in the timeout period.
        # int() because maxtimeout is a float and range() wants integers.
        timeouts = range(1, int(self.maxtimeout) + 1)
        for threadnbr in range(self.maxthreads):
            if self.__done:
                break
            if self.debug:  print 'Thread:', threadnbr
            p = PageThread(self.url, self.cookieobj)
            p.start()
            self.pagethreads.append(p)
            # Check on this thread after waits of 1, 2, ... seconds; if it
            # never finishes, abandon it and start the next attempt.
            for timeout in timeouts:
                time.sleep(timeout)
                if self.debug: print 'Timeout:', timeout, 'seconds'
                if p.done:
                    self.pagestring = p.pagestring
                    self.cookieobj = p.cookieobj
                    self.__done = 1
                    break

        # Q: Cleanup of "leftover" threads to go here?
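
On the cleanup question:  Python's threading module offers no call to kill
a thread, so a stalled thread can only be abandoned.  What can be controlled
is whether an abandoned thread keeps the program alive at exit, which is what
the daemon flag is for.  Below is a minimal sketch, assuming a modern
Python 3; run_with_timeout and slow_download are hypothetical names, and
slow_download merely stands in for a real download.

```python
import threading
import time


def run_with_timeout(func, timeout, *args):
    """Run func(*args) in a daemon thread.

    Returns (True, result) if func finished without error within
    `timeout` seconds, otherwise (False, None).  A thread that is still
    running is abandoned, not killed.
    """
    outcome = {}

    def worker():
        try:
            outcome["result"] = func(*args)
        except Exception as exc:
            outcome["error"] = exc

    # daemon=True: an abandoned thread will not keep the process alive
    # when the main program exits.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)  # wait at most `timeout` seconds
    if thread.is_alive() or "error" in outcome:
        return False, None
    return True, outcome["result"]


def slow_download(delay):
    """Hypothetical stand-in for a download that takes `delay` seconds."""
    time.sleep(delay)
    return "page"
```

An abandoned thread still holds its socket and memory until its blocking
call returns or the process exits, so for a light-use program on a local PC
the cost is mostly a few idle threads.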