[long] Comments on urllib2 extensions?

John J. Lee jjl at pobox.com
Wed Jun 18 09:31:18 EDT 2003


I'm looking for comments on an extension to urllib2 (see below for
code, but read the notes first!).

Apologies for posting here instead of to the SF patch manager (see below
why).

Problem with urllib2 as it stands: many things would be nice to
implement as a Handler rather than by overriding methods (and
inevitably duplicating code and increasing fragility), but it's not
always possible to do so.  For example (all HTTP-related): automatically
adding Referer headers, handling 200 responses that should have been
non-2xx errors if the server were sane, handling cookies, handling
HTTP-EQUIV headers as if they were normal HTTP headers, automatically
making responses seekable, and following Refresh headers.  I've done
all these things, but I had to duplicate code to do it, because I
don't see how to do it with Handlers.

Two things I'd like to know: 1. will my solution break people's code?
2. is there a better way?

For 1., I *think* it shouldn't break code.

For 2., the obvious problem with my solution (below) is that Handlers
are pretty similar to my Processors already.  The thing is, I can't
see how to reimplement these things in terms of Handlers.  First, I
need to *see* all requests (responses) -- I can do that using Handlers
by giving them low (high) .handler_order in Python 2.3 and returning
None from http_open (http_error_xxx).  However, 200 responses never
get seen by http_error_xxx, so that doesn't work (and changing that
would break old code).  Second, I sometimes need to actually modify
the requests and responses.  Sometimes I'd much rather do that by
making a new request or response than modifying the old one in-place
-- and in general, even if I *am* just modifying in-place, I'd still
prefer to explicitly return the object rather than rely on side-effects.  I
suppose just adding a couple of hooks to AbstractHTTPHandler might get
the job done, but I think the increased simplicity of
AbstractHTTPHandler.do_open and the various Processors makes my
approach worthwhile (assuming it actually works & is
backwards-compat., of course...).
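
For concreteness, the Handler-based trick I mean above looks roughly
like this (a sketch only; the class name is mine):

class RequestWatcher(urllib2.BaseHandler):
    # low handler_order, so this runs before the real HTTP handler
    # (handler_order is new in Python 2.3)
    handler_order = 100
    def http_open(self, req):
        # see (or modify in-place) every outgoing request, then return
        # None so the next handler actually opens it
        self.last_req = req
        return None

That's fine for requests, but there's no equivalent for responses,
because 200 responses never reach http_error_xxx.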

The code is below, but note:

I'm not posting this as a patch yet, partly because the core Python
people are likely particularly busy ATM (2.3 etc.), partly because it
*isn't* a patch yet (it's code I just wrote and am wondering whether
or not to put in my ClientCookie package, which extends urllib2).
However, a patch would look almost identical.  It's not even working
code (not much work to get it working... but I think I want to write
some unit tests for urllib2 first).  Note also that some
implementation code isn't reproduced here (notably the Cookies class,
parse_head and seek_wrapper -- all from ClientCookie, and all of which
I hope eventually to get into the standard library, when they're simple
& stable enough).

You use it like this:

# just pass Processors to build_opener as if they were Handlers
opener = build_opener(FooHandler, BarProcessor, BazHandler)
response = opener.open("http://www.example.com/")
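
# a Processor that takes constructor arguments can be passed in as an
# instance instead of a class; e.g. (Cookies is the ClientCookie class
# mentioned above):
cookies = Cookies()
opener = build_opener(HTTPCookieProcessor(cookies), HTTPRefererProcessor)
response = opener.open("http://www.example.com/")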


Comments??

<waits for deafening silence>


John

#------------------------------------------------------------------------------
# (imports needed by the code below)
import copy, socket, string, types, urllib, urllib2, httplib
from urllib import addinfourl
from urllib2 import Request, URLError
# Cookies, parse_head, request_host and seek_wrapper come from ClientCookie
# and aren't reproduced here.
# Note that this is useful also for the case where a urllib2 user wants to
# handle 200 responses that should have been errors (which unfortunately
# does happen -- the content will be something informative like
# "<html>An error occurred.</html>", but the code is still 200).
# XXX *try* this to check following is true!: It's particularly useful
#  when you want to retry fetching pages on certain errors.

# Maybe something similar could be done just by sticking in a hook in
# AbstractHTTPHandler.do_open right at the start (for request munging)
# and just before redirection (for response munging)?  I think this is
# nicer, though.

class BaseProcessor:
    processor_order = 500

    def add_parent(self, parent):
        self.parent = parent
    def close(self):
        self.parent = None
    def __lt__(self, other):
        return self.processor_order < other.processor_order
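
# As a quick illustration of the Processor interface (an example only,
# not part of the proposed patch -- the header value is invented):
class UserAgentProcessor(BaseProcessor):
    """Add a User-Agent header to each request, unless already present."""
    def http_request(self, request):
        if not request.headers.has_key("User-agent"):
            request.add_header("User-agent", "ExampleClient/0.1")
        return request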

class HTTPEquivProcessor(BaseProcessor):
    """Append META HTTP-EQUIV headers to regular HTTP headers."""
    def http_response(self, request, response):
        if not hasattr(response, "seek"):
            response = seek_wrapper(response)
        # grab HTTP-EQUIV headers and add them to the true HTTP headers
        headers = response.info()
        for hdr, val in parse_head(response):
            headers[hdr] = val
        response.seek(0)
        return response

class SeekableProcessor(BaseProcessor):
    """Make responses seekable."""
    # XXX perhaps this should come after ErrorProcessor
    # (.processor_order > 1000)
    def http_response(self, request, response):
        if not hasattr(response, "seek"):
            return seek_wrapper(response)
        return response

# XXX really, unverifiable should be an attribute / method on Request -- user
# may want to make unverifiable requests directly
class HTTPCookieProcessor(BaseProcessor):
    """Handle HTTP cookies."""
    def __init__(self, cookies=None):
        if cookies is None:
            cookies = Cookies()
        self.cookies = cookies

    def http_request(self, request):
        if hasattr(request, "error_302_dict") and request.error_302_dict:
            redirect = True
        else:
            redirect = False
            # Stuff request-host of this origin transaction into Request
            # object, because we need to know it to know whether cookies
            # should be in operation during derived requests (redirects,
            # specifically).
            request.origin_req_host = request_host(request)
        self.cookies.add_cookie_header(request, unverifiable=redirect)
        return request

    def http_response(self, request, response):
        if hasattr(request, "error_302_dict") and request.error_302_dict:
            redirect = True
        else:
            redirect = False
        self.cookies.extract_cookies(response, request, unverifiable=redirect)
        return response

class HTTPRefererProcessor(BaseProcessor):
    """Add Referer header to requests.

    This only makes sense if you use each RefererProcessor for a single
    chain of requests only.

    """
    def __init__(self):
        self.referer = None

    def http_request(self, request):
        if ((self.referer is not None) and
            not request.headers.has_key("Referer")):
            request.add_header("Referer", self.referer)
        return request

    def http_response(self, request, response):
        self.referer = response.geturl()
        return response
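
# e.g. (usage sketch, not part of the patch): one HTTPRefererProcessor
# per logical chain of requests:
#
#   opener = build_opener(HTTPRefererProcessor())
#   opener.open("http://www.example.com/")
#   opener.open("http://www.example.com/page2")  # gets a Referer header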

class HTTPStandardHeadersProcessor(BaseProcessor):
    """Add standard headers (Content-type, Content-length, Host) to requests.

    This is the header-adding code from AbstractHTTPHandler.do_open, moved
    here; headers go onto a copy of the Request rather than straight to
    the connection.

    """
    def http_request(self, request):
        new_req = copy.copy(request)  # copy, for backwards-compat.

        if new_req.has_data():  # POST
            data = new_req.get_data()
            if not new_req.headers.has_key('Content-type'):
                new_req.add_header('Content-type',
                                   'application/x-www-form-urlencoded')
            if not new_req.headers.has_key('Content-length'):
                new_req.add_header('Content-length', '%d' % len(data))

        scheme, sel = urllib.splittype(new_req.get_selector())
        sel_host, sel_path = urllib.splithost(sel)
        if not new_req.headers.has_key('Host'):
            new_req.add_header('Host', sel_host or new_req.get_host())
        for name, value in self.parent.addheaders:
            name = name.capitalize()
            if not new_req.headers.has_key(name):
                new_req.add_header(name, value)

        return new_req


# XXX Problems:
# Existence of ErrorProcessor is necessary for correct behaviour of
#  the other processors (they need to happen before redirection).
# But:
#  ErrorProcessor and RefreshProcessor must come after the rest.
#  Will it break old code if applied to urllib2?  People who override
#   AbstractHTTPHandler.do_open will do redirection there, which *will*
#   break Processors -- but since all the Processors I'm suggesting
#   (other than ErrorProcessor itself) just add new features and
#   would be off by default, this isn't a problem.  ErrorProcessor is
#   old code in a new form, and I think adds no problems not introduced
#   by overriding do_open in the first place.
#  Response object doesn't include response code or message (because
#   code is always 200).  This isn't a showstopper: could always add
#   code and msg attributes.  End user still only gets 200 responses,
#   because ErrorProcessor ensures that (everything else gets raised
#   as an exception).

class HTTPRefreshProcessor(BaseProcessor):
    """Perform HTTP Refresh redirections.

    Note that if a non-200 HTTP code has occurred (for example, a 30x
    redirect), this processor will do nothing.

    """
    processor_order = 1000

    def http_response(self, request, response):
        code, msg, hdrs = response.code, response.msg, response.info()

        if code == 200 and hdrs.has_key("refresh"):
            refresh = hdrs["refresh"]
            i = string.find(refresh, ";")
            if i != -1:
                time, newurl_spec = refresh[:i], refresh[i+1:]
                i = string.find(newurl_spec, "=")
                if i != -1:
                    if int(time) == 0:
                        newurl = newurl_spec[i+1:]
                        # fake a 302 response
                        hdrs["location"] = newurl
                        response = self.parent.error(
                            'http', request, response, 302, msg, hdrs)

        return response
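
# (For reference, a Refresh header as handled above looks like
#  "Refresh: 0; url=http://example.com/next"; only zero-delay refreshes
#  are followed, by faking a 302.)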

class HTTPErrorProcessor(BaseProcessor):
    """Process non-200 HTTP error responses.

    This just passes the job on to the Handler.<scheme>_error_<code>
    methods, via the OpenerDirector.error method.

    """
    processor_order = 1000

    def http_response(self, request, response):
        code, msg, hdrs = response.code, response.msg, response.info()

        if code != 200:
            # XXX fp has been replaced by response here -- can that cause
            #   any trouble?  Same goes for RefreshProcessor.
            # XXX redundancy -- response already contains code, msg, hdrs.
            #   Oh well, too bad.
            response = self.parent.error(
                'http', request, response, code, msg, hdrs)

        return response  # (this runs last anyway, given .processor_order)

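# With HTTPErrorProcessor installed, non-200 responses surface just as
# they do in stock urllib2, e.g.:
#
#   try:
#       response = opener.open("http://www.example.com/missing")
#   except urllib2.HTTPError, e:
#       print e.code, e.msg
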
class OpenerDirector(urllib2.OpenerDirector):
    # XXX might be useful to have remove_processor, too (say you want to
    #   set a new RefererProcessor, but keep the old CookieProcessor --
    #   could always just create everything anew, though (using old
    #   Cookies object to create CookieProcessor))
    def __init__(self):
        urllib2.OpenerDirector.__init__(self)
        self.processors = []
        self.process_response = {}
        self.process_request = {}

    def add_processor(self, processor):
        added = False
        for meth in dir(processor):
            if meth[-9:] == "_response":
                protocol = meth[:-9]
                if self.process_response.has_key(protocol):
                    self.process_response[protocol].append(processor)
                    self.process_response[protocol].sort()
                else:
                    self.process_response[protocol] = [processor]
                added = True
                continue
            elif meth[-8:] == "_request":
                protocol = meth[:-8]
                if self.process_request.has_key(protocol):
                    self.process_request[protocol].append(processor)
                    self.process_request[protocol].sort()
                else:
                    self.process_request[protocol] = [processor]
                added = True
                continue
        if added:
            self.processors.append(processor)
            # XXX base class sorts .handlers, but I have no idea why
            #self.processors.sort()
            processor.add_parent(self)

    def _request(self, url_or_req, data):
        if isinstance(url_or_req, basestring):
            req = Request(url_or_req, data)
        else:
            # already a urllib2.Request instance
            req = url_or_req
            if data is not None:
                req.add_data(data)
        return req

    def open(self, fullurl, data=None):
        req = self._request(fullurl, data)
        scheme = req.get_type()

        for processor in self.process_request.get(scheme, []):
            meth = getattr(processor, scheme+"_request")
            req = meth(req)
            scheme = req.get_type()  # XXX good / bad / unnecessary?

        response = urllib2.OpenerDirector.open(self, req)

        for processor in self.process_response.get(scheme, []):
            meth = getattr(processor, scheme+"_response")
            response = meth(req, response)

        return response

    def close(self):
        urllib2.OpenerDirector.close(self)
        for processor in self.processors:
            processor.close()
        self.processors = []


# Note the absence of redirect and header-adding code here
# (AbstractHTTPHandler), and the lack of other clutter that would be
# here without Processors.
class AbstractHTTPHandler(urllib2.BaseHandler):
    def do_open(self, http_class, req):
        host = req.get_host()
        if not host:
            raise URLError('no host given')

        h = http_class(host) # will parse host:port

        if req.has_data():
            h.putrequest('POST', req.get_selector())
        else:
            h.putrequest('GET', req.get_selector())

        for k, v in req.headers.items():
            h.putheader(k, v)
        # httplib will attempt to connect() here.  be prepared
        # to convert a socket error to a URLError.
        try:
            h.endheaders()
        except socket.error, err:
            raise URLError(err)
        if req.has_data():
            h.send(req.get_data())

        code, msg, hdrs = h.getreply()
        fp = h.getfile()

        response = addinfourl(fp, hdrs, req.get_full_url())

        return response


# (this wouldn't be in a patch, I'm just deriving from the simplified
# AbstractHTTPHandler)
class HTTPHandler(AbstractHTTPHandler):
    def http_open(self, req):
        return self.do_open(httplib.HTTP, req)
if hasattr(httplib, 'HTTPS'):
    class HTTPSHandler(AbstractHTTPHandler):
        def https_open(self, req):
            return self.do_open(httplib.HTTPS, req)


def build_opener(*handlers):
    """Create an opener object from a list of handlers and processors.

    The opener will use several default handlers and processors, including
    support for HTTP and FTP.

    If any of the handlers passed as arguments are subclasses of the
    default handlers, the default handlers will not be used.
    """
    opener = OpenerDirector()  # the subclass above, which knows Processors
    default_classes = [
        # handlers
        urllib2.ProxyHandler,
        urllib2.UnknownHandler,
        HTTPHandler,  # from this module (derived from new AbstractHTTPHandler)
        urllib2.HTTPDefaultErrorHandler,
        urllib2.HTTPRedirectHandler,
        urllib2.FTPHandler,
        urllib2.FileHandler,
        # processors
        # don't use most processors by default, for backwards compatibility
        #HTTPEquivProcessor,
        #SeekableProcessor,
        #HTTPCookieProcessor,
        #HTTPRefererProcessor,
        HTTPStandardHeadersProcessor,
        #HTTPRefreshProcessor,
        HTTPErrorProcessor
        ]
    if hasattr(httplib, 'HTTPS'):
        default_classes.append(HTTPSHandler)
    skip = []
    for klass in default_classes:
        for check in handlers:
            if type(check) == types.ClassType:
                if issubclass(check, klass):
                    skip.append(klass)
            elif type(check) == types.InstanceType:
                if isinstance(check, klass):
                    skip.append(klass)
    for klass in skip:
        default_classes.remove(klass)

    to_add = []
    for klass in default_classes:
        to_add.append(klass())
    for h in handlers:
        if type(h) == types.ClassType:
            h = h()
        to_add.append(h)

    for instance in to_add:
        # yuck
        if hasattr(instance, "processor_order"):
            opener.add_processor(instance)
        else:
            opener.add_handler(instance)

    return opener
#------------------------------------------------------------------------------



