[Python-bugs-list] [ python-Feature Requests-759792 ] Make urllib2 more extensible (patch)
SourceForge.net
noreply@sourceforge.net
Thu, 31 Jul 2003 15:15:24 -0700
Feature Requests item #759792, was opened at 2003-06-24 13:16
Message generated for change (Comment added) made by jhylton
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=355470&aid=759792&group_id=5470
Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: John J Lee (jjlee)
Assigned to: Nobody/Anonymous (nobody)
Summary: Make urllib2 more extensible (patch)
Initial Comment:
Problem with urllib2 as it stands: many things would be
nice to implement as a handler rather than by overriding
methods (and inevitably duplicating code and increasing
fragility), but it's not always possible to do so. For
example (all from HTTP), automatically adding Referer
headers, handling 200 responses that should have been
non-2xx errors if the server were sane, handling cookies,
handling HTTP-EQUIV headers as if they were normal
HTTP headers, automatically making responses
seekable, and following Refresh headers. I've done all
these things, but I had to duplicate code to do it,
because I don't see how to do it with handlers. I've now
rewritten this code by adding a 'processor' scheme to
urllib2 (I'm *not* using 'scheme' in the technical URL
sense here!).
Processors work quite similarly to handlers, except that
1. They always *all* get run -- unlike handlers, where only
the first one that handles a request or response runs.
2. The methods that get called on processors are of the
form <proto>_request and <proto>_response, and are
called, respectively, immediately before and immediately
after the normal OpenerDirector.open machinery.
http_request, for example, gets called on all processors
before, and pre-processes HTTP requests; http_response
post-processes HTTP responses.
3. <proto>_request methods return request objects, and
<proto>_response methods return response objects.
4. Even 200 responses get processed.
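To make that concrete, here's a rough sketch of what a processor
could look like under this scheme (the names and method signatures
are my illustration, not the patch itself; Request here is a minimal
stand-in for urllib2.Request):

```python
# Illustrative sketch of the processor interface described above.
# Request is a minimal stand-in, NOT urllib2.Request.
class Request:
    def __init__(self, url, headers=None):
        self.url = url
        self.headers = headers or {}

class HTTPRefererProcessor:
    """Adds a Referer header to each HTTP request (hypothetical sketch)."""
    def __init__(self):
        self.referer = None

    def http_request(self, request):
        # Called on *every* request, immediately before the normal
        # OpenerDirector.open machinery.
        if self.referer is not None and "Referer" not in request.headers:
            request.headers["Referer"] = self.referer
        return request        # processors return the (possibly new) request

    def http_response(self, request, response):
        # Called on *every* response, even 200s.
        self.referer = request.url
        return response       # processors return the (possibly new) response
```

The point is that every processor's methods run on every request and
response, and each method returns an object instead of relying on
in-place mutation.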
You use it like this:
# just pass processors to build_opener as if they were handlers
opener = build_opener(FooHandler, BarProcessor, BazHandler)
response = opener.open("http://www.example.com/")
I've reimplemented all my stuff (the features listed in the
first paragraph, above) in terms of this scheme, and it all
seems to be working fine (but no unit tests yet). So, the
scheme does achieve the extensibility it aims for. The
patch I've attached here doesn't include most of those
features -- the only new functionality it adds is an
HTTPRefererProcessor. If this gets accepted, I intend to
submit patches adding new processors for cookie
handling etc. later.
Two things I'd like to know: 1. will my solution break
people's code? 2. is there a better way?
For 1., I *think* it shouldn't break code.
For 2., the obvious problem with my solution (below) is
that handlers are pretty similar to my processors already.
The thing is, I can't see how to reimplement these things
in terms of handlers. First, I need to *see* all requests
(responses) -- I can do that using handlers by giving
them low (high) .handler_order in Python 2.3 and
returning None from http_open (http_error_xxx).
However, 200 responses never get seen by
http_error_xxx, so that doesn't work (and changing that
would break old code). Second, I need to actually
modify the requests and responses. Sometimes I'd much
rather do that by making a new request or response than
modifying the old one in-place (redirections, for
example) -- and in general, even if I *am* just modifying
in-place, I'd still prefer to explicitly return the object rather
than rely on side-effects. Perhaps just adding a couple of
hooks to AbstractHTTPHandler might get these jobs
done, but I think the increased simplicity of
AbstractHTTPHandler.do_open and the various
processors makes my approach worthwhile (assuming it
actually works & is backwards-compat., of course...).
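For reference, the handler-based workaround described above looks
roughly like this (Python 2.3-era urllib2 conventions; names
illustrative) -- it covers the request side, but 200 responses never
reach it:

```python
# Sketch of the handler-based workaround: a handler with a very low
# handler_order runs before the default handlers, and returning None
# from http_open passes the request on to the next handler.
class ObservingHandler:
    handler_order = 100      # run before the default handlers

    def __init__(self):
        self.seen = []

    def http_open(self, request):
        self.seen.append(request)   # observe the request...
        return None                 # ...but decline to handle it
```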
Comments?
A few notes:
Some headers (Content-Length, Referer, ...) mustn't be
copied to requests for a redirected URL. This requires
the addition of a new dict to Request. I've added an
add_unredirected_headers method, too. The old
implementation just sends these headers directly, but
that's not possible if you want to use processors to
implement these things.
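A rough sketch of the idea (attribute and method names guessed from
the description above, not taken from the diff):

```python
# Sketch: a second header dict whose entries do NOT survive a redirect.
class Request:
    def __init__(self, url):
        self.url = url
        self.headers = {}              # copied to redirected requests
        self.unredirected_hdrs = {}    # NOT copied on redirect

    def add_header(self, key, val):
        self.headers[key] = val

    def add_unredirected_header(self, key, val):
        self.unredirected_hdrs[key] = val

def redirect(request, new_url):
    """Build the follow-up request for a redirect."""
    new_req = Request(new_url)
    new_req.headers = dict(request.headers)  # ordinary headers carry over
    # request.unredirected_hdrs (Content-Length, Referer, ...) are
    # deliberately dropped for the new URL
    return new_req
```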
The current response object (httplib.HTTPResponse,
wrapped with urllib.addinfourl) doesn't include response
code or message (because code is always 200). The
patch just assigns .code and .msg attributes (maybe they
should be methods, for uniformity with the rest of the
response interface).
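For illustration, a minimal stand-in for the wrapped response with
the attributes the patch assigns (names follow the description above,
not the actual diff):

```python
# Minimal stand-in for the addinfourl-style wrapper, with the .code
# and .msg attributes the patch assigns (the stock wrapper lacks them
# because only 200 responses previously reached the user).
class WrappedResponse:
    def __init__(self, fp, headers, url, code, msg):
        self.fp = fp
        self.headers = headers
        self.url = url
        self.code = code    # HTTP status, e.g. 200
        self.msg = msg      # reason phrase, e.g. "OK"
```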
Backwards-compatibility notes:
People who override AbstractHTTPHandler.do_open will
do non-200 response handling there, which will break
processors, but that's a forwards-compat. issue. I don't
think the existence of overridden do_open methods in old
code should be a problem for backwards-compatibility.
Note that, though processors see all responses, the end
user still only gets 200 responses returned.
ErrorProcessor ensures that by passing non-200
responses on to the existing urllib2 error machinery.
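Roughly (simplified sketch; the real patch would dispatch through the
parent OpenerDirector's existing http_error_* machinery rather than
raising directly, and the names here are mine):

```python
# Sketch of ErrorProcessor's role: pass 2xx responses through,
# divert everything else to the error machinery.
class HTTPError(Exception):
    def __init__(self, code):
        self.code = code

class ErrorProcessor:
    def http_response(self, request, response):
        if not (200 <= response.code < 300):
            # Stands in for handing the response to the existing
            # urllib2 error handlers (http_error_<code> etc.).
            raise HTTPError(response.code)
        return response
```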
John
----------------------------------------------------------------------
>Comment By: Jeremy Hylton (jhylton)
Date: 2003-07-31 22:15
Message:
In principle, I'm in favor of this. I'd like to take some
time to review the design and code.
----------------------------------------------------------------------
Comment By: John J Lee (jjlee)
Date: 2003-07-08 15:13
Message:
I just noticed the patch breaks on https. Trivially fixed by
adding lines like https_request = http_request to the various
processor classes.
Also, another use case: gzip Content-encoding.
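For illustration, both points combined in one sketch (class and
attribute names are mine; Response is a stand-in, not the real
wrapped response object):

```python
import gzip

# Stand-in response object for the sketch.
class Response:
    def __init__(self, headers, data):
        self.headers = headers
        self.data = data

class GzipProcessor:
    """Transparently decodes gzip Content-encoding (hypothetical sketch)."""
    def http_response(self, request, response):
        if response.headers.get("Content-encoding") == "gzip":
            response.data = gzip.decompress(response.data)
        return response

    # the trivial https fix mentioned above: alias the https_* method
    # to the http_* one
    https_response = http_response
```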
----------------------------------------------------------------------