[Python-bugs-list] [ python-Feature Requests-759792 ] Make urllib2 more extensible (patch)

Thu, 31 Jul 2003 15:15:24 -0700

Feature Requests item #759792, was opened at 2003-06-24 13:16
Message generated for change (Comment added) made by jhylton
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=355470&aid=759792&group_id=5470

Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: John J Lee (jjlee)
Assigned to: Nobody/Anonymous (nobody)
Summary: Make urllib2 more extensible (patch)

Initial Comment:
Problem with urllib2 as it stands: many things would be 
nice to implement as a handler rather than by overriding 
methods (and inevitably duplicating code and increasing 
fragility), but it's not always possible to do so.  For 
example (all from HTTP), automatically adding Referer 
headers, handling 200 responses that should have been 
non-2xx errors if the server were sane, handling cookies, 
handling HTTP-EQUIV headers as if they were normal 
HTTP headers, automatically making responses 
seekable, and following Refresh headers.  I've done all 
these things, but I had to duplicate code to do it, 
because I don't see how to do it with handlers.  I've now 
rewritten this code by adding a 'processor' scheme to 
urllib2 (I'm *not* using 'scheme' in the technical URL 
sense here!). 

Processors work quite similarly to handlers, except that 

1. They always *all* get run (unlike handlers, where only the 
first one to handle a request or response gets run). 

2. The methods that get called on processors are of the 
form <proto>_request and <proto>_response, and are 
called, respectively, immediately before and immediately 
after the normal OpenerDirector.open machinery.  
http_request, for example, gets called on all processors 
before, and pre-processes HTTP requests; http_response 
post-processes HTTP responses. 

3. <proto>_request methods return request objects, and 
<proto>_response methods return response objects. 

4. Even 200 responses get processed. 

You use it like this: 

# just pass processors to build_opener as if they were handlers
opener = build_opener(FooHandler, BarProcessor, BazHandler)
response = opener.open("http://www.example.com/")
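
To make the request/response hooks concrete, here is a minimal sketch of 
a Referer-adding processor.  It is not the HTTPRefererProcessor from the 
attached patch.  It assumes Python 3's urllib.request (the successor to 
urllib2), where handlers may define <proto>_request and <proto>_response 
methods and the response hook is called with (request, response).  
CannedHTTPHandler and RefererProcessorSketch are hypothetical names; the 
canned handler exists only so the example runs without touching the 
network, and OpenerDirector is used directly so that no default handlers 
are installed. 

```python
import io
import urllib.request
import urllib.response
from email.message import Message


class CannedHTTPHandler(urllib.request.BaseHandler):
    """Hypothetical stand-in for the real HTTP handler: no network I/O."""

    def __init__(self):
        self.seen = []  # every request that reached http_open

    def http_open(self, request):
        self.seen.append(request)
        return urllib.response.addinfourl(
            io.BytesIO(b"ok"), Message(), request.full_url, code=200)


class RefererProcessorSketch(urllib.request.BaseHandler):
    """Illustrative only; not the HTTPRefererProcessor from the patch."""

    def __init__(self):
        self.referer = None  # URL of the most recent response, if any

    def http_request(self, request):
        # Pre-process: called before the normal open machinery.
        if self.referer is not None and not request.has_header("Referer"):
            request.add_unredirected_header("Referer", self.referer)
        return request  # a request object is always returned

    def http_response(self, request, response):
        # Post-process: called after the open machinery, 200s included.
        self.referer = response.url
        return response  # a response object is always returned

    # Alias the hooks so that https URLs are processed too.
    https_request = http_request
    https_response = http_response


opener = urllib.request.OpenerDirector()
canned = CannedHTTPHandler()
opener.add_handler(canned)
opener.add_handler(RefererProcessorSketch())

opener.open("http://www.example.com/a")
opener.open("http://www.example.com/b")
```

After the two calls, the first request recorded in canned.seen carries no 
Referer header and the second carries the URL of the first response. 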

I've reimplemented all my stuff (the features listed in the 
first paragraph, above) in terms of this scheme, and it all 
seems to be working fine (but no unit tests yet).  So, the 
scheme does achieve the extensibility it aims for.  The 
patch I've attached here doesn't include most of those 
features -- the only new functionality it adds is an 
HTTPRefererProcessor.  If this gets accepted, I intend to 
submit patches adding new processors for cookie 
handling etc. later. 

Two things I'd like to know: 1. will my solution break 
people's code? 2. is there a better way? 

For 1., I *think* it shouldn't break code. 

For 2., the obvious problem with my solution (below) is 
that handlers are pretty similar to my processors already.  
The thing is, I can't see how to reimplement these things 
in terms of handlers.  First, I need to *see* all requests 
(responses) -- I can do that using handlers by giving 
them low (high) .handler_order in Python 2.3 and 
returning None from http_open (http_error_xxx).  
However, 200 responses never get seen by 
http_error_xxx, so that doesn't work (and changing that 
would break old code).  Second, I need to actually 
modify the requests and responses.  Sometimes I'd much 
rather do that by making a new request or response than 
modifying the old one in-place (redirections, for 
example) -- and in general, even if I *am* just modifying 
in-place, I'd still prefer to explicitly return the object rather 
than rely on side-effects.  Perhaps just adding a couple of 
hooks to AbstractHTTPHandler might get these jobs 
done, but I think the increased simplicity of 
AbstractHTTPHandler.do_open and the various 
processors makes my approach worthwhile (assuming it 
actually works & is backwards-compat., of course...). 

Comments? 

A few notes: 

Some headers (Content-Length, Referer, ...)  mustn't be 
copied to requests for a redirected URL.  This requires 
the addition of a new dict to Request.  I've added an 
add_unredirected_headers method, too.  The old 
implementation just sends these headers directly, but 
that's not possible if you want to use processors to 
implement these things. 
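
A small sketch of the two header dicts, assuming Python 3's 
urllib.request.Request, where the ordinary headers live in .headers, the 
unredirected ones in .unredirected_hdrs, and the method is spelled 
add_unredirected_header (singular).  The request_for_redirect helper is 
hypothetical and only illustrates the idea of copying the ordinary 
headers to the follow-up request; it is not the patch's redirect code. 

```python
import urllib.request

req = urllib.request.Request("http://www.example.com/form")
req.add_header("Accept", "text/html")
# Stored separately, so it is not carried over to a redirected URL.
req.add_unredirected_header("Referer", "http://www.example.com/")


def request_for_redirect(old, new_url):
    """Hypothetical helper: build the follow-up request for a redirect."""
    # Only the ordinary headers are copied; unredirected ones stay behind.
    return urllib.request.Request(new_url, headers=dict(old.headers))


redirected = request_for_redirect(req, "http://www.example.com/moved")
```

Here req.has_header("Referer") is true, while the redirected request has 
the Accept header but no Referer. 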

The current response object (httplib.HTTPResponse, 
wrapped with urllib.addinfourl) doesn't include response 
code or message (because code is always 200).  The 
patch just assigns .code and .msg attributes (maybe they 
should be methods, for uniformity with the rest of the 
response interface). 

Backwards-compatibility notes: 

People who override AbstractHTTPHandler.do_open will 
do non-200 response handling there, which will break 
processors, but that's a forwards-compat. issue.  I don't 
think the existence of overridden do_open methods in old 
code should be a problem for backwards-compatibility. 

Note that, though processors see all responses, the end 
user still only gets 200 responses returned.  
ErrorProcessor ensures that by passing non-200 
responses on to the existing urllib2 error machinery. 
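
A minimal sketch of that idea, not the ErrorProcessor from the patch.  It 
assumes Python 3's urllib.request, where a processor can reach the 
opener's error machinery through self.parent.error(...).  
CannedStatusHandler, ErrorProcessorSketch and make_opener are 
hypothetical names, and the canned handler replaces the real HTTP handler 
so nothing goes over the network. 

```python
import io
import urllib.error
import urllib.request
import urllib.response
from email.message import Message


class CannedStatusHandler(urllib.request.BaseHandler):
    """Hypothetical stand-in for the HTTP handler; returns a fixed status."""

    def __init__(self, code, msg):
        self.code, self.msg = code, msg

    def http_open(self, request):
        response = urllib.response.addinfourl(
            io.BytesIO(b""), Message(), request.full_url, code=self.code)
        response.msg = self.msg  # plain attribute next to .code
        return response


class ErrorProcessorSketch(urllib.request.BaseHandler):
    """Illustrative only; not the ErrorProcessor from the patch."""

    def http_response(self, request, response):
        code, msg, hdrs = response.code, response.msg, response.info()
        if code != 200:
            # Hand anything else to the existing error machinery.
            response = self.parent.error(
                "http", request, response, code, msg, hdrs)
        return response

    https_response = http_response


def make_opener(code, msg):
    opener = urllib.request.OpenerDirector()
    opener.add_handler(CannedStatusHandler(code, msg))
    opener.add_handler(ErrorProcessorSketch())
    # Raises HTTPError when no more specific http_error_xxx handler exists.
    opener.add_handler(urllib.request.HTTPDefaultErrorHandler())
    return opener


ok = make_opener(200, "OK").open("http://www.example.com/")
try:
    make_opener(404, "Not Found").open("http://www.example.com/missing")
    failed_code = None
except urllib.error.HTTPError as err:
    failed_code = err.code
```

The 200 response comes back to the caller unchanged, while the 404 
surfaces as an HTTPError, so the end user still only gets 200 responses 
returned. 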

John 

----------------------------------------------------------------------

>Comment By: Jeremy Hylton (jhylton)
Date: 2003-07-31 22:15

Message:
Logged In: YES 
user_id=31392

In principle, I'm in favor of this.  I'd like to take some 
time to review the design and code.

----------------------------------------------------------------------

Comment By: John J Lee (jjlee)
Date: 2003-07-08 15:13

Message:
Logged In: YES 
user_id=261020

I just noticed the patch breaks on https.  Trivially fixed by 
adding lines like https_request = http_request to the various 
processor classes. 

Also, another use case: gzip Content-encoding. 

----------------------------------------------------------------------
