urllib, urllib2, httplib -- Begging for consolidation?

Wed Jun 5 10:30:48 EDT 2002

On Tue, 4 Jun 2002 brueckd at tbye.com wrote:
[...]
> > I suppose the standard answer applies here: it isn't there because
> > nobody has written it yet.
>
> Argh... this misses the whole point of the thread. Never mind, I repent
> for having revived this topic.

I'm trying to understand your posts on this, which are interesting. I just
ported the HTTP cookie handling code from libwww-perl (and integrated it
into urllib2), and since I seem to have got into this HTTP business, I may
(eventually, not for a month or two) experiment with adding features to /
restructuring urllib2 & co.  Having read the original thread (your posts,
at least), I still think we are (mostly) agreeing violently.  If not,
apologies for my slowness.

[quotes pasted from several different postings, all by Dave]
> If there's some division of the feature set (base protocol and higher
> level) it would make sense for the two modules to be httpcore and
> httplib (or something) with urllib built on top.

OK, my main point here: the code is already structured like this.  Your
view of how it should be differs from how it actually is in only one
respect: the conceptual units you would like to label httplib and urllib
happen to live in the same module (urllib2) in the current implementation:

httpcore --> httplib
httplib --> urllib2.AbstractHTTPHandler (which is missing features)
urllib --> urllib2.OpenerDirector, urllib2.build_opener, urllib2.urlopen
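
A minimal sketch of that three-layer mapping, for readers who want to poke
at it.  It assumes the Python 3 module names (http.client for httplib,
urllib.request for urllib2), which is not what the thread was written
against; layer_names is a hypothetical helper, and nothing here touches
the network.

```python
import http.client
import urllib.request

# Layer 1 ("httpcore"): the bare protocol client. Constructing the
# object does not open a socket; that happens on the first request.
conn = http.client.HTTPConnection("www.example.com")

# Layer 2 ("httplib"): the handler that drives the protocol client
# on behalf of an opener. It derives from AbstractHTTPHandler.
http_handler = urllib.request.HTTPHandler()

# Layer 3 ("urllib"): the opener that picks a handler for each URL.
opener = urllib.request.build_opener(http_handler)


def layer_names(opener):
    """Return the class names of the handlers an opener will consult
    (hypothetical helper, for illustration only)."""
    return [type(h).__name__ for h in opener.handlers]


print(isinstance(http_handler, urllib.request.AbstractHTTPHandler))
print(isinstance(opener, urllib.request.OpenerDirector))
print(layer_names(opener))
```

All three layers can be reached from one import of the higher-level
module, which is the sense in which the conceptual units share a module.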

But from your last post, it seems I may have completely misunderstood.
Perhaps part of your point is simply that the division into modules should
better reflect the conceptual structure?  The current organisation may (or
may not) have been a mistake, but it's not a significant enough mistake to
warrant reorganisation, is it?

> That's fine, but why would so much useful http-specific knowledge live
> outside of an http-specific module?

Because it's possible to separate out the very basic, minimal HTTP stuff
into httplib, it was the right decision to do so from an implementation
point of view.  From a user's point of view, since HTTP is stateless, it's
always possible to get at the httplib.HTTP interface even when using
urllib2 (I guess that's false for HTTP 1.1, though, which hadn't occurred
to me until you pointed it out).  Are we agreed on this?

A particular example you raised:

> Ok, maybe I'm just not understanding how httplib/urllib/urllib2 work
> together. For example, what is the correct way to do a HTTP HEAD
> request that follows redirects? It's not hard, but it's silly to have
> to code it yourself if somebody already did the work for you. Well,
> httplib doesn't know how to follow an HTTP redirect, but lo and
> behold, urllib does. Unfortunately, there's no way to use it in this
> case because urllib is for opening URLs (the GET is hardcoded and not
> easily overridable).

We agree on the problem here.  However, I don't see why the basic
architecture of urllib2 is wrong here.  HTTPRedirectHandler can be used
entirely separately from the OpenerDirector machinery, and is documented
in the std. lib. docs.  You are of course right in saying that it fails
completely to handle HEAD requests and cookies, for example.  *However*,
this only has the undesirable effect of punting you back to httplib (which
I think we're agreed is the problem we're discussing here) if you happen
to think that that'll be less work than modifying HTTPRedirectHandler,
which is an almost entirely self-contained class.  Obviously, it would be
better if this job were done once and for all in the proper manner, by
fixing HTTPRedirectHandler to handle HEAD requests (assuming that HEAD
requests are supposed to get redirected), but the architecture is all there
-- we would just have to use the Request object to see if we're doing a
HEAD or a POST, and act accordingly.
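
To make "use the Request object and act accordingly" concrete, here is a
rough sketch of one way it might look.  It assumes the Python 3 spelling
of the module (urllib.request in place of urllib2) and that a redirected
HEAD should stay a HEAD, which is a policy choice rather than settled
fact.  MethodPreservingRedirectHandler and head_request are hypothetical
names, not part of the library.

```python
import urllib.request


class MethodPreservingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Hypothetical handler: keep HEAD as HEAD when following a redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Let the stock handler decide whether this redirect is allowed
        # and build the follow-up request.
        new_req = super().redirect_request(req, fp, code, msg, headers, newurl)
        # Consult the original Request for its method, then act on it.
        if new_req is not None and req.get_method() == "HEAD":
            new_req.method = "HEAD"
        return new_req


def head_request(url):
    """Build a Request that will be sent as HEAD (hypothetical helper)."""
    return urllib.request.Request(url, method="HEAD")


# The handler replaces the default HTTPRedirectHandler in the opener.
opener = urllib.request.build_opener(MethodPreservingRedirectHandler())
```

With this in place, opener.open(head_request(url)) would be expected to
issue a HEAD and follow redirects with HEAD; other methods fall through
to whatever the stock handler already does.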

To reiterate, the framework is there, the code isn't.

Interested in any more ideas you may have -- the more concrete, the
better!

John