urllib, urllib2, httplib -- Begging for consolidation?

Wed May 8 10:57:17 EDT 2002

On 8 May 2002, Paul Boddie wrote:

> > BUT, if there were to occur some sort of consolidation (meaning, 
> > introducing incompatibilities or a whole new module), then we should use 
> > that as an opportunity to restructure/redesign that whole set of modules 
> > because, IMO, they've evolved past their original design. If we can come 
> > up with a good organization, the actual implementation could be handled by 
> > various members of the community.
> 
> I think we should stick with the urlopen concept because it's very
> powerful - just open a URL and pretend that it's a file.

Oh, I wasn't saying that it's not powerful (it is), just that it's not all
that commonly useful. Obviously YMMV, but for me it has been pretty rare to
want to open some generic resource and read it. I'm not arguing that we
should get rid of this functionality, just observing that for me it has
never been the common case, and doing what has been the common case (HTTP
connection + extra headers + some data to post) is always more work than 
it needs to be.

> The clever design will arise when specialised features of various
> protocols need to be specified whilst using the general interface,

But if you're specifying features specific to a certain protocol, why use 
the general interface to do it? That makes the general interface hacky and 
cluttered. My argument is that, right now, people use the general 
interface (urllib) not because they don't know what type of URL they're 
opening (ftp/http/file/etc) but because the modules somewhat discourage 
using the other ones. 

I don't want to make it sound like too big of a deal, but the usage model
today doesn't make sense: if a newbie wants the easiest way to open
an HTTP URL, he should use urllib. If he wants to do something a little
more complex, he has to scrap the urllib code and switch to httplib. Like I
said, it's not too big of a deal, but it would make more sense if moving
from the simple to the more complex case were incremental and based on the
same code.
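
To make "incremental" concrete, here is a minimal sketch of the progression
I have in mind. It is written against the Python 3 module names
(urllib.request, urllib.parse) rather than the urllib/httplib names used in
this thread, and build_request is a hypothetical helper, not an existing
library function. Nothing is sent over the network; the sketch only builds
the request objects.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url, headers=None, form=None):
    """Hypothetical helper: each optional argument adds one step of complexity."""
    # POST data must be bytes; with no form the data stays None and the
    # request is a plain GET.
    data = urlencode(form).encode("ascii") if form else None
    return Request(url, data=data, headers=headers or {})


# Simplest case: just a URL.
simple = build_request("http://www.example.com/")

# One step up: same call, plus extra headers.
with_headers = build_request(
    "http://www.example.com/",
    headers={"User-Agent": "demo/0.1"},
)

# One more step: same call, plus some data to post.
with_post = build_request(
    "http://www.example.com/submit",
    headers={"User-Agent": "demo/0.1"},
    form={"name": "Dave"},
)
```

Each request could then be handed to urllib.request.urlopen; the point is
only that the more complex cases reuse the simple one instead of replacing
it.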

> then there are plenty of other packages which deal with this kind of
> problem; for example, the DB-API has ways of allowing database-specific
> functionality to be specified when opening database connections.

Actually, this is a great example of what I'm saying we need! :) The DB 
API does _not_ provide you a way to open an unknown database type, but a 
common way to operate on a database connection once you have one. It would 
be "powerful" if the DB API let you pass in a string that, among other 
things, included "oracle", "gadfly", "mysql", etc. to denote database type 
and it would then go connect to it, but such a feature wouldn't be very 
*useful* because in practice you almost always *do* know what database 
type you're connecting to. So, you use a specific database module (e.g. 
DCOracle2) to get a database connection (to which you can pass all sorts 
of custom information), after which you can use the connection in a 
pretty generic way. At the same time, however, the connection object can 
still expose vendor-specific functionality beyond what is specified in 
the DB API.
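
As a rough illustration of that split, the sketch below uses the sqlite3
module as a stand-in for a vendor module like DCOracle2 (an assumption made
only so the example is self-contained). The connect call and
create_function are specific to sqlite3; list_names is a hypothetical
function that sticks to the generic DB API calls.

```python
import sqlite3


def list_names(conn):
    """Generic code: uses only DB API calls (cursor, execute, fetchall)."""
    cur = conn.cursor()
    cur.execute("SELECT name FROM people ORDER BY name")
    return [row[0] for row in cur.fetchall()]


# Module-specific step: sqlite3.connect takes its own arguments.
conn = sqlite3.connect(":memory:", isolation_level=None)

# Vendor-specific extra on the connection object, outside the DB API.
conn.create_function("shout", 1, lambda s: s.upper())

cur = conn.cursor()
cur.execute("CREATE TABLE people (name TEXT)")
# The "?" placeholder style is what sqlite3 uses; other modules may differ.
cur.executemany("INSERT INTO people VALUES (?)", [("dave",), ("paul",)])

names = list_names(conn)
cur.execute("SELECT shout(name) FROM people ORDER BY name")
shouted = cur.fetchall()
```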

A similar approach might work well for the different protocol libraries - 
go to the appropriate module to open the one you want (setting it up with 
any protocol-specific information), after which you have a file-like 
object that your code can use generically. Note that on top of all this 
somebody could still have the urllib functionality that takes a generic 
URL, figures out the appropriate protocol, and returns the correct 
"connection" object for your code to use, but such a top-level function 
would *not* be the place to start adding protocol-specific options.
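
Here is a minimal sketch of that layering, under several assumptions:
open_file, open_data, open_url and OPENERS are all hypothetical names, the
two "protocol modules" are reduced to one function each, and the schemes
were picked so that the example needs no network. It uses the Python 3
module names.

```python
import io
import os
import pathlib
import tempfile
from urllib.parse import unquote_to_bytes, urlsplit
from urllib.request import url2pathname


def open_file(url, mode="rb"):
    """Hypothetical 'file' protocol opener; mode is a protocol-specific option."""
    return open(url2pathname(urlsplit(url).path), mode)


def open_data(url):
    """Hypothetical opener for the simplest 'data:,payload' URLs."""
    _, _, payload = urlsplit(url).path.partition(",")
    return io.BytesIO(unquote_to_bytes(payload))


OPENERS = {"file": open_file, "data": open_data}


def open_url(url):
    """Generic top-level entry point: dispatch on scheme, no protocol options."""
    scheme = urlsplit(url).scheme
    try:
        opener = OPENERS[scheme]
    except KeyError:
        raise ValueError("unsupported URL scheme: %r" % scheme)
    return opener(url)


def read_all(fileobj):
    """Generic consumer: only relies on the file-like interface."""
    with fileobj:
        return fileobj.read()


# Demo: the same consumer works on both kinds of "connection" object.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"from disk")
from_disk = read_all(open_url(pathlib.Path(tmp.name).as_uri()))
os.unlink(tmp.name)
from_data = read_all(open_url("data:,hello%20world"))
```

Anything protocol-specific (like the mode argument here) lives on the
individual opener, so code that needs it calls open_file directly and
open_url stays small.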

One last thing: it's enticing to try to come up with a generic model for 
protocols like the DB API, but we can't take that analogy too far because 
with databases the differentiating factor is mostly just the vendor and 
the connection step, after which the database is often used identically 
regardless of vendor.

With network protocols, however, there is much less overlap in both
functionality and how you'd use them (and rightfully so, since they are
different protocols built to serve different purposes!). And that's the 
whole reason why a generalized interface is nifty but less useful - the 
protocols were built to do different things, so trying to use them all the 
same way essentially dumbs them all down to nearly the level of a plain 
file (and it's ok to have such a dumbing-down function, but it's just not 
the common case).

> I strongly disagree that URL manipulation should be accessed through a
> HTTP-specific module - the last thing we need is a "beware of the
> leopard" situation in the standard library (where things are tucked away
> in obscure or bizarre places, depending on the context of the enquiry).

I agree completely. Perhaps I wasn't very clear, but I was advocating a 
consolidation of URL handling into a specific module that is possibly 
accessible through the different protocol modules too. On second thought, 
that's a dumb idea - it should all be in its own module. Due to evolution 
we have the opposite today: people look in urllib/httplib instead of 
urlparse and vice versa, for example.
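
For reference, this is the kind of protocol-independent URL manipulation
that belongs in one place. The sketch uses urllib.parse, which is where the
urlparse functions live under the Python 3 names; the URLs are made up for
illustration.

```python
from urllib.parse import urljoin, urlsplit

# Splitting a URL into its components needs no protocol module at all.
parts = urlsplit("http://www.python.org:80/doc/lib/?q=urlopen#top")

# Resolving a relative reference against a base URL is also pure URL work.
joined = urljoin("http://www.python.org/doc/lib/", "../ref/")
```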

> > We could take the same approach with other protocols, and include modules 
> > for FTP, plain files, etc. With all those in place we could still have the 
> > "open any type of URL" routine built on top, but it should work only for 
> > the simplest of use cases; if you need something more complex then you'd 
> > go use the corresponding protocol library yourself.
> 
> The key to this exercise is making the uncommon case almost as easy to
> handle as the common case so that one doesn't necessarily need to
> learn a completely new framework in order to get that 1% of
> functionality that the common case doesn't deal with.

I think I get your point, but I'd state it as "the key is to try to have
each decreasingly common case build on the previous case" (so you don't
have to relearn and you don't have to toss out work already done).

-Dave