[Web-SIG] So what's missing?

Mon Oct 27 10:00:20 EST 2003

On Sun, 26 Oct 2003, Ian Bicking wrote:
> On Sunday, October 26, 2003, at 07:24 AM, John J Lee wrote:
[...]
> Essentially we'd just move HTTPBasicAuthHandler.http_error_401 into
> HTTPHandler.  You could still override it, and HTTPBasicAuthHandler
> would still override it (and somewhat differently, because
> HTTPHandler.http_error_401 should handle both basic and digest auth).
> It's a pretty small change, really.

So is the benefit.  It's

a = HTTPBasicAuthHandler()
a.add_password(user="joe", password="joe")
o = build_opener(a)

vs.

o = build_opener(HTTPHandler(user="joe", password="joe"))

(assuming defaults for realm and uri -- BTW, there seems to be an
HTTPPasswordMgrWithDefaultRealm already, which I guess goes some way
towards what you want)

If we're still using build_opener, and HTTPBasicAuthHandler were to
override HTTPHandler, it would have to be derived from it.  Not that a
build_opener work-alike couldn't be devised, of course.
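
For comparison, here is a minimal sketch of the longer spelling with
the default-realm password manager mentioned above.  It is written
against urllib.request, the Python 3 home of the urllib2 classes; that
choice, the host name and the credentials are all assumptions made so
the snippet runs as-is.

```python
import urllib.request

# With this manager, realm=None acts as a catch-all realm.
mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Placeholder URI prefix and credentials, for illustration only.
mgr.add_password(None, "http://example.com/", "joe", "joe")

auth = urllib.request.HTTPBasicAuthHandler(mgr)
opener = urllib.request.build_opener(auth)

# The lookup succeeds for any realm on URLs under the registered prefix.
creds = mgr.find_user_password("Some Realm", "http://example.com/private/page")
```

Nothing is fetched here; opener.open(url) would only consult the
manager after the server answers with a 401.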

[...]
> > I'm still waiting for that example.
>
> I thought I gave examples: documentation, proliferation of classes,
> non-orthogonality of features (e.g., HTTPS vs. HTTP isn't orthogonal to
> authentication).

Lack of documentation doesn't justify changes to the code.  There is not
any harmful proliferation of classes, I think: the function of the
handlers is pretty obvious in most cases (though obviously the docs could
be better).  I don't recognize the orthogonality problem you're referring
to.

[...]
> urlopen('http://whatever.com',
>      username='bob',
>      password='secret',
>      postFields={...},
>      postFiles={'image': ('test.jpg', '... image body ...')},
>      addHeaders={'User-Agent': 'superbot 3000'})
[...]
> write than any OO-based system.  I'm concerned about the external ease
> of use, not the internal conceptual integrity.

OK, maybe I'm overconcerned about this layer -- if it's a simple
convenience thing like this, fine (as long as it actually is useful
and simple, of course).

My biggest concern was that you seemed to be advocating a new UserAgent
class, which would presumably more-or-less duplicate OpenerDirector (you
probably want to skip to the end of this post at this point, because I
think you may have missed a crucial point about that class).
OpenerDirector is not such a great name, actually: maybe UserAgent or
URLOpener would have been better...

> >> authentication information (and it doesn't obey browser URL
> >> conventions, like http://user:password@domain/).
> >
> > What is that convention?  Is it standardised in an RFC?
>
> It's a URL convention that's been around a very long time, I don't know
> if it is in an RFC.
>
> > I see
> > ProxyHandler knows about that syntax.  Obviously it's not an intrinsic
> > limitation of the handler system.
>
> I don't really know how a handler is chosen -- can it figure out
> whether it should use HTTPHandler, HTTPBasicAuthHandler, or
> HTTPDigestAuthHandler just from this URL?  Obviously basic vs. digest
> can't be determined until you try to fetch the object.

The user and password here are for the proxy, not the server (there's some
code duplication here actually, but that's just a bug).  Dunno if that's
standard use of that syntax.
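
As a rough sketch of how credentials embedded in a URL could be
separated out before any handler sees them: the helper name
split_credentials is hypothetical, and the code uses urllib.parse from
Python 3 rather than anything in urllib2.

```python
from urllib.parse import urlsplit, urlunsplit

def split_credentials(url):
    """Return (url_without_userinfo, username, password).

    Hypothetical helper; IPv6 literal hosts are not handled.
    """
    parts = urlsplit(url)
    netloc = parts.hostname or ""
    if parts.port is not None:
        netloc += ":%d" % parts.port
    clean = urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment))
    return clean, parts.username, parts.password
```

The returned user and password could then be handed to a password
manager's add_password, keyed on the cleaned URL.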

[...]
> > Mind you, if your idea can do the same job as my RFE, then it should
> > certainly be considered alongside that.
>
> Hmm... I just looked at the RFE now, so I'm still not sure what it
> would mean to this.

Sorry, I don't understand 'what it would mean to this'.  What's 'this'?

> >> Yet none of these features
> >> would be all that difficult to add via urlopen or perhaps other simple
> >> functions, (instead of via classes).  I don't think there's any need
> >> for classes in the external API -- fetching URLs is about doing
> >> things, not representing things, and functions are easier to
> >> understand for doing.
> >
> > Details?  The only example you've given so far involved a UserAgent
> > class.
>
> Details about what?  You're asking for details and examples, but I've
> provided some already and I don't know what you're looking for.

You provided some examples of features you think would require some kind
of layer on top of urllib2.  I thought you were originally suggesting a
new UserAgent class or similar (that was you, wasn't it?).  I don't think
that's necessary.

But in the post I'm replying to here, you gave an example of adding args
to urlopen.  I do agree that something like that could be useful. I think
the docs should be changed here to make it clear that urlopen is just a
convenience function that uses a global OpenerDirector.
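
A convenience layer of that kind might look roughly like the sketch
below.  The names build_simple_opener and simple_urlopen and their
keyword arguments are hypothetical, the postFields / postFiles part of
the proposal is left out, and urllib.request (Python 3) stands in for
urllib2.

```python
import urllib.request

def build_simple_opener(url, username=None, password=None, add_headers=None):
    """Build a private OpenerDirector from plain keyword arguments."""
    handlers = []
    if username is not None:
        mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        mgr.add_password(None, url, username, password)
        # Register both schemes; the server's 401 decides which one is used.
        handlers.append(urllib.request.HTTPBasicAuthHandler(mgr))
        handlers.append(urllib.request.HTTPDigestAuthHandler(mgr))
    opener = urllib.request.build_opener(*handlers)
    if add_headers:
        # Replaces the default User-agent header list on this opener only.
        opener.addheaders = list(add_headers.items())
    return opener

def simple_urlopen(url, data=None, **options):
    """Function-style front end; no global opener is installed."""
    return build_simple_opener(url, **options).open(url, data)
```

The point of the design is that the function is only sugar: everything
it does is still expressed with ordinary handlers.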

[...]
> >> I think fetching and caching are two separate things.  The caching
> >> requires a context.  The fetching doesn't.  I think fetching things
> >
> > The context is provided by the handler.
>
> But we're fetching URLs, not handlers.  The URL is context-less,
> intrinsically.  The handler isn't context-less, but that's part of what
> I don't like about urllib2's handler-oriented perspective.

I don't understand what you just said, but I think we're agreed that
something that doesn't require calling build_opener or
OpenerDirector.add_handler could be convenient.

> > [...]
> >> I also don't see how caching would fit very well into the handler
> >> structure.  Maybe there'd be a HTTPCachingHandler, and you'd
> >> instantiate it with your caching policy? (where it stores files, how
> >> many files, etc)  Also a HTTPBasicAuthCachingHandler,
> >> HTTPDigestAuthCachingHandler, HTTPSCachingHandler, and so on?  This
> >> caching is orthogonal -- not just to things like authentication, but
> >
> > My assumption was that it wasn't orthogonal, since RFC 2616 seems to
> > have rather a lot to say on the subject.
>
> Well, if they aren't orthogonal, then they should all be implemented in
> a single class.

Yes.  Off the top of my head, I'd say something like (taking note of your
point below about needing to actually cache responses as well as return
cached data!):

class AbstractHTTPCacheHandler:
    def cached_open(self, request):
        # return cached response, or None if no cache hit
        return None
    def cache(self, response):
        # cache response if appropriate
        pass

class HTTPCacheHandler(AbstractHTTPCacheHandler):
    # base-class names aren't visible in a subclass body, so qualify them
    http_open = AbstractHTTPCacheHandler.cached_open
    http_response = AbstractHTTPCacheHandler.cache

or, if you want a class that does both HTTP and HTTPS:

class HTTPXCacheHandler(AbstractHTTPCacheHandler):
    https_open = http_open = AbstractHTTPCacheHandler.cached_open
    https_response = http_response = AbstractHTTPCacheHandler.cache
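
To make the outline above concrete, here is a hedged, in-memory
sketch.  It assumes urllib.request from Python 3, where a handler may
define http_response(request, response) hooks alongside http_open.
The class name InMemoryCacheHandler is hypothetical, and the policy
(cache every 200 response to a GET forever) ignores everything RFC
2616 says about freshness and validation.

```python
import io
import urllib.request
import urllib.response

class InMemoryCacheHandler(urllib.request.BaseHandler):
    """Hypothetical handler that replays GET responses from a dict."""
    handler_order = 100  # sort ahead of the stock handlers (default is 500)

    def __init__(self):
        self._store = {}  # URL -> (body, headers, status code)

    def _replay(self, url):
        body, headers, code = self._store[url]
        response = urllib.response.addinfourl(
            io.BytesIO(body), headers, url, code)
        response.msg = "OK"  # HTTPErrorProcessor reads .msg on every response
        return response

    def http_open(self, request):
        url = request.get_full_url()
        if request.get_method() == "GET" and url in self._store:
            return self._replay(url)
        return None  # cache miss: let the next handler fetch it

    def http_response(self, request, response):
        url = request.get_full_url()
        if (request.get_method() == "GET" and response.code == 200
                and url not in self._store):
            self._store[url] = (response.read(), response.info(), response.code)
            return self._replay(url)  # the original body has been consumed
        return response

    https_open = http_open
    https_response = http_response
```

It would be plugged in with build_opener(InMemoryCacheHandler()); each
opener built that way gets its own private cache.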

[...]
> Why not have just one good HTTP handler class?

Why would you want one when you can easily do whatever you want with a
convenience function or two, and / or a class derived from OpenerDirector,
or something that works like build_opener, etc.?  Not so easy to go in the
other direction, and separate out the various features of a big,
all-singing all-dancing HTTP handler.  That was a big part of the
motivation for urllib2 in the first place: inflexibility of urllib.

> Many parts of the caching mechanics aren't part of RFC 2616 --
> specifically persistence, metadata storage and querying, and cache
> control.  These aren't part of HTTP at all.

I'll take your word for that, but I admit I don't see where that
causes problems for urllib2.

> > If it *is* (or part of it is) orthogonal, three options come to mind.
> > Let's say you have a cache class.
> >
> > 1. All the normal handlers know about the cache class, but have caching
> >    off by default.
> >
> > 2. Write a CacheHandler with a default_open.  If there's a cache hit,
> >    return it, otherwise return None (let somebody else try to handle
> >    it).
> >
> > 3. Subclass (or replace without bothering to subclass)
> >    OpenerDirector.  I guess open is probably what you'd want to
> >    change, but I don't know about HTTP and other protocols' caching
> >    rules.
> >
> > I haven't thought it through so I certainly don't claim to know how
> > any of these will turn out (though I'd guess 2. would do the job of
> > any caching that's orthogonal to the various protocol schemes).  If
> > you want to justify a new layer, though, it's up to you to show
> > caching *doesn't* fit urllib2 as-is.  YAGNI.
>
> 1 seems like a lot of trouble.

Doesn't appeal to me either.

> 2 won't work, since CacheHandler can't
> return None and let someone else do the work, because it has to know
> about what the result is so that it can cache the result.

At last, a real problem!  Actually, I think this is a problem already
solved by my 'processors' idea, though perhaps not quite in its current
form -- that should be easy to fix, though (ATM, IIRC, they're separate
from handlers: you can't have an object that is both a handler and a
processor -- and they don't currently have default_request and
default_response methods, either).

> It would
> have to be 3, since it's really about intercepting handler calls.  I
> would imagine that it should wrap OpenerDirector, and perhaps subclass
> it as well.  Then protocols can be added to the caching and non-caching
> directors at the same time.
>
> But it seems like there can be only one OpenerDirector... that messes

Nope.  You can have as many as you like, with as many different
implementations as you like.  There is only the inconvenience of having to
cut-n-paste build_opener (certainly build_opener isn't ideal as it is, but
I guess people agree with me that that's a pretty small issue, since
nobody has bothered to finish OpenerFactory).

> things up.  Multiple caches with different policies should be possible.
> Which leads us back to a separate class that handles caching.
>
> >> even to HTTP (to some degree).  The handler structure doesn't allow
> >> orthogonal features.  Except through mixins, but don't get me started
> >> on mixins...
> >
> > I don't think that's true -- see above.
> >
> > Again, my 'processors' patch is relevant here (see that RFE).  But no
> > point in re-iterating here the long discussion I posted on the SF bug
> > tracker.
>
> I missed that when you posted it.  That might handle some of these
> features.  It seems a little too global to me.  For instance, how would
> you handle two distinct user agents with respect to the referer header?

Two OpenerDirectors!

new_opener = build_opener()
new_opener.addheaders = [("User-agent", "Mozilla/5.0")]

old_opener = build_opener()
old_opener.addheaders = [("User-agent", "Mozilla/4.0")]

new_opener.open("http://www.a.com/")
old_opener.open("http://www.b.com/")
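
The same idea extends to the referer question: state can live on a
handler instance, so each opener carries its own.  A minimal sketch,
assuming the http_request / http_response hooks of urllib.request in
Python 3; the class name RefererProcessor is hypothetical.

```python
import urllib.request

class RefererProcessor(urllib.request.BaseHandler):
    """Hypothetical processor: remembers the last URL it saw fetched."""

    def __init__(self):
        self.last_url = None  # per-instance, hence per-opener, state

    def http_request(self, request):
        if self.last_url is not None:
            request.add_unredirected_header("Referer", self.last_url)
        return request

    def http_response(self, request, response):
        self.last_url = request.get_full_url()
        return response

    https_request = http_request
    https_response = http_response

# Two user agents, two independent referer histories.
agent_a = urllib.request.build_opener(RefererProcessor())
agent_b = urllib.request.build_opener(RefererProcessor())
```

Pages opened through agent_a never influence the Referer header sent
by agent_b, because the two processors share nothing.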

> Seems like it would also make sense as a OpenerDirector
> subclass/wrapper.

IIRC, there are issues with redirection that prevent that.

> At least portions of it are similar to doing caching
> (like cookies and referers), which is to say a request that is made in
> a specific context.  One example of an application that would require
> separate contexts would be when testing concurrency in a web
> application -- you want to simulate multiple users logging in and
> performing actions concurrently.  You can't do this if the context is
> stored globally.

Perhaps this is all you're missing?  Nothing is global until you use
install_opener.

o = build_opener()  # build OpenerDirector
o.open(url)  # nothing global here, urlopen doesn't know about our opener

install_opener(o)  # install OpenerDirector globally, for use by urlopen
urlopen(url)
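
For the concurrent-users test scenario, a sketch along the same lines
gives each simulated user a private cookie jar.  It assumes
http.cookiejar and urllib.request from Python 3; the helper name
new_session is hypothetical.

```python
import http.cookiejar
import urllib.request

def new_session():
    """Return (opener, jar) for one simulated user; nothing here is global."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

alice, alice_jar = new_session()
bob, bob_jar = new_session()
# alice.open(url) and bob.open(url) would keep separate cookies;
# urlopen() is unaffected unless install_opener() is called.
```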

John