[Web-SIG] So what's missing?

Sun Oct 26 17:39:45 EST 2003

On Sunday, October 26, 2003, at 07:24 AM, John J Lee wrote:
>> c) This doesn't have to affect backward compatibility anyway.  We can
>> leave HTTPBasicAuthHandler in there (deprecated), but also fold its
>> functionality into HTTPHandler.  HTTPBasicAuthHandler doesn't require
>> that HTTPHandler *not* handle authentication.
>
> Well, it does if you do something important in your auth handler that
> never gets called because HTTPHandler has decided it knows best when it
> comes to 40x.  But like you say, there's probably not much important
> that you could do since password management is already abstracted out.

Essentially we'd just move HTTPBasicAuthHandler.http_error_401 into 
HTTPHandler.  You could still override it, and HTTPBasicAuthHandler 
would still override it (and somewhat differently, because 
HTTPHandler.http_error_401 should handle both basic and digest auth).  
It's a pretty small change, really.

>>> Anyway, it may or may not be the perfect system, but I'm not convinced
>>> it needs changing.  Can you give a specific example of where having lots
>>> of handlers becomes oppressive?
>>
>> The documentation is certainly a problem (e.g., the
>> HTTPBasicAuthHandler page), though it could be organized differently
>> without changing the code.  It's definitely ravioli code
>> (http://c2.com/cgi/wiki?RavioliCode), with all that entails -- IMHO
>> it's hard to document ravioli code well.  (It's not so important how
>> things are structured internally, but currently urllib2 also exposes
>> that complex class structure)
>
> It's pretty simple conceptually: OpenerDirector asks all the handlers if
> they want to handle, not handle, or abort a response.  It does the same
> for errors.  Most of the handlers' functions are self-explanatory from
> their class names (OK, I guessed CacheFTPHandler wrong, but it was 50-50
> :-).  I wouldn't call that ravioli.

It might work conceptually internally, and big internal changes
probably aren't necessary.  But it doesn't work conceptually for a
programmer who has a task in mind.  Programmers starting to use urllib2
don't want to understand a framework of handlers; they want to get
something off the net.  urlopen() is the only easy way to do that in
urllib2; everything else requires a lot more thinking.  And urlopen()
isn't very featureful.

> I'm still waiting for that example.

I thought I gave examples: documentation, proliferation of classes, 
non-orthogonality of features (e.g., HTTPS vs. HTTP isn't orthogonal to 
authentication).
>
>> Also urlopen is not really extensible.  You can't tell urlopen to use
>
> Not directly, no.  You have to do it via build_opener, or via
> OpenerDirector itself (or another class).  That's probably not ideal: what
> did you have in mind instead?

Maybe keyword arguments that get passed to the handlers.  E.g.:

urlopen('http://whatever.com',
     username='bob',
     password='secret',
     postFields={...},
     postFiles={'image': ('test.jpg', '... image body ...')},
     addHeaders={'User-Agent': 'superbot 3000'})

It could get a little out of hand with all the protocols and all the
features, but I can't think of a better way to do it.  And I think the
features would still be easier to document, even if urlopen() took all
sorts of funny options, than they are with separate handlers.  Maybe
urllib2 just needs better documentation with useful examples; that
signature is pretty hairy.  Still, it's easier to read and write than
any OO-based system.  I'm concerned about the external ease of use, not
the internal conceptual integrity.
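
A minimal sketch of how such a keyword-argument front end could sit on
top of the existing handlers.  It assumes Python 3, where urllib2 lives
on as urllib.request.  The names prepare() and fetch() are hypothetical,
the keyword names follow the call above, and postFiles is left out to
keep it short:

```python
import urllib.parse
import urllib.request


def prepare(url, username=None, password=None, postFields=None,
            addHeaders=None):
    """Return (opener, request) built from plain keyword arguments.

    Hypothetical helper, not part of urllib.request.
    """
    handlers = []
    if username is not None:
        # A default-realm password manager answers for any realm under url.
        manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        manager.add_password(None, url, username, password or "")
        # Install both; which one answers depends on the server's challenge.
        handlers.append(urllib.request.HTTPBasicAuthHandler(manager))
        handlers.append(urllib.request.HTTPDigestAuthHandler(manager))
    data = None
    if postFields is not None:
        # Supplying data turns the request into a POST.
        data = urllib.parse.urlencode(postFields).encode("ascii")
    request = urllib.request.Request(url, data=data, headers=addHeaders or {})
    return urllib.request.build_opener(*handlers), request


def fetch(url, **options):
    """Fetch url and return the body; options are passed to prepare()."""
    opener, request = prepare(url, **options)
    with opener.open(request) as response:
        return response.read()
```

The caller never sees a handler class; the handlers are still there,
just hidden behind the function.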

>> authentication information (and it doesn't obey browser URL
>> conventions, like http://user:password@domain/).
>
> What is that convention?  Is it standardised in an RFC?

It's a URL convention that's been around a very long time; I don't know
if it is in an RFC.

> I see
> ProxyHandler knows about that syntax.  Obviously it's not an intrinsic
> limitation of the handler system.

I don't really know how a handler is chosen -- can it figure out 
whether it should use HTTPHandler, HTTPBasicAuthHandler, or 
HTTPDigestAuthHandler just from this URL?  Obviously basic vs. digest 
can't be determined until you try to fetch the object.
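
For what it's worth, pulling the credentials out of such a URL is the
easy part.  A small sketch, assuming Python 3's urllib.parse;
split_credentials() is a hypothetical name:

```python
from urllib.parse import urlsplit, urlunsplit


def split_credentials(url):
    """Return (url_without_credentials, username, password).

    username and password are None when the URL carries none.
    Limitation of this sketch: the host is lowercased and IPv6
    brackets are not restored.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port is not None:
        host = "%s:%d" % (host, parts.port)
    clean = urlunsplit((parts.scheme, host, parts.path,
                        parts.query, parts.fragment))
    return clean, parts.username, parts.password
```

The extracted pair could then be handed to a password manager, leaving
the basic vs. digest decision to whatever answers the 401.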

>> And we want to add
>> structured POST data to that method (but also allow non-structured
>
> We do?  Why not just have a function (to make file upload data, assuming
> that's what you're thinking of)?

That would work too.
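
A rough sketch of what such a function might look like, in Python 3.
encode_multipart() is a hypothetical name; the files argument uses the
same {'image': ('test.jpg', body)} shape as the postFiles example
above, and every file is sent as application/octet-stream for
simplicity:

```python
import uuid


def encode_multipart(fields, files):
    """Encode form data as multipart/form-data.

    fields: {name: text value}
    files:  {name: (filename, body as bytes)}
    Returns (content_type, body).  Names and filenames are not escaped,
    so this sketch assumes they contain no quotes or newlines.
    """
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    out = []
    for name, value in fields.items():
        out.append(delimiter)
        out.append(('Content-Disposition: form-data; name="%s"'
                    % name).encode("utf-8"))
        out.append(b"")
        out.append(value.encode("utf-8"))
    for name, (filename, content) in files.items():
        out.append(delimiter)
        out.append(('Content-Disposition: form-data; name="%s"; filename="%s"'
                    % (name, filename)).encode("utf-8"))
        out.append(b"Content-Type: application/octet-stream")
        out.append(b"")
        out.append(content)
    # The closing delimiter carries two extra dashes.
    out.append(delimiter + b"--")
    out.append(b"")
    return "multipart/form-data; boundary=" + boundary, b"\r\n".join(out)
```

The returned body would go in as the request data and the content type
as a Content-Type header.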

>> data), and cookies, and it might be nice to set the user-agent, and
>> maybe other things that I haven't thought of.  If urlopen doesn't
>> support these extra features then programmers have to learn a new API
>> as their program becomes more complex.
>
> Well, I can do those things already (cookies, set user-agent) using
> urllib2.  User-Agent is a bit ugly, I'll grant you, but I don't lose sleep
> over it.  I did find that an extension (backwards-compatible, I hope &
> believe) made things much cleaner -- see the RFE I mentioned earlier.
> But no need for a whole new layer.
>
> Mind you, if your idea can do the same job as my RFE, then it should
> certainly be considered alongside that.

Hmm... I just looked at the RFE now, so I'm still not sure what it 
would mean to this.

>> Yet none of these features
>> would be all that difficult to add via urlopen or perhaps other simple
>> functions, (instead of via classes).  I don't think there's any need
>> for classes in the external API -- fetching URLs is about doing things,
>> not representing things, and functions are easier to understand for
>> doing.
>
> Details?  The only example you've given so far involved a UserAgent class.

Details about what?  You're asking for details and examples, but I've
provided some already and I don't know what you're looking for.
Example of what?  I don't have an implementation, or any set
implementation in mind, and I haven't suggested that.

> [...]
>>> So, merely because you think "it feels like a new object", you're
>>> proposing to create a whole new layer of complexity for users to learn?
>>> Why should people have to learn a new API just to get caching?  If
>>> somebody had implemented HTTP caching and found the handler mechanism
>>> lacking, or had a specific argument that showed it to be so, a new
>>> layer *might* be justified.  Otherwise, I think it's a bad idea.
>>
>> I think fetching and caching are two separate things.  The caching
>> requires a context.  The fetching doesn't.  I think fetching things
>
> The context is provided by the handler.

But we're fetching URLs, not handlers.  The URL is context-less, 
intrinsically.  The handler isn't context-less, but that's part of what 
I don't like about urllib2's handler-oriented perspective.

> [...]
>> I also don't see how caching would fit very well into the handler
>> structure.  Maybe there'd be a HTTPCachingHandler, and you'd
>> instantiate it with your caching policy? (where it stores files, how
>> many files, etc)  Also a HTTPBasicAuthCachingHandler,
>> HTTPDigestAuthCachingHandler, HTTPSCachingHandler, and so on?  This
>> caching is orthogonal -- not just to things like authentication, but
>
> My assumption was that it wasn't orthogonal, since RFC 2616 seems to have
> rather a lot to say on the subject.

Well, if they aren't orthogonal, then they should all be implemented in 
a single class.  Implementing features in subclasses means that they 
can't be easily used in combination.  Why not have just one good HTTP 
handler class?  It's all one protocol (and HTTPS is exactly the same 
protocol).

Many parts of the caching mechanics aren't part of RFC 2616 -- 
specifically persistence, metadata storage and querying, and cache 
control.  These aren't part of HTTP at all.

> If it *is* (or part of it is) orthogonal, three options come to mind.
> Let's say you have a cache class.
>
> 1. All the normal handlers know about the cache class, but have caching
>    off by default.
>
> 2. Write a CacheHandler with a default_open.  If there's a cache hit,
>    return it, otherwise return None (let somebody else try to handle it).
>
> 3. Subclass (or replace without bothering to subclass) OpenerDirector.
>    I guess open is probably what you'd want to change, but I don't know
>    about HTTP and other protocols' caching rules.
>
> I haven't thought it through so I certainly don't claim to know how any of
> these will turn out (though I'd guess 2. would do the job of any caching
> that's orthogonal to the various protocol schemes).  If you want to
> justify a new layer, though, it's up to you to show caching *doesn't* fit
> urllib2 as-is.  YAGNI.

Option 1 seems like a lot of trouble.  Option 2 won't work: CacheHandler
can't return None and let someone else do the work, because it has to
see the result in order to cache it.  It would have to be 3, since it's
really about intercepting handler calls.  I would imagine that it
should wrap OpenerDirector, and perhaps subclass it as well.  Then
protocols can be added to the caching and non-caching directors at the
same time.

But it seems like there can be only one OpenerDirector... that messes
things up.  Multiple caches with different policies should be possible.
Which leads us back to a separate class that handles caching.
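
A minimal sketch of that separate class, under some loud assumptions:
Python 3's urllib.request, where build_opener() hands out as many
independent OpenerDirector instances as you ask for; a hypothetical
CachingFetcher name; an in-memory dict as the store; and a "policy"
that is nothing more than a maximum entry count.  It ignores
everything RFC 2616 says about freshness and validation:

```python
import urllib.request


class CachingFetcher:
    """Wrap an opener and remember response bodies by URL."""

    def __init__(self, opener=None, max_entries=100):
        # Each fetcher owns its opener, so several can coexist.
        self.opener = opener or urllib.request.build_opener()
        self.max_entries = max_entries
        self._store = {}
        self.hits = 0
        self.misses = 0

    def fetch(self, url):
        if url in self._store:
            self.hits += 1
            return self._store[url]
        self.misses += 1
        # The wrapper sees the result, which a handler that returns
        # None from default_open never would.
        with self.opener.open(url) as response:
            body = response.read()
        if self.max_entries > 0:
            if len(self._store) >= self.max_entries:
                # Evict the oldest entry; dicts keep insertion order.
                self._store.pop(next(iter(self._store)))
            self._store[url] = body
        return body
```

Two fetchers built with different max_entries values then behave as
two caches with different policies, and neither touches the other.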

>> even to HTTP (to some degree).  The handler structure doesn't allow
>> orthogonal features.  Except through mixins, but don't get me started
>> on mixins...
>
> I don't think that's true -- see above.
>
> Again, my 'processors' patch is relevant here (see that RFE).  But no
> point in re-iterating here the long discussion I posted on the SF bug
> tracker.

I missed that when you posted it.  That might handle some of these 
features.  It seems a little too global to me.  For instance, how would 
you handle two distinct user agents with respect to the referer header?

Seems like it would also make sense as an OpenerDirector 
subclass/wrapper.  At least portions of it are similar to doing caching 
(like cookies and referers), which is to say a request that is made in 
a specific context.  One example of an application that would require 
separate contexts would be when testing concurrency in a web 
application -- you want to simulate multiple users logging in and 
performing actions concurrently.  You can't do this if the context is 
stored globally.
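
To make the separate-contexts idea concrete, here is a small sketch in
Python 3 (urllib.request and http.cookiejar).  UserSession is a
hypothetical name; each instance keeps its own cookies, user agent, and
last-visited URL for the Referer header, so nothing is shared between
simulated users:

```python
import http.cookiejar
import urllib.request


class UserSession:
    """One simulated user: private cookies, user agent, and referer."""

    def __init__(self, user_agent):
        self.cookies = http.cookiejar.CookieJar()
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.cookies))
        # addheaders is applied to every request sent through this opener.
        self.opener.addheaders = [("User-Agent", user_agent)]
        self.last_url = None

    def build_request(self, url):
        request = urllib.request.Request(url)
        if self.last_url is not None:
            request.add_header("Referer", self.last_url)
        self.last_url = url
        return request

    def open(self, url):
        return self.opener.open(self.build_request(url))
```

Two sessions created side by side can log in as different users and be
driven from separate threads, since neither installs anything global.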

--
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org