[Python-Dev] Re: URL processing conformance and principles (was Re: urllib.urlopen...)

Fri Sep 17 08:07:30 CEST 2004

Mike Brown wrote:
> It is true that we are under no obligation in our API to assume a one-to-one 
> mapping between the characters in a unicode argument and the characters in the 
> resource-identifying string that, in turn, may or may not be a URL, but to do 
> otherwise seems a bit unintuitive, to me.

Not at all. If the URI contains the sequence '%A0', does that constitute
one or three characters? You suggested earlier that the host part of a
URI could be UTF-8 encoded. In that case, a single character translates
into, say, 2 octets, which then get %-escaped, translating into 6 ASCII
characters. So a single Unicode character may end up as multiple ASCII
characters during processing.
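
To make the counting concrete, here is a minimal sketch. It assumes
Python 3's urllib.parse module, which postdates this message, and it is
only an illustration of the arithmetic, not a statement of how urllib
behaves or should behave.

```python
from urllib.parse import quote, unquote_to_bytes

char = "\u00f6"                    # one Unicode character (o with diaeresis)
octets = char.encode("utf-8")      # UTF-8 turns it into two octets: C3 B6
escaped = quote(char, safe="")     # each octet becomes a three-character %XX

# Prints the three lengths (1, 2 and 6) followed by the escaped form.
print(len(char), len(octets), len(escaped), escaped)

# The '%A0' question: three ASCII characters that stand for a single octet.
triple = "%A0"
print(len(triple), unquote_to_bytes(triple))
```

Whether '%A0' counts as one or three characters therefore depends on
which layer of processing you are looking at.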

> You seem to be suggesting that a 
> one-to-one mapping be assumed until a syntax error is found. Then, if the 
> syntax error is of a certain type (like the character is > U+007F), then you 
> seem to be saying that you want some kind of cleanup to be performed in order 
> to ensure that the resulting string is conformant to the URL syntax.
>
> I feel that since urllib is under no obligation to assume anything about what 
> the syntax-violating characters are intended to mean, it would be within its 
> rights to reject the argument altogether, and I would rather see it do that 
> than try to guess what the user intended -- especially in this domain, where 
> such guesses, if wrong, only lead developers to be even more confused about 
> topics that are already barely understood as it is.

Either is fine. It appears that the future URI RFC and the IRI RFC will
suggest that the "cleanup" is the right action, and that the
implementation should indeed process the string.
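
For comparison, the "reject the argument" policy could be sketched as
below. The function name require_ascii_url is hypothetical and is not
part of urllib; the check only covers the "character is > U+007F" case
mentioned above, not full URL syntax validation.

```python
def require_ascii_url(url):
    """Return url unchanged, or raise if it contains a non-ASCII character.

    Hypothetical helper for illustration only: it refuses to guess what
    characters above U+007F are intended to mean.
    """
    for index, ch in enumerate(url):
        if ord(ch) > 0x7F:
            raise ValueError(
                "non-ASCII character %r at index %d" % (ch, index)
            )
    return url
```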

> To me, convenience afforded by the automatic
> percent-encoding is outweighed by the harm introduced by the wrong guesses
> and the reinforcement of the belief in the document author or developer that
> a URI reference is whatever string of characters they want it to be.

I agree. However, I hope that the IRI RFC will resolve the issue for
good, at least when the input is a Python Unicode string. When the input
is a Python byte string, it seems natural to %-escape the non-ASCII
bytes.
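
A minimal sketch of the byte string case, again assuming Python 3's
urllib.parse.quote (which accepts bytes as well as str). The sample
path is made up for illustration.

```python
from urllib.parse import quote

# A byte string whose encoding we do not know; 0xE9 is simply a non-ASCII byte.
raw_path = b"/caf\xe9"

# Each non-ASCII byte is %-escaped as is; no character encoding is guessed.
escaped_path = quote(raw_path, safe="/")
print(escaped_path)
```

No guess about the meaning of the bytes is needed here, which is why
this case is less contentious than the Unicode one.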

> But if we are going to accept arbitrary strings and then attempt to make 'em 
> fit the URL syntax, then we should, IMHO, acknowledge (in API documentation) 
> that this is behavior provided for the sake of having a convenient API, and is 
> not within the scope of the standards. Hopefully the marginal percentage of 
> developers who actually read the API docs can then learn that 
> u'http://m.v.l\xd6wis/' is not a URL, even if urllib happens to convert it to 
> one, and in my perfect fantasy-world, they'd be less inclined to give us any
> reason to make lenient APIs. 

But it is an IRI reference, isn't it? I think urllib then should process
it as such.
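
Roughly, processing it as an IRI reference could look like the sketch
below. This is an assumption-laden illustration, not the conversion
procedure from any RFC: it uses Python 3's urllib.parse and the built-in
"idna" codec, the function name iri_to_uri is hypothetical, and userinfo
in the authority is deliberately ignored.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri):
    """Map an IRI-like Unicode string to an ASCII-only URI (sketch only)."""
    parts = urlsplit(iri)

    # Host: convert each label with the IDNA codec instead of %-escaping.
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc += ":%d" % parts.port

    # Other components: UTF-8 encode, then %-escape the non-ASCII octets.
    # '%' is kept safe so that existing escapes are not escaped twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="/?&=%")
    fragment = quote(parts.fragment, safe="/?%")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))

print(iri_to_uri("http://m.v.l\xd6wis/"))
print(iri_to_uri("http://example.org/caf\xe9"))
```

The host goes through IDNA while the path goes through UTF-8 plus
%-escaping, so the same character is treated differently depending on
the component in which it appears.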

Regards,
Martin