[Python-Dev] urllib.urlopen() vs IDNs, percent-encoded hosts, ':'

Thu Sep 16 08:37:39 CEST 2004

Mike Brown wrote:
> No. The intent is actually that a URI is (not conceptually, just *is*) a 
> string of characters

You are right: URIs are meant to be written on paper. However, RFC 2396
also acknowledges that the issue of non-ASCII characters is unresolved.
It suggests (in 2.1) that the URI scheme should specify the
interpretation of byte values.

> This was actually clear in RFC 2396 sections 1.5 and 2, but has been explained 
> somewhat better in the rephrased section 2 of rfc2396bis, which is in Last 
> Call.

The draft suggests that new URI schemes should mandate UTF-8 in their
components, but it is silent on the issue of existing schemes.

> The question is, does the url argument to urlopen() purport to be or is it 
> assumed to be a URL? The function is quite lenient about what it accepts as a 
> URL -- it accepts pretty much anything you give it, be it unicode or str, with 
> or without a scheme component, relative to some unknown base, and loaded with 
> illegal characters, and it tries to deal with it as best it can -- yet it 
> still rejects or inconsistently handles some valid URIs, and this is what I 
> want to see changed.

If something passed to it is clearly a valid URL, and there is a clear
definition of how a computer should process it, and urllib doesn't, then
this is certainly a bug and should be fixed. Can you give an example of
such a URL?

> Perhaps I should rephrase part of the issue this way: If the argument to 
> urlopen() is assumed to be a URI, then %FF in the argument should not be 
> interpreted any differently when the argument is a str vs when it is unicode. 

Certainly. Indeed, urllib makes no distinction, AFAICT.
"http://localhost/%FF" and u"http://localhost/%FF" are processed in
the same way.
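
As a rough illustration of what "processed in the same way" means at the
octet level, here is a minimal sketch. It assumes Python 3's urllib.parse
rather than the urllib under discussion, and path_octets is a made-up
helper name, not an existing API. The escape %FF decodes to the single
octet 0xFF whether the URL arrives as text or as bytes; no character
interpretation is involved.

```python
from urllib.parse import unquote_to_bytes, urlsplit

def path_octets(url):
    """Return the path component of url as raw octets.

    Hypothetical helper: percent-escapes are decoded to bytes only,
    so no charset is assumed for values such as %FF.
    """
    return unquote_to_bytes(urlsplit(url).path)

# The same URL given as text and as bytes yields the same octets.
print(path_octets("http://localhost/%FF"))
print(path_octets(b"http://localhost/%FF"))
```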

> RFC 2396 left it ambiguous as to what characters are represented by %80-%FF, 
> so an implementation thereof may make such interpretations as it pleases.
> The current implementation doesn't do this in a consistent manner.

No. RFC 2396 defers the specification to the specific scheme.

>>Applications that put URL-escaped UTF-8 bytes into host names deserve to
>>lose.
> 
> 
> Come February or whenever rfc2396bis and the IRI draft become RFCs, that
> will no longer be a position you can maintain.

I see. I think I could accept a patch in this direction for
Python 2.4 even if RFC2396bis isn't published, assuming the patch
arrives before 2.4b1.
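
Roughly, such a patch might do something like the following. This is only
a sketch under stated assumptions: host_to_ascii is a hypothetical name,
the code uses Python 3's urllib.parse and the built-in "idna" codec
(IDNA 2003), and it assumes that escapes in the host are UTF-8 octets, as
the drafts propose.

```python
from urllib.parse import unquote

def host_to_ascii(host):
    """Convert a possibly percent-encoded host name to its ASCII form.

    Hypothetical helper. Assumption: the escaped octets are UTF-8;
    anything else raises UnicodeDecodeError instead of being guessed.
    """
    decoded = unquote(host, encoding="utf-8", errors="strict")
    # The "idna" codec applies ToASCII to each label of the host name.
    return decoded.encode("idna").decode("ascii")

print(host_to_ascii("b%C3%BCcher.example"))
print(host_to_ascii("localhost"))
```

A plain ASCII host passes through unchanged, so existing callers would
not see a difference.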

> Let me be clear though - I am not suggesting getting rid of support for '|'.
> I am merely saying that there is no reason ':' should, on Windows, fail to
> be treated the same as '|' for the purpose of representing the ':' in a
> drivespec.

I know that I personally won't touch this code, except for applying
patches. So if you have a clear vision of what needs to be changed
and how, submit a patch.
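
For concreteness, the behaviour you describe could be sketched as below.
This is not the existing library code: drive_path and its regular
expression are invented for illustration, and the sketch only covers the
leading drivespec, treating ':' and '|' alike.

```python
import re

# One or more slashes, a drive letter, then either ':' or '|'.
_DRIVESPEC = re.compile(r"^/+([A-Za-z])[:|](.*)$")

def drive_path(url_path):
    """Map a file URL path with a drivespec to a Windows-style path.

    Hypothetical helper: paths without a drivespec are returned as-is.
    """
    match = _DRIVESPEC.match(url_path)
    if match is None:
        return url_path
    drive, rest = match.groups()
    return drive.upper() + ":" + rest.replace("/", "\\")

print(drive_path("/C|/temp/x.txt"))
print(drive_path("/C:/temp/x.txt"))
```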

As for using regular expressions in the standard library: It seems you
believe this is discouraged. I don't know why you think so - I've never
heard of such a constraint before (in general - in specific cases,
submitters may have been told that alternatives are more efficient).

Regards,
Martin