[Python-Dev] urllib.urlopen() vs IDNs, percent-encoded hosts, ':'
"Martin v. Löwis"
martin at v.loewis.de
Thu Sep 16 08:37:39 CEST 2004
Mike Brown wrote:
> No. The intent is actually that a URI is (not conceptually, just *is*) a
> string of characters
You are right: URIs are meant to be written on paper. However, RFC 2396
also acknowledges that the issue of non-ASCII characters is unresolved.
It suggests (in 2.1) that the URI scheme should specify the
interpretation of byte values.
> This was actually clear in RFC 2396 sections 1.5 and 2, but has been explained
> somewhat better in the rephrased section 2 of rfc2396bis, which is in Last
> Call.
This suggests that new URI schemes should mandate UTF-8 in the
components, but is silent on the issue of existing schemes.
> The question is, does the url argument to urlopen() purport to be or is it
> assumed to be a URL? The function is quite lenient about what it accepts as a
> URL -- it accepts pretty much anything you give it, be it unicode or str, with
> or without a scheme component, relative to some unknown base, and loaded with
> illegal characters, and it tries to deal with it as best it can -- yet it
> still rejects or inconsistently handles some valid URIs, and this is what I
> want to see changed.
If something passed to it is clearly a valid URL, and there is a clear
definition of how a computer should process it, and urllib doesn't, then
this is certainly a bug and should be fixed. Can you give an example of
such a URL?
> Perhaps I should rephrase part of the issue this way: If the argument to
> urlopen() is assumed to be a URI, then %FF in the argument should not be
> interpreted any differently when the argument is a str vs when it is unicode.
Certainly. Indeed, urllib makes no distinction, AFAICT.
"http://localhost/%FF" and u"http://localhost/%FF" are processed in
the same way.
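A minimal sketch of that point, written against the Python 3 urllib.parse module rather than the 2004 urllib under discussion (so the str/unicode pair becomes bytes/str here). It only illustrates that %FF names the same octet whichever string type carries the URL, and that turning the octet into a character requires choosing a charset, which is the part RFC 2396 leaves open:

```python
from urllib.parse import urlsplit, unquote, unquote_to_bytes

# The same URL carried as bytes and as text.
as_bytes = b"http://localhost/%FF"
as_text = u"http://localhost/%FF"

# Splitting leaves the escape untouched in both cases.
path_b = urlsplit(as_bytes).path   # b"/%FF"
path_t = urlsplit(as_text).path    # "/%FF"

# Decoding the escape gives the same octet either way.
octets_b = unquote_to_bytes(path_b)
octets_t = unquote_to_bytes(path_t)

# Mapping the octet to a character depends on the charset chosen.
latin1_view = unquote(path_t, encoding="latin-1")
utf8_view = unquote(path_t, encoding="utf-8", errors="replace")
```

Here octets_b and octets_t are both b"/\xff"; the Latin-1 view yields U+00FF, while the UTF-8 view yields the replacement character because a lone 0xFF is not valid UTF-8.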
> RFC 2396 left it ambiguous as to what characters are represented by %80-%FF,
> so an implementation thereof may make such interpretations as it pleases.
> The current implementation doesn't do this in a consistent manner.
No. RFC 2396 defers that specification to the individual URI scheme.
>>Applications that put URL-escaped UTF-8 bytes into host names deserve to
>>lose.
>
>
> Come February or whenever rfc2396bis and the IRI draft become RFCs, that
> will no longer be a position you can maintain.
I see. I think I could accept a patch in this direction for
Python 2.4 even if RFC2396bis isn't published, assuming the patch
arrives before 2.4b1.
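For illustration only, here is roughly what handling a percent-encoded host could look like, sketched in Python 3 and not taken from any actual patch. It assumes the escapes in the host are UTF-8 (as rfc2396bis proposes), and both the host name and the helper name host_to_ascii are made up for the example:

```python
from urllib.parse import unquote

def host_to_ascii(host):
    """Hypothetical helper: decode %XX escapes in a host name as UTF-8,
    then convert the result to its ASCII (IDNA) form for DNS lookup."""
    decoded = unquote(host, encoding="utf-8", errors="strict")
    return decoded.encode("idna").decode("ascii")

# Example host (an assumption, not from the thread): "bücher.example"
ascii_host = host_to_ascii("b%C3%BCcher.example")
plain_host = host_to_ascii("localhost")
```

With the built-in "idna" codec, the first call produces "xn--bcher-kva.example"; a plain ASCII host such as "localhost" passes through unchanged.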
> Let me be clear though - I am not suggesting getting rid of support for '|'.
> I am merely saying that there is no reason ':' should, on Windows, fail to
> be treated the same as '|' for the purpose of representing the ':' in a
> drivespec.
I know that I personally won't touch this code, except for applying
patches. So if you have a clear vision of what needs to be changed
and how, submit a patch.
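To make the request concrete, a rough sketch of treating ':' and '|' alike in a drivespec follows. This is not urllib's code; the helper name normalize_drivespec and the choice to strip the leading slashes are assumptions made for the example:

```python
import re

# A drive letter followed by '|' or ':', optionally preceded by slashes,
# and followed by '/' or the end of the string.
_DRIVESPEC = re.compile(r"^/*([A-Za-z])[|:](?=/|$)")

def normalize_drivespec(url_path):
    """Hypothetical helper: rewrite a leading 'X|' or 'X:' as 'X:' and
    drop the slashes in front of it; other paths are returned unchanged."""
    return _DRIVESPEC.sub(lambda m: m.group(1).upper() + ":", url_path, count=1)

with_pipe = normalize_drivespec("///C|/temp/a.txt")
with_colon = normalize_drivespec("///c:/temp/a.txt")
no_drive = normalize_drivespec("/usr/lib/python")
```

Both drivespec spellings come out as "C:/temp/a.txt", and a path without a drive letter is left alone.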
As for using regular expressions in the standard library: It seems you
believe this is discouraged. I don't know why you think so - I've never
heard of such a constraint before (in general - in specific cases,
submitters may have been told that alternatives are more efficient).
Regards,
Martin