[Python-Dev] urllib.urlopen() vs IDNs, percent-encoded hosts, ':'
"Martin v. Löwis"
martin at v.loewis.de
Wed Sep 15 23:40:01 CEST 2004
Mike Brown wrote:
> 1. urlopen() cannot reliably process unicode unless there are no
> percent-encoded octets above %7F and no characters above \u007f
> (I think that's the gist of it, at least).
And that feature is by design. URLs are conceptually byte strings,
not character strings, so passing Unicode strings is mostly a
meaningless operation. Mostly - because if the Unicode string is
pure ASCII, it probably matches most implementations and user
expectations to convert it to pure ASCII first, and then treat it
as a URL.
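To illustrate the pure-ASCII case, here is a minimal sketch in present-day Python 3, where str is the Unicode type. The helper name as_ascii_url is hypothetical and is not part of urllib; it only shows the "convert to ASCII first, otherwise refuse" behaviour described above.

```python
def as_ascii_url(url):
    """Return url as an ASCII byte string, or raise ValueError.

    Hypothetical helper, not part of urllib.
    """
    if isinstance(url, bytes):
        # Already a byte string: nothing to convert.
        return url
    try:
        # A pure-ASCII Unicode string converts losslessly.
        return url.encode("ascii")
    except UnicodeEncodeError:
        # Anything else is not a URL in the byte-string sense.
        raise ValueError("non-ASCII characters in URL: %r" % (url,))
```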
The IETF is working on resolving the issue by introducing IRIs. It
appears that draft-duerst-iri-09.txt is what will become the relevant
RFC. Once the RFC is published, urllib and urllib2 should be updated
to support IRIs; contributions are welcome.
> I don't think this is necessarily a bug, as a proper URI will never contain
> non-ASCII characters. However since urlopen()'s API is unfortunately such that
> it accepts OS-specific filesystem paths, which nowadays may be unicode, it may
> be time to tighten up the API and say that the url argument *must* be a URI,
> and that if unicode is given, it will be converted to str and thus must not
> contain non-ASCII characters.
No. I'd rather specify that if it is a Unicode string, it
must be an IRI, and is converted to a URI according to the IRI spec.
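A rough sketch of what such a conversion could look like, written for Python 3 with the standard library only. The function name iri_to_uri is hypothetical; the sketch assumes the host is converted with the built-in "idna" codec and everything else is percent-encoded as UTF-8, it drops any userinfo, and it makes no claim to cover the whole IRI spec.

```python
from urllib.parse import urlsplit, urlunsplit, quote

# Characters left alone when percent-encoding; "%" is kept so that
# escapes already present in the IRI are not encoded a second time.
_SAFE = "/%:@!$&'()*+,;=~"

def iri_to_uri(iri):
    """Convert a Unicode IRI to an ASCII-only URI (illustrative sketch)."""
    parts = urlsplit(iri)
    # Host: use IDNA (ToASCII) rather than percent-encoding.
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc += ":%d" % parts.port
    # Path, query, fragment: percent-encode the UTF-8 octets.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE + "?")
    fragment = quote(parts.fragment, safe=_SAFE + "?")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))
```

A URI that is already pure ASCII passes through unchanged, which matches the ASCII-only behaviour discussed earlier.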
> 2. urlopen() (the URI scheme-specific openers it uses, actually) does not
> percent-decode the host portion of a URL before doing a DNS lookup.
> This wasn't really a problem until IDNs came along; no one was using non-ASCII
> in their hostnames. But now we have to deal with URLs where the host component
> is a string of percent-encoded UTF-8 octets.
Hmm. I think there is no backing in any standard for doing that.
Applications that put URL-escaped UTF-8 bytes into host names deserve to
lose. There are two valid ways for putting non-ASCII characters into the
hostname part of a URL: use Unicode strings, or use IDNA. It may be
that IRIs add another way (I haven't checked this aspect specifically),
but unless there is some RFC supporting such a protocol, any response
by urllib is fine, exceptions preferred.
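As a sketch of that policy in Python 3: accept a Unicode host or a host already in IDNA form, and raise for percent-escaped hosts. The name normalize_host is hypothetical, and the choice of ValueError is an assumption made for illustration, not what urllib does.

```python
def normalize_host(host):
    """Return the ASCII (IDNA) form of host, rejecting percent-escapes.

    Hypothetical helper, not part of urllib.
    """
    if "%" in host:
        # No RFC backs percent-escaped octets in the host, so refuse.
        raise ValueError("percent-escaped host not supported: %r" % (host,))
    # Unicode labels become "xn--..." labels; ASCII labels pass through.
    return host.encode("idna").decode("ascii")
```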
> Even though IDNs are the main application for percent-encoded octets in the
> host component, it is necessary in simpler cases as well, like
> which would need to be interpreted as
We would have to check: this might be valid usage, but I somewhat doubt
> urllib's urlopeners were *not* updated accordingly. This should be changed.
The change was deliberately deferred until the IRI RFC is published.
> 3. On Windows, urlopen() only recognizes '|' as a Windows drivespec character,
> whereas ':' is just as, if not more, common in 'file' URIs.
I gave up trying to understand this issue long ago. I'm happy to
change this back and forth about once or twice a year, until somebody
comes up with a clear and definitive story, backed up by standards and
product documentation, so that we might get a stable implementation some
day. Feel free to write patches.
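For anyone who wants a starting point, here is a small Python 3 sketch that treats '|' and ':' alike after the drive letter in a 'file' URL. The function name and the regular expression are assumptions made for illustration; this is not how urllib handles drive specs, and it is not backed by any standard.

```python
import re
from urllib.parse import urlsplit, unquote

# "/C:/..." or "/C|/...": one drive letter followed by ':' or '|'.
_DRIVE = re.compile(r"^/([A-Za-z])[:|](/.*)?$")

def file_url_to_windows_path(url):
    """Map file:///C:/x or file:///C|/x to a path like C:\\x (sketch only)."""
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError("not a file URL: %r" % (url,))
    match = _DRIVE.match(unquote(parts.path))
    if match is None:
        raise ValueError("no drive spec in %r" % (url,))
    drive = match.group(1).upper()
    rest = match.group(2) or "/"
    # Always emit ':' regardless of which separator the URL used.
    return drive + ":" + rest.replace("/", "\\")
```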