[Python-Dev] urllib unicode handling
thomaspinckney3 at gmail.com
Wed May 7 15:19:41 CEST 2008
I may be missing something, but it seems that RFC 3987 (which is about
IRIs) basically says:
1) IRIs are identical to URIs except they may have unicode characters
2) IRIs must be converted to URIs before being used in HTTP
3) The way to convert IRIs to URIs is to UTF-8 encode the unicode
characters in the IRI and then percent-encode the resulting octets
that are unsafe to have in a URI
4) There's some ambiguity over what to do with the hostname portion of
the URI if it has one (IDN, replace non-ASCII characters with dashes)
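
To make the four points above concrete, here is a minimal sketch of an
IRI-to-URI conversion. It is written against present-day Python 3
(urllib.parse) so that it runs as-is, not against the Python 2 urllib
discussed in this thread. The helper name iri_to_uri and the set of "safe"
characters are assumptions for illustration, and using the idna codec for
the hostname is only one possible answer to the ambiguity in point 4.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched: RFC 3986 reserved characters plus "%", so that
# escapes already present in the input are not encoded a second time.
_SAFE = "/:@!$&'()*+,;=%?"


def iri_to_uri(iri):
    """Hypothetical helper: convert an IRI (str) to an ASCII-only URI (str).

    Userinfo and IPv6 literals in the authority are not handled here.
    """
    parts = urlsplit(iri)

    # Point 4: one option for the hostname is IDNA (Punycode) encoding.
    host = parts.hostname or ""
    if host:
        host = host.encode("idna").decode("ascii")
    netloc = host if parts.port is None else "%s:%d" % (host, parts.port)

    # Point 3: quote() UTF-8 encodes str input by default, then
    # percent-encodes every octet that is not in the safe set.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE)
    fragment = quote(parts.fragment, safe=_SAFE)

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://example.com/caf\u00e9?q=\u00fc"))
# http://example.com/caf%C3%A9?q=%C3%BC
```
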
If this is indeed the case, it sounds perfectly legal (according to
the RFC) and perfectly practical (as required by numerous popular
websites) to have urllib.quote and urllib.quote_plus do an automatic
UTF-8 encoding of unicode strings before percent encoding them.
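
A rough sketch of the behaviour proposed here, again written for Python 3
so that it is runnable. The wrapper name quote_utf8 is hypothetical. Note
that Python 3's urllib.parse.quote already encodes str input as UTF-8 by
default, so the explicit encode below only spells out the proposed step.

```python
from urllib.parse import quote


def quote_utf8(value, safe="/"):
    """Hypothetical wrapper: UTF-8 encode text, then percent-encode it."""
    if isinstance(value, str):
        # The automatic step proposed above: unicode text -> UTF-8 octets.
        value = value.encode("utf-8")
    # Byte strings are passed through and percent-encoded octet by octet.
    return quote(value, safe=safe)


print(quote_utf8("caf\u00e9 au lait"))
# caf%C3%A9%20au%20lait
```
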
It's not entirely clear to me if people should be calling urllib.quote
on hostnames and expecting them to be encoded properly if the hostname
contains non-ASCII characters. Perhaps the docs should be clarified on
this point.
Similarly, urllib.unquote should percent-decode characters and then
attempt to convert the resulting octets from UTF-8 to unicode. If
that conversion fails, we can assume the octets should be returned as
a byte string rather than a unicode string.
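
The decode-with-fallback idea could look roughly like this in Python 3.
The function name unquote_guess is hypothetical; unquote_to_bytes is the
existing urllib.parse function that does the percent-decoding.

```python
from urllib.parse import unquote_to_bytes


def unquote_guess(value):
    """Hypothetical helper: percent-decode, then try UTF-8.

    Returns str when the octets are valid UTF-8, otherwise the raw bytes.
    """
    octets = unquote_to_bytes(value)
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 (for example Latin-1 escapes): hand back bytes.
        return octets


print(repr(unquote_guess("caf%C3%A9")))  # 'café'
print(repr(unquote_guess("caf%E9")))     # b'caf\xe9'
```

One consequence of this design is that the return type depends on the
input data, so callers have to be prepared for either type.
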
On May 7, 2008, at 8:12 AM, Armin Ronacher wrote:
> Jeroen Ruigrok van der Werven <asmodai <at> in-nomine.org> writes:
>> Would people object if such functionality got added to urllib?
> I would ;-) There are IRIs, just that nobody wrote a useful module
> for that.
> There are algorithms in the RFC that can convert URIs to IRIs and
> the other way round. IMO that's the way to go.