[Python-Dev] urllib unicode handling

Wed May 7 15:19:41 CEST 2008

I may be missing something, but it seems that RFC 3987 (which is about  
IRIs) basically says:

1) IRIs are identical to URIs except they may have Unicode characters
in them
2) IRIs must be converted to URIs before being used in HTTP
3) The way to convert IRIs to URIs is to UTF-8 encode the Unicode
characters in the IRI and then percent-encode the resulting octets
that are unsafe to have in a URI
4) There's some ambiguity over what to do with the hostname portion of
the URI if it has one (IDN, replace non-ASCII characters with dashes,
etc.)
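
As a rough sketch of steps 3 and 4 (not the RFC's full algorithm), assuming
Python 3, where the quoting functions live in urllib.parse rather than in
urllib: iri_to_uri is a hypothetical helper name, the hostname is handled
with IDNA (only one of the options mentioned in point 4), and the lists of
"safe" characters are my own approximation.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in each component. "%" is kept so that
# existing percent-escapes are not encoded twice. These sets are an
# approximation chosen for illustration, not taken from the RFC text.
PATH_SAFE = "/%:@!$&'()*+,;=~"
QUERY_SAFE = PATH_SAFE + "?"


def iri_to_uri(iri):
    """Convert an IRI (str) to an ASCII-only URI (hypothetical helper)."""
    parts = urlsplit(iri)

    # Step 4: the hostname is treated separately. This sketch picks IDNA.
    # Note: any userinfo ("user:pw@") in the netloc is dropped here.
    netloc = parts.netloc
    if parts.hostname:
        host = parts.hostname.encode("idna").decode("ascii")
        netloc = host if parts.port is None else "%s:%d" % (host, parts.port)

    # Step 3: quote() UTF-8 encodes str input by default and then
    # percent-encodes every octet that is not in the safe set.
    path = quote(parts.path, safe=PATH_SAFE)
    query = quote(parts.query, safe=QUERY_SAFE)
    fragment = quote(parts.fragment, safe=QUERY_SAFE)

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://example.com/caf\u00e9?q=\u00fc"))
```

Splitting first and quoting each component separately keeps delimiters such
as "/" and "?" intact while still encoding the non-ASCII octets.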

If this is indeed the case, it sounds perfectly legal (according to
the RFC) and perfectly practical (as required by numerous popular
websites) to have urllib.quote and urllib.quote_plus automatically
UTF-8 encode unicode strings before percent-encoding them.
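
To make the proposed behaviour concrete, here is a minimal sketch written
against Python 3's urllib.parse.quote, which accepts both str and bytes.
quote_utf8 is a hypothetical wrapper name, not an existing function; it
only illustrates "UTF-8 encode first, then percent-encode".

```python
from urllib.parse import quote


def quote_utf8(value, safe="/"):
    """Percent-encode value, UTF-8 encoding it first if it is text."""
    if isinstance(value, str):
        # Text is turned into UTF-8 octets before quoting.
        value = value.encode("utf-8")
    # Byte strings are passed through and quoted octet by octet.
    return quote(value, safe=safe)


print(quote_utf8("caf\u00e9"))   # text: the UTF-8 octets get quoted
print(quote_utf8(b"caf\xe9"))    # bytes: quoted exactly as given
```

Callers that already hold a byte string in some other encoding are not
affected, since only text input is encoded.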

It's not entirely clear to me whether people should be calling
urllib.quote on hostnames and expecting them to be encoded properly if
the hostname contains non-ASCII characters. Perhaps the docs should be
clarified on this matter?

Similarly, urllib.unquote should percent-decode characters and then
attempt to convert the resulting octets from UTF-8 to unicode. If
that conversion fails, we can assume the octets should be returned as
a byte string rather than a unicode string.
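
A minimal sketch of that fallback, again assuming Python 3's urllib.parse
(unquote_to_bytes is a real function there; unquote_utf8 is a hypothetical
name used only for this illustration):

```python
from urllib.parse import unquote_to_bytes


def unquote_utf8(value):
    """Percent-decode value; return text if the octets are valid UTF-8,
    otherwise return the raw byte string."""
    raw = unquote_to_bytes(value)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so hand back the octets unchanged.
        return raw


print(repr(unquote_utf8("caf%C3%A9")))  # valid UTF-8: comes back as text
print(repr(unquote_utf8("caf%E9")))     # invalid UTF-8: comes back as bytes
```

One consequence of this design is that the return type depends on the
input data, so callers have to be prepared for either text or bytes.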

On May 7, 2008, at 8:12 AM, Armin Ronacher wrote:

> Hi,
>
> Jeroen Ruigrok van der Werven <asmodai <at> in-nomine.org> writes:
>
>> Would people object if such functionality got added to urllib?
> I would ;-)  There are IRIs, just that nobody wrote a useful module
> for that.  There are algorithms in the RFC that can convert URIs to
> IRIs and the other way round.  IMO that's the way to go.
>
> Regards,
> Armin