[Python-Dev] urllib unicode handling

Wed May 7 18:11:34 CEST 2008

Maybe I didn't read the RFC quite right, but it seemed to leave
hostname handling as a choice between IDNA-encoding the hostname and
replacing the non-ASCII characters with dashes. I guess in practice
IDNA is the right decision.

Another part I wasn't clear on is whether urllib.quote() knows if
it's working on URIs, URLs, or arbitrary strings. From the
documentation, it looks like it expects to work only on the path
component of a URL. If so, it doesn't need to know what to do when
the IRI contains a hostname.
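
To make that split concrete, here is a rough sketch of how the pieces
could fit together. It is written against Python 3's urllib.parse, not
the urllib module under discussion, and iri_to_uri is a made-up name,
not an existing function. It assumes the host gets IDNA and everything
else gets UTF-8 plus percent-encoding; it drops any userinfo, and the
"safe" sets are my guesses, not something taken from the RFC.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Hypothetical helper: map an IRI to an ASCII-only URI."""
    parts = urlsplit(iri)

    # Host: handled separately from the path, via the stdlib "idna" codec.
    host = parts.hostname or ""
    if host:
        host = host.encode("idna").decode("ascii")
    netloc = host
    if parts.port is not None:
        netloc = "%s:%d" % (host, parts.port)

    # Everything else: encode as UTF-8, then percent-encode.
    # "%" is kept safe so existing escapes are not encoded twice
    # (an assumption of this sketch).
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://www.\u212bngstr\xf6m.com/caf\xe9"))
```

The point is only that quote() never sees the hostname here, so it can
stay a path-level (or component-level) function.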

Seems like the other somewhat under-specified part of all of this is
how urllib.unquote() should work. If after percent-decoding it sees
non-ASCII octets, should it try to decode them as UTF-8 and, if that
fails, leave them as is?
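
A minimal sketch of the behaviour I have in mind, again in Python 3
terms. unquote_guess is a hypothetical name, and returning bytes on
failure is just one reading of "leave them as is"; this is not what
any urllib function currently does.

```python
from urllib.parse import unquote_to_bytes


def unquote_guess(s):
    """Hypothetical helper: percent-decode, then try UTF-8."""
    raw = unquote_to_bytes(s)  # undo the %XX escapes, giving raw octets
    try:
        return raw.decode("utf-8")  # octets were valid UTF-8
    except UnicodeDecodeError:
        return raw  # not UTF-8: hand back the octets untouched


print(repr(unquote_guess("caf%C3%A9")))  # valid UTF-8 escape sequence
print(repr(unquote_guess("caf%E9")))     # Latin-1 octet, not valid UTF-8
```

The obvious downside is that the return type depends on the data, which
may be exactly why this is under-specified.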

On May 7, 2008, at 11:55 AM, Robert Brewer wrote:

> "Martin v. Löwis" wrote:
>> The proper way to implement this would be IRIs (RFC 3987),
>> in particular section 3.1. This is not as simple as just
>> encoding it as UTF-8, as you might have to apply IDNA to
>> the host part.
>>
>> Code doing so just hasn't been contributed yet.
>
> But if someone wanted to do so, it's pretty simple:
>
>>>> u'www.\u212bngstr\xf6m.com'.encode("idna")
> 'www.xn--ngstrm-hua5l.com'
>
>
> Robert Brewer
> fumanchu at aminus.org
>