[Python-Dev] urllib unicode handling

Wed May 7 22:04:29 CEST 2008

I was assuming urllib.quote/unquote would only be called on text
intended for the non-hostname portions of a URI. I'm not sure whether
that is the actual intent of urllib.quote; perhaps the documentation
should be updated to specify precisely what it does, so that people can
decide which parts of a URI it is appropriate to quote or unquote. I
don't believe quote/unquote does anything sensible today with hostnames
that contain characters outside printable ASCII, so this is no loss of
existing functionality.
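
To illustrate what "non-hostname portions" means in practice, here is a
rough sketch that percent-encodes only the path and query and leaves
the hostname alone. It is written for a current Python 3 interpreter
(an assumption on my part), and quote_non_host is a hypothetical name,
not an existing urllib function.

```python
from urllib.parse import quote_from_bytes, urlsplit, urlunsplit


def quote_non_host(url):
    """Hypothetical helper: percent-encode path and query, not the host."""
    parts = urlsplit(url)
    # UTF-8 encode first, then percent-encode the resulting bytes.
    path = quote_from_bytes(parts.path.encode("utf-8"), safe="/")
    # Keep '=' and '&' so the query structure survives (a simplification).
    query = quote_from_bytes(parts.query.encode("utf-8"), safe="=&")
    # The netloc (hostname) is passed through untouched.
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


print(quote_non_host("http://example.com/caf\u00e9?q=na\u00efve"))
# http://example.com/caf%C3%A9?q=na%C3%AFve
```

The hostname would need separate handling, which this sketch
deliberately does not attempt.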

Re your suggestion that IRIs should be a separate module: I guess my  
thought is that urllib out of the box should just work with the way  
websites on the web today actually work. Thus, we should make urllib  
do the utf-8 encode / decode rather than make users switch to a  
different module for certain URLs and another library for other URLs.
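
A minimal sketch of the encode-then-quote behaviour, assuming a current
Python 3 interpreter; quote_text is a hypothetical name used only for
illustration.

```python
from urllib.parse import quote_from_bytes


def quote_text(text, safe="/"):
    """Hypothetical helper: UTF-8 encode a string, then percent-encode it."""
    return quote_from_bytes(text.encode("utf-8"), safe=safe)


print(quote_text("/wiki/caf\u00e9"))  # /wiki/caf%C3%A9
```

Plain ASCII input comes out unchanged, so callers who never pass
non-ASCII text would see no difference.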

Re the specific issue of how urllib.unquote should work: Perhaps there
could be an optional second argument that specifies the character
encoding to use when decoding escaped characters? I would propose that
this parameter default to utf-8, since that is what most websites seem
to use, but if the author knew that the website they were using encoded
its URLs in an ISO-8859 encoding, they could unquote using that
encoding instead.
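
Roughly what I have in mind, as a sketch rather than a proposed patch.
It assumes a current Python 3 interpreter, and unquote_text and its
encoding parameter are hypothetical names chosen for illustration.

```python
from urllib.parse import unquote_to_bytes


def unquote_text(s, encoding="utf-8"):
    """Hypothetical helper: undo percent-escapes, then decode the bytes."""
    return unquote_to_bytes(s).decode(encoding)


# Default: the escaped bytes are interpreted as UTF-8.
print(unquote_text("caf%C3%A9"))
# A caller who knows the site uses ISO-8859-1 can say so explicitly.
print(unquote_text("caf%E9", encoding="iso-8859-1"))
```

Both calls yield the same text; only the bytes behind the escapes
differ. Passing the wrong encoding raises an error or produces garbage,
which is why the caller needs a way to override the default.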

On May 7, 2008, at 3:10 PM, Martin v. Löwis wrote:

>> If this is indeed the case, it sounds perfectly legal (according to
>> the RFC) and perfectly practical (as required by numerous popular
>> websites) to have urllib.quote and urllib.quote_plus do an automatic
>> UTF-8 encoding of unicode strings before percent encoding them.
>
> It's probably legal, but I don't understand why you think it's
> practical. The DNS lookup then will certainly fail, no?
>
> Regards,
> Martin