[Python-Dev] urllib.quote and unicode bug resuscitation attempt

John J Lee jjl at pobox.com
Tue Jul 11 20:43:22 CEST 2006


On Tue, 11 Jul 2006, Stefan Rank wrote:

> urllib.quote fails on unicode strings and in an unhelpful way::
[...]
>   >>> urllib.quote(u'a\xf1a')
>   Traceback (most recent call last):
>     File "<stdin>", line 1, in ?
>     File "C:\Python24\lib\urllib.py", line 1117, in quote
>       res = map(safe_map.__getitem__, s)
>   KeyError: u'\xf1'

More helpful than silently producing the wrong answer.
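
To illustrate what "the wrong answer" could mean: the same character percent-encodes differently depending on which byte encoding is chosen first. The sketch below assumes Python 3, where the function lives at `urllib.parse.quote` and accepts bytes; it is not the 2.4 code under discussion.

```python
# Sketch only, assuming Python 3 (urllib.parse.quote rather than urllib.quote).
from urllib.parse import quote

text = "a\xf1a"  # the string from the traceback above

# Percent-encode the bytes produced by two different encodings.
as_utf8 = quote(text.encode("utf-8"))
as_latin1 = quote(text.encode("latin-1"))

print(as_utf8)    # a%C3%B1a
print(as_latin1)  # a%F1a
```

Both results are valid percent-encoded strings, so a silent guess at the encoding cannot be detected by the caller afterwards.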


[...]
> I suggest to add (after 2.5 I assume) one of the following to the
> beginning of urllib.quote to either fail early and consistently on
> unicode arguments and improve the error message::
>
>   if isinstance(s, unicode):
>       raise TypeError("quote needs a byte string argument, not unicode,"
>                       " use `argument.encode('utf-8')` first.")

Won't this break existing code that catches the KeyError, for no big 
benefit?  If nobody is yet sure what the Right Thing is (see below), I 
think we should not change this yet.


> or to do The Right Thing (tm), which is utf-8 encoding::
>
>   if isinstance(s, unicode):
>       s = s.encode('utf-8')
>
> as suggested in
> http://www.w3.org/International/O-URL-code.html
> and rfc3986.
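
For comparison, the two quoted proposals can be sketched side by side. This is a minimal sketch assuming Python 3, where the text type is `str` and the function is `urllib.parse.quote`; the wrapper names `quote_strict` and `quote_utf8` are hypothetical and not part of any stdlib.

```python
# Sketch only, assuming Python 3. Helper names are hypothetical.
from urllib.parse import quote


def quote_strict(s, safe="/"):
    """Proposal 1: fail early on text input with a clear error message."""
    if isinstance(s, str):
        raise TypeError("quote needs a byte string argument, not unicode; "
                        "use argument.encode('utf-8') first")
    return quote(s, safe=safe)


def quote_utf8(s, safe="/"):
    """Proposal 2: encode text as UTF-8, then percent-encode the bytes."""
    if isinstance(s, str):
        s = s.encode("utf-8")
    return quote(s, safe=safe)


print(quote_utf8("a\xf1a"))  # a%C3%B1a
```

The first variant changes which exception callers see; the second commits to UTF-8 on the caller's behalf.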

You seem quite confident of that.  You may be correct, but have you read 
all of the following?  (not trying to claim superior knowledge by asking 
that, I just dunno what the right thing is yet: I haven't yet read RFC 
2617 or got my head around what the unicode issues are or how they should 
apply to the Python stdlib)

http://www.ietf.org/rfc/rfc2617.txt

http://www.ietf.org/rfc/rfc2616.txt

http://en.wikipedia.org/wiki/Percent-encoding

http://mail.python.org/pipermail/python-dev/2004-September/048944.html


Also note the recent discussions here about a module named "uriparse" or 
"urischemes", which fits into this somewhere.  It would be good to make 
all the following changes in a single Python release (2.6, with luck):

  - extend / modify urllib and urllib2 to handle unicode input

  - address the urllib.quote issue you raise above (+ consider the other
    utility functions in that module)

  - add the urischemes module


In summary, I agree that your suggested fix (and all of the rest I refer 
to above) should wait for 2.6, unless somebody (Martin?) who understands 
all these issues is quite confident your suggested change is OK. 
Presumably the release managers wouldn't allow it in 2.5 anyway.


John

