[Python-Dev] urllib.quote and unicode bug resuscitation attempt

Tue Jul 11 15:55:46 CEST 2006

Hi,

urllib.quote fails on unicode strings that contain non-ASCII characters, and it does so in an unhelpful way::

   Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import urllib
   >>> urllib.quote('a\xf1a')
   'a%F1a'
   >>> urllib.quote(u'ana')
   'ana'
   >>> urllib.quote(u'a\xf1a')
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
     File "C:\Python24\lib\urllib.py", line 1117, in quote
       res = map(safe_map.__getitem__, s)
   KeyError: u'\xf1'

There is a (closed) tracker item, dated 2000-10-12,
http://sourceforge.net/tracker/?group_id=5470&atid=105470&aid=216716&func=detail
and there was a note added to PEP-42 by Guido.

According to a message I found on quixote-users,
http://mail.mems-exchange.org/durusmail/quixote-users/5363/
it might have worked prior to 2.4.2.
(I guess that this changed because of ascii now being the default encoding?)

BTW, a patch by rhettinger from 8 months or so ago allows urllib.unquote 
to operate transparently on unicode strings::

   >>> urllib.unquote('a%F1a')
   'a\xf1a'
   >>> urllib.unquote(u'a%F1a')
   u'a\xf1a'

I suggest adding (after 2.5, I assume) one of the following to the 
beginning of urllib.quote, either to fail early and consistently on 
unicode arguments with a clearer error message::

   if isinstance(s, unicode):
       raise TypeError("quote needs a byte string argument, not unicode,"
                       " use `argument.encode('utf-8')` first.")

or to do The Right Thing (tm), which is utf-8 encoding::

   if isinstance(s, unicode):
       s = s.encode('utf-8')

as suggested in
http://www.w3.org/International/O-URL-code.html
and RFC 3986.
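
To make the two options concrete, here is a minimal sketch of both as standalone wrappers around quote, rather than as a patch to urllib itself. The names quote_strict and quote_utf8 are made up for illustration. The import fallback and the text_type alias are assumptions on my part, there only so the sketch also runs on interpreters where quote lives in urllib.parse and the unicode builtin does not exist:

```python
try:
    from urllib import quote          # location used in this post
except ImportError:
    from urllib.parse import quote    # assumed fallback for newer interpreters

# The text (unicode) string type, obtained without naming the builtin directly.
text_type = type(u"")


def quote_strict(s, safe="/"):
    """Hypothetical wrapper: reject unicode early with a clear message."""
    if isinstance(s, text_type):
        raise TypeError("quote needs a byte string argument, not unicode,"
                        " use `argument.encode('utf-8')` first.")
    return quote(s, safe)


def quote_utf8(s, safe="/"):
    """Hypothetical wrapper: encode unicode as UTF-8, then quote the bytes."""
    if isinstance(s, text_type):
        s = s.encode("utf-8")
    return quote(s, safe)
```

With the second variant, quote_utf8(u'a\xf1a') gives 'a%C3%B1a', whereas the byte string 'a\xf1a' still quotes to 'a%F1a' as before. So the unicode result is the UTF-8 form and not the Latin-1 form shown at the top of this message.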

cheers,
stefan