[Python-Dev] urllib.quote and unicode bug resuscitation attempt
Stefan Rank
stefan.rank at ofai.at
Tue Jul 11 15:55:46 CEST 2006
Hi,
urllib.quote fails on unicode strings that contain non-ASCII characters, and in an unhelpful way::
    Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)]
    on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import urllib
    >>> urllib.quote('a\xf1a')
    'a%F1a'
    >>> urllib.quote(u'ana')
    'ana'
    >>> urllib.quote(u'a\xf1a')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "C:\Python24\lib\urllib.py", line 1117, in quote
        res = map(safe_map.__getitem__, s)
    KeyError: u'\xf1'
There is a (closed) tracker item, dated 2000-10-12,
http://sourceforge.net/tracker/?group_id=5470&atid=105470&aid=216716&func=detail
and there was a note added to PEP-42 by Guido.
According to a message I found on quixote-users,
http://mail.mems-exchange.org/durusmail/quixote-users/5363/
it might have worked prior to 2.4.2.
(I guess that this changed because of ascii now being the default encoding?)
BTW, a patch by rhettinger from 8 months or so ago allows urllib.unquote
to operate transparently on unicode strings::
    >>> urllib.unquote('a%F1a')
    'a\xf1a'
    >>> urllib.unquote(u'a%F1a')
    u'a\xf1a'
I suggest adding (after 2.5, I assume) one of the following to the
beginning of urllib.quote, either to fail early and consistently on
unicode arguments with a better error message::
    if isinstance(s, unicode):
        raise TypeError("quote needs a byte string argument, not unicode,"
                        " use `argument.encode('utf-8')` first.")
or to do The Right Thing (tm), which is utf-8 encoding::
    if isinstance(s, unicode):
        s = s.encode('utf-8')
as suggested in
http://www.w3.org/International/O-URL-code.html
and rfc3986.
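To make the two options concrete, here is a minimal standalone sketch rather than a patch to urllib itself. The names `quote_strict`, `quote_utf8`, `_quote_bytes`, `ALWAYS_SAFE` and `TEXT_TYPE` are hypothetical, the safe-character set is my assumption of what urllib uses by default (letters, digits, `_.-`, plus `/`), and `type(u"")` stands in for `unicode` so that the sketch also runs on interpreters that lack that name. It relies on `bytearray`, so it will not run on 2.4 as written.

```python
# Hypothetical helpers; none of these names exist in urllib.
ALWAYS_SAFE = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
               "abcdefghijklmnopqrstuvwxyz"
               "0123456789_.-")

TEXT_TYPE = type(u"")  # `unicode` on Python 2.x


def _quote_bytes(data, safe):
    """Percent-encode every byte of `data` that is not in the safe set."""
    safe_bytes = bytearray((ALWAYS_SAFE + safe).encode("ascii"))
    out = []
    for b in bytearray(data):
        if b in safe_bytes:
            out.append(chr(b))        # safe byte: emit as-is
        else:
            out.append("%%%02X" % b)  # unsafe byte: emit as %XX
    return "".join(out)


def quote_strict(s, safe="/"):
    """Option 1: reject unicode input early with a helpful message."""
    if isinstance(s, TEXT_TYPE):
        raise TypeError("quote needs a byte string argument, not unicode,"
                        " use `argument.encode('utf-8')` first.")
    return _quote_bytes(s, safe)


def quote_utf8(s, safe="/"):
    """Option 2: encode unicode input as UTF-8 before quoting."""
    if isinstance(s, TEXT_TYPE):
        s = s.encode("utf-8")
    return _quote_bytes(s, safe)
```

With this sketch, `quote_utf8(u'a\xf1a')` yields `'a%C3%B1a'`, while the latin-1 byte string `'a\xf1a'` still yields `'a%F1a'` as in the session at the top.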
cheers,
stefan