[Python-Dev] urllib.quote and unquote - Unicode issues

Sat Jul 12 19:27:16 CEST 2008

Hi all,

This is my first post to the list. I'm a first-time Python hacker, though a
long-time Python user (from Melbourne, Australia).

Some of you may have seen the bug report I filed on Roundup over the past
week or so:
http://bugs.python.org/issue3300

I've spent a heap of effort on this patch now so I'd really like to get some
more opinions and have this patch considered for Python 3.0.

Basically, urllib.quote and unquote seem not to have been updated since
Python 2.5, and as a result they implicitly perform Latin-1 encoding and
decoding of percent-encoded characters. I think they should default to
UTF-8 for a number of reasons, including that UTF-8 is what other software,
such as web browsers, uses.
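
To make the difference concrete, here is a minimal sketch of what the two
encodings do to a single non-ASCII character. It is not code from the patch:
percent_encode is a hypothetical helper written only for this illustration,
and it escapes every byte, which the real quote function does not do.

```python
def percent_encode(text, encoding):
    """Hypothetical helper: percent-encode every byte of text.

    Unlike the real quote(), this escapes all bytes, including ASCII ones,
    so that only the effect of the chosen encoding is visible.
    """
    return "".join("%{:02X}".format(byte) for byte in text.encode(encoding))


# U+00E9 (e with acute accent) is one byte in Latin-1 but two in UTF-8.
latin1_form = percent_encode("\u00e9", "latin-1")
utf8_form = percent_encode("\u00e9", "utf-8")

print(latin1_form)  # %E9
print(utf8_form)    # %C3%A9
```

A receiver that assumes one encoding while the sender used the other will
decode the wrong character, or fail to decode at all.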

I've submitted a patch which fixes quote and unquote to use UTF-8 by
default. I also added extra arguments allowing the caller to choose the
encoding (after discussion, there was some consensus that this would be
beneficial). I have now finished updating the documentation, writing
extensive test cases, and testing the rest of the standard library for code
breakage. There wasn't really any; everything seems to just work nicely
with UTF-8. You can read the sordid details of my investigation in the
tracker.
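
As a rough sketch of the behaviour described above, the following uses
quote and unquote from urllib.parse, with the encoding keyword argument, as
they exist in the current Python 3 standard library. That module location
and those signatures are an assumption relative to this post, and the exact
interface in the patch on the tracker may differ.

```python
from urllib.parse import quote, unquote

text = "caf\u00e9"  # "cafe" with an acute accent on the final letter

# With no encoding argument, non-ASCII characters are encoded as UTF-8.
quoted_default = quote(text)

# The caller can choose a different encoding explicitly.
quoted_latin1 = quote(text, encoding="latin-1")

# Decoding with the matching encoding recovers the original string.
roundtrip_utf8 = unquote(quoted_default)
roundtrip_latin1 = unquote(quoted_latin1, encoding="latin-1")

print(quoted_default)  # caf%C3%A9
print(quoted_latin1)   # caf%E9
```

Both round trips give back the original string, provided the same encoding
is used on both sides.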

Firstly, it'd be nice to hear whether people think this is desirable
behaviour. Secondly, I'd like to know whether it's feasible to get this
patch into Python 3.0. (I think that if it were delayed to Python 3.1, the
code breakage wouldn't justify it.) And thirdly, if the answers to the first
two are positive, would anyone like to review this patch and check it in?

I have extensively tested it, and am now pretty confident that it won't
cause any grief if it's checked in.

Thanks very much,
Matt Giuca