[Python-Dev] urllib.quote and unquote - Unicode issues

Wed Jul 30 16:11:40 CEST 2008

Hi folks,

This issue got some attention a few weeks back but it seems to have
fallen quiet, and I haven't had a good chance to sit down and reply
again till now.

As I've said before, this is a serious issue which will affect a great
deal of code. However, it's obviously not as clear-cut as I originally
believed, since there are lots of conflicting opinions. Let us see if
we can come to a consensus.

(For those who haven't seen the discussion, the thread starts here:
http://mail.python.org/pipermail/python-dev/2008-July/081013.html
continues here for some reason:
http://mail.python.org/pipermail/python-dev/2008-July/081066.html
and I've got a bug report with a fully tested and documented patch here:
http://bugs.python.org/issue3300)

Firstly, it looks like most people agree that we should add an
optional "encoding" argument which lets the caller choose which
encoding to use. What we tend to disagree about is what the default
encoding should be.

Here I present the various options as I see them (and I'm trying to be
impartial), along with the people who've indicated support for each
option (apologies if I've misrepresented anybody's opinion; feel free
to correct me):

1. Leave it as it is. quote uses Latin-1 for characters in
range(0, 256) and falls back to UTF-8 for anything else. unquote
always uses Latin-1.
In favour: Anybody who doesn't reply to this thread
Pros: Already implemented; some existing code depends upon ord values
of string being the same as they were for byte strings; possible to
hack around it.
Cons: unquote is not inverse of quote; quote behaviour
internally-inconsistent; garbage when unquoting UTF-8-encoded URIs.

2. Default to UTF-8.
In favour: Matt Giuca, Brett Cannon, Jeroen Ruigrok van der Werven
Pros: Fully working and tested solution is implemented; recommended by
RFC 3986 for all future schemes; recommended by W3C for use with HTML;
UTF-8 used by all major browsers; supports all characters; most
existing code compatible by default; unquote is inverse of quote.
Cons: By default, URIs may have invalid octet sequences (not possible
to reverse).

3. quote default to UTF-8, unquote default to Latin-1.
In favour: André Malo
Pros: quote able to handle all characters; unquote able to handle all sequences.
Cons: unquote is not inverse of quote; totally inconsistent.

4. quote accepts either bytes or str, unquote default to outputting
bytes unless given an encoding argument.
In favour: Bill Janssen
Pros: Technically does what the spec says, which is treat it as an
octet encoding.
Cons: unquote will break most existing code; almost 100% of the time
people will want it as a string.
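
To make the trade-offs above concrete, here is a minimal sketch. It
assumes the quote and unquote functions from urllib.parse as found in
later Python 3 releases, which take "encoding" and "errors" arguments;
it is not the code from the patch, and the sample string is only an
illustration.

```python
from urllib.parse import quote, unquote

text = "caf\u00e9"  # 'cafe' with an acute accent; U+00E9 fits in Latin-1

# Same encoding on both sides: unquote undoes quote.
utf8_quoted = quote(text, encoding="utf-8")       # 'caf%C3%A9'
latin1_quoted = quote(text, encoding="latin-1")   # 'caf%E9'
round_trip = unquote(utf8_quoted, encoding="utf-8")

# Mismatched defaults (quote as UTF-8, unquote as Latin-1): each UTF-8
# octet is decoded as a separate Latin-1 character, giving garbage.
garbled = unquote(utf8_quoted, encoding="latin-1")

# The con of a UTF-8 default: a lone %E9 is not valid UTF-8, so it
# cannot be decoded. errors="replace" substitutes U+FFFD and the
# original octet is lost.
lossy = unquote(latin1_quoted, encoding="utf-8", errors="replace")

print(utf8_quoted, latin1_quoted, round_trip == text)
print(ascii(garbled), ascii(lossy))
```

The mismatched case is the "garbage when unquoting" con of options 1
and 3; the last case is the "not possible to reverse" con of option 2.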

</impartiality>

I'll just comment on #4 since I haven't already. Let's talk about
quote and unquote separately. For quote, I'm all for letting it accept
a bytes as well as a str. That doesn't break anything or surprise
anyone.
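
As a rough illustration of quote taking either type (again assuming
urllib.parse.quote as it behaves in later Python 3 releases, not
necessarily what the patch does):

```python
from urllib.parse import quote

text = "caf\u00e9"

# str input: encoded (UTF-8 by default) and then percent-quoted.
from_str = quote(text)

# bytes input: the octets are percent-quoted exactly as given.
from_utf8_bytes = quote(text.encode("utf-8"))

# With bytes the caller has already chosen the encoding, so Latin-1
# octets pass straight through without quote having to guess.
from_latin1_bytes = quote(text.encode("latin-1"))

print(from_str, from_utf8_bytes, from_latin1_bytes)
```

Existing callers that pass a str see no difference, which is why I
don't expect this half to surprise anyone.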

For unquote, I think it will break a lot and surprise everyone. I
think that while this may be "purely" the best option, it's pretty
silly. I reckon the vast majority of users will be surprised when they
see it spitting out a bytes object, and all that most people will do
is decode it as UTF-8. Besides, while you're reading the RFCs as "URLs
specify a method for encoding octet sequences", I'm reading them as
"URLs specify a method for encoding strings, and leave the character
encoding unspecified." The second reading supports the idea that
unquote outputs a str.

I'm also recommending we add unquote_to_bytes to do what you suggest
unquote should do. (So either way we'll get both versions of unquote;
I'm just suggesting the one called "unquote" do the thing everybody
expects). But that's less of a priority so I want to commit these
urgent fixes first.
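
A small sketch of how the two flavours would sit side by side. It
assumes the names unquote and unquote_to_bytes as they exist in
urllib.parse in later Python 3 releases; the quoted string is just an
example.

```python
from urllib.parse import unquote, unquote_to_bytes

quoted = "caf%C3%A9"

# The "pure" octet view: percent-escapes become raw bytes.
as_bytes = unquote_to_bytes(quoted)

# The view most callers expect: a str, decoded as UTF-8 by default.
as_text = unquote(quoted)

# What a caller handed bytes would most likely do next anyway.
decoded_by_hand = as_bytes.decode("utf-8")

print(as_bytes, decoded_by_hand == as_text)
```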

I'm basically saying just two things: 1. The standards leave the
encoding undefined; 2. Therefore we should pick the most useful and/or
intuitive default. IMHO choosing UTF-8 *is* the most useful AND
intuitive, and will be more so in the future when more technologies are
hard-coded as UTF-8 (which RFC 3986 recommends for new schemes).

I am also quite adamant that unquote be the inverse of quote.
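
By "inverse" I mean the property checked below. This is only a sketch:
round_trips is a hypothetical helper, the samples are arbitrary, and it
assumes quote and unquote from urllib.parse in later Python 3 releases
with the same encoding on both sides.

```python
from urllib.parse import quote, unquote

# Arbitrary samples: ASCII, reserved characters, Latin-1 range,
# CJK, a literal percent sign, and a character outside the BMP.
samples = [
    "hello world",
    "a/b?c=d&e",
    "caf\u00e9",
    "\u65e5\u672c\u8a9e",
    "100%",
    "\U0001f600",
]

def round_trips(s, encoding="utf-8"):
    """Return True if unquote(quote(s)) gives back s unchanged."""
    return unquote(quote(s, encoding=encoding), encoding=encoding) == s

results = [round_trips(s) for s in samples]
print(all(results))
```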

Are there any more opinions on this matter? It would be good to reach
a consensus. If anyone seriously wants to push a different alternative
to mine, please write a working implementation and attach it to issue
3300.

On the technical side of things, does anybody have time to review my
patch for this issue?
http://bugs.python.org/issue3300
Patch 5.
It's just a patch for unquote, quote, and small related functions, as
well as numerous changes to test cases and documentation.

Cheers
Matt