Re: [Python-Dev] urllib.quote and unquote - Unicode issues

Hi folks,

This issue got some attention a few weeks back but it seems to have fallen quiet, and I haven't had a good chance to sit down and reply again till now. As I've said before this is a serious issue which will affect a great deal of code. However it's obviously not as clear-cut as I originally believed, since there are lots of conflicting opinions. Let us see if we can come to a consensus.

(For those who haven't seen the discussion, the thread starts here:
http://mail.python.org/pipermail/python-dev/2008-July/081013.html
continues here for some reason:
http://mail.python.org/pipermail/python-dev/2008-July/081066.html
and I've got a bug report with a fully tested and documented patch here:
http://bugs.python.org/issue3300)

Firstly, it looks like most of the people agree we should add an optional "encoding" argument which lets the caller customize which encoding to use. What we tend to disagree about is what the default encoding should be.

Here I present the various options as I see it (and I'm trying to be impartial), and the people who've indicated support for that option (apologies if I've misrepresented anybody's opinion, feel free to correct):

1. Leave it as it is. quote is Latin-1 if range(0,256), fallback to UTF-8. unquote is Latin-1.
   In favour: Anybody who doesn't reply to this thread
   Pros: Already implemented; some existing code depends upon ord values of string being the same as they were for byte strings; possible to hack around it.
   Cons: unquote is not inverse of quote; quote behaviour internally-inconsistent; garbage when unquoting UTF-8-encoded URIs.

2. Default to UTF-8.
   In favour: Matt Giuca, Brett Cannon, Jeroen Ruigrok van der Werven
   Pros: Fully working and tested solution is implemented; recommended by RFC 3986 for all future schemes; recommended by W3C for use with HTML; UTF-8 used by all major browsers; supports all characters; most existing code compatible by default; unquote is inverse of quote.
   Cons: By default, URIs may have invalid octet sequences (not possible to reverse).

3. quote default to UTF-8, unquote default to Latin-1.
   In favour: André Malo
   Pros: quote able to handle all characters; unquote able to handle all sequences.
   Cons: unquote is not inverse of quote; totally inconsistent.

4. quote accepts either bytes or str, unquote default to outputting bytes unless given an encoding argument.
   In favour: Bill Janssen
   Pros: Technically does what the spec says, which is treat it as an octet encoding.
   Cons: unquote will break most existing code; almost 100% of the time people will want it as a string.

</impartiality>

I'll just comment on #4 since I haven't already. Let's talk about quote and unquote separately. For quote, I'm all for letting it accept a bytes as well as a str. That doesn't break anything or surprise anyone.

For unquote, I think it will break a lot and surprise everyone. I think that while this may be "purely" the best option, it's pretty silly. I reckon the vast majority of users will be surprised when they see it spitting out a bytes object, and all that most people will do is decode it as UTF-8. Besides, while you're reading the RFCs as "URLs specify a method for encoding octet sequences", I'm reading them as "URLs specify a method for encoding strings, and leave the character encoding unspecified." The second reading supports the idea that unquote outputs a str.

I'm also recommending we add unquote_to_bytes to do what you suggest unquote should do. (So either way we'll get both versions of unquote; I'm just suggesting the one called "unquote" do the thing everybody expects). But that's less of a priority so I want to commit these urgent fixes first.

I'm basically saying just two things:
1. The standards are undefined;
2. Therefore we should pick the most useful and/or intuitive default.

IMHO choosing UTF-8 *is* the most useful AND intuitive, and will be more so in the future when more technologies are hard-coded as UTF-8 (which this RFC recommends they do in the future). I am also quite adamant that unquote be the inverse of quote.

Are there any more opinions on this matter? It would be good to reach a consensus. If anyone seriously wants to push a different alternative to mine, please write a working implementation and attach it to issue 3300.

On the technical side of things, does anybody have time to review my patch for this issue?
http://bugs.python.org/issue3300
Patch 5. It's just a patch for unquote, quote, and small related functions, as well as numerous changes to test cases and documentation.

Cheers
Matt
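To make the difference between options 1 and 2 concrete, here is a minimal sketch. It assumes the quote and unquote functions of the urllib.parse module in current Python 3 releases, where UTF-8 is the default and "encoding" is an optional argument; the patch attached to issue 3300 may differ in detail, and the sample string is purely illustrative.

```python
from urllib.parse import quote, unquote

text = "café"

# UTF-8 default on both sides: unquote is the inverse of quote.
quoted_utf8 = quote(text)
round_trip = unquote(quoted_utf8)

# Optional encoding argument, here Latin-1 on both sides.
quoted_latin1 = quote(text, encoding="latin-1")
round_trip_latin1 = unquote(quoted_latin1, encoding="latin-1")

# Mismatch: UTF-8 octets decoded as Latin-1, the "garbage" con of option 1.
garbled = unquote(quoted_utf8, encoding="latin-1")

print(quoted_utf8)    # caf%C3%A9
print(quoted_latin1)  # caf%E9
print(garbled)        # cafÃ©
```

The round trip only holds when both calls agree on the encoding, which is why the choice of default matters more than the existence of the parameter.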

Arg! Damnit, why do my replies get split off from the main thread? Sorry about any confusion this may be causing.

On Thu, Jul 31, 2008 at 12:11:40AM +1000, Matt Giuca wrote:
2. Default to UTF-8. In favour: Matt Giuca, Brett Cannon, Jeroen Ruigrok van der Werven
Count me too: +1. Most sites I use these days use UTF-8 for URL encoding. Examples:

Wikipedia: http://ru.wikipedia.org/wiki/%D0%93%D0%B2%D0%B8%D0%B4%D0%BE_%D0%B2%D0%B0%D0%...
LingVo (Russian-English dictionary): http://lingvo.yandex.ru/en?text=%D0%BF%D0%B8%D1%82%D0%BE%D0%BD

    print urllib.quote(unicode('питон', 'koi8-r').encode('utf-8'))
    %D0%BF%D0%B8%D1%82%D0%BE%D0%BD
Oleg. -- Oleg Broytmann http://phd.pp.ru/ phd@phd.pp.ru Programmers don't die, they just GOSUB without RETURN.
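For comparison, the LingVo URL above can be decoded with a UTF-8 default in a few lines. This is only a sketch, assuming urlsplit and parse_qs from urllib.parse as they exist in current Python 3 releases; it is not code from the patch under discussion.

```python
from urllib.parse import parse_qs, quote, urlsplit

url = "http://lingvo.yandex.ru/en?text=%D0%BF%D0%B8%D1%82%D0%BE%D0%BD"

# parse_qs percent-decodes the query values, using UTF-8 unless told otherwise.
params = parse_qs(urlsplit(url).query)
word = params["text"][0]

# Quoting the decoded word again reproduces the original query value.
requoted = quote(word)

print(word)      # питон
print(requoted)  # %D0%BF%D0%B8%D1%82%D0%BE%D0%BD
```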

2008/7/30 Matt Giuca <matt.giuca@gmail.com>:
2. Default to UTF-8. In favour: Matt Giuca, Brett Cannon, Jeroen Ruigrok van der Werven Pros: Fully working and tested solution is implemented; recommended by RFC 3986 for all future schemes; recommended by W3C for use with HTML; UTF-8 used by all major browsers; supports all characters; most existing code compatible by default; unquote is inverse of quote. Cons: By default, URIs may have invalid octet sequences (not possible to reverse).
+1, assuming that if you have a different encoding in the URI you can pass it as a parameter. Regards, -- . Facundo Blog: http://www.taniquetil.com.ar/plog/ PyAr: http://www.python.org/ar/

Facundo Batista <facundobatista <at> gmail.com> writes:
2008/7/30 Matt Giuca <matt.giuca <at> gmail.com>:
2. Default to UTF-8. In favour: Matt Giuca, Brett Cannon, Jeroen Ruigrok van der Werven Pros: Fully working and tested solution is implemented; recommended by RFC 3986 for all future schemes; recommended by W3C for use with HTML; UTF-8 used by all major browsers; supports all characters; most existing code compatible by default; unquote is inverse of quote. Cons: By default, URIs may have invalid octet sequences (not possible to reverse).
+1, assuming that if you have a different encoding in the URI you can pass it as a parameter.
+1 for me as well, with an optional encoding parameter to override the default. Also, your "con" is a "pro" to me, since it means errors are reported instead of silently producing garbage (as would be the case with latin1). Regards Antoine.
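A small sketch of that point, assuming unquote from urllib.parse in current Python 3 releases. Note that this unquote uses errors="replace" by default, so strict reporting has to be requested explicitly here; the sample strings are illustrative.

```python
from urllib.parse import unquote

utf8_quoted = "caf%C3%A9"   # "café" percent-encoded from UTF-8 octets
latin1_quoted = "caf%E9"    # "café" percent-encoded from Latin-1 octets

# Latin-1 maps every octet to a character, so wrong input passes silently.
silent_garbage = unquote(utf8_quoted, encoding="latin-1")

# Strict UTF-8 decoding rejects the lone 0xE9 octet instead.
try:
    unquote(latin1_quoted, encoding="utf-8", errors="strict")
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(silent_garbage)  # cafÃ©
print(rejected)        # True
```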

[I was pretty busy these days, so sorry for jumping in late again] * Matt Giuca wrote:
1. Leave it as it is. quote is Latin-1 if range(0,256), fallback to UTF-8. unquote is Latin-1. In favour: Anybody who doesn't reply to this thread Pros: Already implemented; some existing code depends upon ord values of string being the same as they were for byte strings; possible to hack around it. Cons: unquote is not inverse of quote; quote behaviour internally-inconsistent; garbage when unquoting UTF-8-encoded URIs.
2. Default to UTF-8. In favour: Matt Giuca, Brett Cannon, Jeroen Ruigrok van der Werven Pros: Fully working and tested solution is implemented; recommended by RFC 3986 for all future schemes; recommended by W3C for use with HTML; UTF-8 used by all major browsers; supports all characters; most existing code compatible by default; unquote is inverse of quote. Cons: By default, URIs may have invalid octet sequences (not possible to reverse).
Con: URI encoding does not encode characters.
3. quote default to UTF-8, unquote default to Latin-1. In favour: André Malo Pros: quote able to handle all characters; unquote able to handle all sequences. Cons: unquote is not inverse of quote; totally inconsistent.
I'm not in favour of that. I merely answered a question there ;) I'm actually in favour of encoding bytes only back and forth. A useful extension would be *another* function which wraps quote/unquote and encodes and decodes characters.
4. quote accepts either bytes or str, unquote default to outputting bytes unless given an encoding argument. In favour: Bill Janssen Pros: Technically does what the spec says, which is treat it as an octet encoding. Cons: unquote will break most existing code; almost 100% of the time people will want it as a string.
</impartiality>
I'll just comment on #4 since I haven't already. Let's talk about quote and unquote separately. For quote, I'm all for letting it accept a bytes as well as a str. That doesn't break anything or surprise anyone.
For unquote, I think it will break a lot and surprise everyone. I think that while this may be "purely" the best option, it's pretty silly. I reckon the vast majority of users will be surprised when they see it spitting out a bytes object, and all that most people will do is decode it as UTF-8. Besides, while you're reading the RFCs as "URLs specify a method for encoding octet sequences", I'm reading them as "URLs specify a method for encoding strings, and leave the character encoding unspecified." The second reading supports the idea that unquote outputs a str.
I'm also recommending we add unquote_to_bytes to do what you suggest unquote should do. (So either way we'll get both versions of unquote; I'm just suggesting the one called "unquote" do the thing everybody expects). But that's less of a priority so I want to commit these urgent fixes first.
I'm basically saying just two things: 1. The standards are undefined;
That's still disputed...
2. Therefore we should pick the most useful and/or intuitive default. IMHO choosing UTF-8 *is* the most useful AND intuitive, and will be more so in the future when more technologies are hard-coded as UTF-8 (which this RFC recommends they do in the future).
See my suggestion above. nd

On Wed, Jul 30, 2008 at 8:09 AM, André Malo <nd@perlig.de> wrote:
I'm actually in favour of encoding bytes only back and forth. A useful extension would be *another* function which wraps quote/unquote and encodes and decodes characters.
I'd reverse this. By all means, add a new pair of functions that is bytes in / bytes out. But keep the existing functions purely string in / string out, hardcoded to UTF-8. People wanting another encoding can use the bytes functions and explicit encode / decode calls. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
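As a rough sketch of that last sentence: assuming a bytes-level pair like quote_from_bytes and unquote_to_bytes (the names used by urllib.parse in current Python 3 releases), another encoding only needs explicit encode / decode calls around it. The helper names quote_with and unquote_with are hypothetical, made up for this illustration.

```python
from urllib.parse import quote, quote_from_bytes, unquote_to_bytes


def quote_with(text: str, encoding: str) -> str:
    """Hypothetical helper: encode explicitly, then percent-encode the octets."""
    return quote_from_bytes(text.encode(encoding))


def unquote_with(quoted: str, encoding: str) -> str:
    """Hypothetical helper: percent-decode to octets, then decode explicitly."""
    return unquote_to_bytes(quoted).decode(encoding)


word = "питон"
quoted_koi8 = quote_with(word, "koi8-r")
restored = unquote_with(quoted_koi8, "koi8-r")

# The string-level function stays UTF-8 and gives a different result.
quoted_utf8 = quote(word)
print(restored == word, quoted_koi8 != quoted_utf8)
```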

On Wed, Jul 30, 2008 at 8:09 AM, André Malo <nd@perlig.de> wrote:
I'm actually in favour of encoding bytes only back and forth. A useful extension would be *another* function which wraps quote/unquote and encodes and decodes characters.
I'd reverse this. By all means, add a new pair of functions that is bytes in / bytes out. But keep the existing functions purely string in / string out, hardcoded to UTF-8. People wanting another encoding can use the bytes functions and explicit encode / decode calls.
Actually (as I pointed out before) the existing functions are not string-in/string-out. They are something-in and bytes-out. They just look like string-in/string-out because of the confusion between byte strings and Unicode strings in Python 1 and 2.

Look, Matt's suggestion is a degradation of the integrity of the stdlib, because it enthrones a broken understanding, a misreading of the RFC, in a very prominent place. I'd prefer not to have Python contribute to that breakage. Keep the functions the way they are now: bytes-in and bytes-out.

Bill

Actually (as I pointed out before) the existing functions are not string-in/string-out. They are something-in and bytes-out.
Sorry, this is wrong. "quote" is clearly bytes-in and string-out. "unquote" is clearly string-in and bytes-out. The whole point of "quote" is to take an arbitrary sequence of bytes and represent them as an ASCII string, while unquote reverses this process. Again, I urge everyone participating in this discussion to read RFC 3986. We're not creating in a vacuum here; we're talking about implementation of a standard. Bill
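The octet-level reading can be sketched as follows, assuming quote_from_bytes and unquote_to_bytes from urllib.parse in current Python 3 releases. The input is deliberately not valid text in any particular encoding.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Arbitrary octets: 0xFF is not valid UTF-8, 0x00 is NUL, 0x2F is "/", 0x41 is "A".
raw = bytes([0xFF, 0x00, 0x2F, 0x41])

# bytes in, ASCII str out; safe="" so that "/" is percent-encoded too.
quoted = quote_from_bytes(raw, safe="")

# str in, bytes out: the exact octets come back, no character decoding involved.
restored = unquote_to_bytes(quoted)

print(quoted)           # %FF%00%2FA
print(restored == raw)  # True
```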
participants (7)
- André Malo
- Antoine Pitrou
- Bill Janssen
- Facundo Batista
- Guido van Rossum
- Matt Giuca
- Oleg Broytmann