[Python-Dev] urllib.quote and unquote - Unicode issues

Wed Jul 30 17:09:51 CEST 2008

[I've been pretty busy lately, so sorry for jumping in late again]

* Matt Giuca wrote: 

> 1. Leave it as it is. quote is Latin-1 if range(0,256), fallback to
> UTF-8. unquote is Latin-1.
> In favour: Anybody who doesn't reply to this thread
> Pros: Already implemented; some existing code depends upon ord values
> of string being the same as they were for byte strings; possible to
> hack around it.
> Cons: unquote is not inverse of quote; quote behaviour
> internally-inconsistent; garbage when unquoting UTF-8-encoded URIs.
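
The cons listed for option 1 can be reproduced with today's urllib.parse
by passing the encodings explicitly. This is only a rough sketch:
quote_option1 and unquote_option1 are hypothetical helpers that simulate
the behaviour described above, not the actual implementation under
discussion.

```python
from urllib.parse import quote, unquote


def quote_option1(s):
    """Hypothetical: Latin-1 if every character fits, else fall back to UTF-8."""
    try:
        return quote(s, encoding="latin-1", errors="strict")
    except UnicodeEncodeError:
        return quote(s, encoding="utf-8")


def unquote_option1(s):
    """Hypothetical: always decode the octets as Latin-1."""
    return unquote(s, encoding="latin-1")


# Same function, two different encodings depending on the input.
latin1_quoted = quote_option1("é")   # one octet, Latin-1
utf8_quoted = quote_option1("€")     # three octets, UTF-8

# unquote is not the inverse of quote once the UTF-8 fallback kicks in.
not_inverse = unquote_option1(utf8_quoted) != "€"

# A UTF-8-encoded URI from elsewhere comes back as mojibake.
garbage = unquote_option1("%C3%A9")
```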

> 2. Default to UTF-8.
> In favour: Matt Giuca, Brett Cannon, Jeroen Ruigrok van der Werven
> Pros: Fully working and tested solution is implemented; recommended by
> RFC 3986 for all future schemes; recommended by W3C for use with HTML;
> UTF-8 used by all major browsers; supports all characters; most
> existing code compatible by default; unquote is inverse of quote.
> Cons: By default, URIs may have invalid octet sequences (not possible
> to reverse).

Con: URI encoding does not encode characters; it encodes octets.
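
To make that concrete, here is a small sketch using the functions as they
exist in today's urllib.parse (an assumption relative to this thread,
where the API was still being decided). A percent-encoded sequence that
is not valid UTF-8 survives a round trip through bytes, but not through
a UTF-8 decoded str:

```python
from urllib.parse import quote, quote_from_bytes, unquote, unquote_to_bytes

raw = "%FF%FE"  # well-formed percent-encoding, but not valid UTF-8

# Character route: the default is UTF-8 with errors="replace", so the
# invalid octets are replaced by U+FFFD and the original is lost.
as_text = unquote(raw)
round_trip_text = quote(as_text)

# Octet route: nothing is interpreted, so nothing is lost.
as_bytes = unquote_to_bytes(raw)
round_trip_bytes = quote_from_bytes(as_bytes)

print(round_trip_text == raw)   # False
print(round_trip_bytes == raw)  # True
```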

>
> 3. quote default to UTF-8, unquote default to Latin-1.
> In favour: André Malo
> Pros: quote able to handle all characters; unquote able to handle all
> sequences.
> Cons: unquote is not inverse of quote; totally inconsistent.

I'm not in favour of that. I merely answered a question there ;)

I'm actually in favour of quote/unquote converting bytes only, in both
directions. A useful extension would be *another* function that wraps
quote/unquote and takes care of encoding and decoding characters.
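
Roughly what I have in mind, sketched with the byte-level functions that
today's urllib.parse provides (quote_from_bytes and unquote_to_bytes).
The wrapper names quote_text and unquote_text are hypothetical, and the
UTF-8 default is just one possible choice:

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def quote_text(text, encoding="utf-8", safe="/"):
    """Hypothetical wrapper: characters -> octets -> percent-encoding."""
    return quote_from_bytes(text.encode(encoding), safe=safe)


def unquote_text(quoted, encoding="utf-8"):
    """Hypothetical wrapper: percent-encoding -> octets -> characters.

    Decoding is strict, so octets that are invalid in the chosen
    encoding raise UnicodeDecodeError instead of being silently replaced.
    """
    return unquote_to_bytes(quoted).decode(encoding)


quoted = quote_text("naïve/path")
restored = unquote_text(quoted)
```

The byte-level pair stays lossless for any input, and the caller who
wants characters has to name (or accept) an encoding explicitly.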

> 4. quote accepts either bytes or str, unquote default to outputting
> bytes unless given an encoding argument.
> In favour: Bill Janssen
> Pros: Technically does what the spec says, which is treat it as an
> octet encoding.
> Cons: unquote will break most existing code; almost 100% of the time
> people will want it as a string.
>
> </impartiality>
>
> I'll just comment on #4 since I haven't already. Let's talk about
> quote and unquote separately. For quote, I'm all for letting it accept
> a bytes as well as a str. That doesn't break anything or surprise
> anyone.
>
> For unquote, I think it will break a lot and surprise everyone. I
> think that while this may be "purely" the best option, it's pretty
> silly. I reckon the vast majority of users will be surprised when they
> see it spitting out a bytes object, and all that most people will do
> is decode it as UTF-8. Besides, while you're reading the RFCs as "URLs
> specify a method for encoding octet sequences", I'm reading them as
> "URLs specify a method for encoding strings, and leave the character
> encoding unspecified." The second reading supports the idea that
> unquote outputs a str.
>
> I'm also recommending we add unquote_to_bytes to do what you suggest
> unquote should do. (So either way we'll get both versions of unquote;
> I'm just suggesting the one called "unquote" do the thing everybody
> expects). But that's less of a priority so I want to commit these
> urgent fixes first.
>
> I'm basically saying just two things: 1. The standards are undefined;

That's still disputed...

> 2. Therefore we should pick the most useful and/or intuitive default.
> IMHO choosing UTF-8 *is* the most useful AND intuitive, and will be
> more so in the future when more technologies are hard-coded as UTF-8
> (which this RFC recommends they do in the future).

See my suggestion above.

nd