urllib.unquote and unicode

Wed Dec 20 04:12:17 EST 2006

"Martin v. Löwis" <martin at v.loewis.de> wrote:

> Duncan Booth schrieb:
>> The way that uri encoding is supposed to work is that first the input
>> string in unicode is encoded to UTF-8 and then each byte which is not
>> in the permitted range for characters is encoded as % followed by two
>> hex characters. 
> 
> Can you back up this claim ("is supposed to work") by reference to
> a specification (ideally, chapter and verse)?

I'm not sure I have time to read the various RFCs in depth right now,
so I may have to come back to this thread later. The one thing I'm
convinced of is that the current implementations of urllib.quote and
urllib.unquote are broken with respect to their handling of unicode. In
particular, % encoding is defined in terms of octets, so when given a
unicode string urllib.quote should either encode it or throw a suitable
exception (not KeyError, which is what it seems to throw now).
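
A minimal sketch of the "encode first, then escape each octet" behaviour
described above. It is written for Python 3, not the 2.x urllib under
discussion; quote_octets and UNRESERVED are hypothetical names, the
unreserved set is the one from RFC 3986, and UTF-8 as the default encoding
is an assumption.

```python
# Hypothetical helper, NOT the standard library implementation.
# Unreserved characters per RFC 3986; iterating bytes yields ints.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)


def quote_octets(value, encoding="utf-8"):
    """Percent-encode value; text is turned into octets first."""
    if isinstance(value, str):
        # Assumption: the caller's choice of encoding (UTF-8 by default).
        octets = value.encode(encoding)
    elif isinstance(value, (bytes, bytearray)):
        octets = bytes(value)
    else:
        # A "suitable exception" instead of an accidental KeyError.
        raise TypeError(
            "expected text or bytes, got %s" % type(value).__name__
        )
    # Escape every octet outside the unreserved set as %XX.
    return "".join(
        chr(b) if b in UNRESERVED else "%%%02X" % b
        for b in octets
    )
```

With this sketch, quote_octets("\u00a3") gives "%C2%A3", while
quote_octets("\u00a3", "latin-1") gives "%A3".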

My objection to urllib.unquote is that urllib.unquote(u'%a3') returns
u'\xa3', which is a character, not an octet. I think it should always return
a byte string, or it should calculate a byte string and then decode it
according to some suitable encoding, or it should throw an exception
[choose any of the above].
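
The alternatives listed above can be sketched as follows. This uses
unquote_to_bytes from Python 3's urllib.parse, which is not part of the
2.x urllib under discussion; unquote_text is a hypothetical helper name.

```python
from urllib.parse import unquote_to_bytes


def unquote_text(value, encoding):
    """Hypothetical helper: unescape to octets, then decode strictly."""
    # Step 1: %XX escapes become raw octets (a byte string).
    octets = unquote_to_bytes(value)
    # Step 2: decode with an explicit encoding; bytes that are invalid
    # in that encoding raise UnicodeDecodeError instead of being guessed.
    return octets.decode(encoding)


# Option 1: always return a byte string.
raw = unquote_to_bytes("%a3")

# Option 2: decode the octets according to a stated encoding.
as_latin1 = unquote_text("%a3", "latin-1")

# Option 3: an exception when the octets do not fit the encoding.
try:
    unquote_text("%a3", "utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
```

Here raw is the single octet b"\xa3", as_latin1 is the character U+00A3,
and the UTF-8 attempt fails because a lone 0xA3 byte is not valid UTF-8.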

Adding an optional encoding parameter to quote/unquote would be one option,
although since you can encode/decode the parameter yourself it doesn't add
much.
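
For illustration, Python 3's urllib.parse does take such an encoding
parameter (this is not the 2.x API discussed here), and a short check
suggests it is equivalent to encoding or decoding by hand. Latin-1 is
chosen only as an example.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

text = "\u00a3"  # POUND SIGN

# Quoting: the encoding parameter versus encoding the argument yourself.
with_param = quote(text, encoding="latin-1")
by_hand = quote(text.encode("latin-1"))

# Unquoting: the encoding parameter versus decoding the octets yourself.
back_with_param = unquote(with_param, encoding="latin-1")
back_by_hand = unquote_to_bytes(by_hand).decode("latin-1")
```

Both quoting routes produce the same escaped string, and both unquoting
routes recover the original text.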

> No, the http scheme is defined by RFC 2616 instead. It doesn't really
> talk about encodings, but hints an interpretation in 3.2.3:

The applicable RFC is 3986. See RFC2616 section 3.2.1:
> For definitive information on URL syntax and semantics, see "Uniform 
> Resource Identifiers (URI):
> Generic Syntax and Semantics," RFC 2396 [42] (which replaces RFCs
> 1738 [4] and RFC 1808 [11]).

and RFC 2396:
> Obsoleted by: 3986

> Now, RFC 2396 already says that URIs are sequences of characters,
> not sequences of octets, yet RFC 2616 fails to recognize that issue
> and refuses to specify a character set for its scheme (which
> RFC 2396 says that it could).

and RFC 2277, section 3.1, says that it MUST identify which charset is used
(although that's just a best-practice document, not a standard). (The block
capitals are the RFC's, not mine.)

> The conventional wisdom is that the choice of URI encoding for HTTP
> is a server-side decision; for that reason, IRIs were introduced.

Yes, I know that in practice some systems use other character sets.