urllib.unquote and unicode

Tue Dec 19 15:50:06 EST 2006

Duncan Booth wrote:
> The way that uri encoding is supposed to work is that first the input
> string in unicode is encoded to UTF-8 and then each byte which is not in
> the permitted range for characters is encoded as % followed by two hex
> characters. 

Can you back up this claim ("is supposed to work") by reference to
a specification (ideally, chapter and verse)?

In URIs, the encoding of non-ASCII characters is entirely unspecified,
as is whether % escapes denote characters in the first place.
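
To illustrate, here is a minimal sketch, assuming Python 3's
urllib.parse (not the Python 2 urllib.unquote from the subject line).
The escapes only yield octets; which characters they denote depends on
a charset that the URI itself does not name:

```python
from urllib.parse import unquote_to_bytes

# Percent-decoding gives octets, nothing more.
raw = unquote_to_bytes("%C3%80")      # b'\xc3\x80'

# The same two octets read under two different (assumed) charsets:
as_utf8 = raw.decode("utf-8")         # one character, U+00C0
as_latin1 = raw.decode("latin-1")     # two characters, U+00C3 U+0080
```

Neither reading is wrong as far as the URI syntax is concerned.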

> Unfortunately RFC3986 isn't entirely clear-cut on this issue:
> 
>>    When a new URI scheme defines a component that represents textual
>>    data consisting of characters from the Universal Character Set [UCS],
>>    the data should first be encoded as octets according to the UTF-8
>>    character encoding [STD63]; then only those octets that do not
>>    correspond to characters in the unreserved set should be percent-
>>    encoded.  For example, the character A would be represented as "A",
>>    the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
>>    as "%C3%80", and the character KATAKANA LETTER A would be represented
>>    as "%E3%82%A2".

This is irrelevant; it talks about new URI schemes only.
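
For reference, the rule in the quoted paragraph (UTF-8 first, then
percent-encode) can be sketched as follows, assuming Python 3's
urllib.parse.quote, which uses UTF-8 unless told otherwise. It
reproduces the RFC's three examples, but says nothing about what an
existing scheme such as http does:

```python
from urllib.parse import quote

# Encode to UTF-8, then percent-encode every octet outside the safe set.
latin_a = quote("A")                 # unreserved, left as-is
a_grave = quote("\u00c0")            # LATIN CAPITAL LETTER A WITH GRAVE
katakana_a = quote("\u30a2")         # KATAKANA LETTER A
```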

> I think it leaves open the possibility that existing URI schemes which do 
> not support unicode characters can use other encodings, but given that the 
> original posting started by decoding a unicode string I think that utf-8 
> should definitely be assumed in this case.

No, the http scheme is defined by RFC 2616 instead. It doesn't really
talk about encodings, but hints at an interpretation in 3.2.3:

# When comparing two URIs to decide if they match or not, a client
# SHOULD use a case-sensitive octet-by-octet comparison of the entire
# URIs, [...]
# Characters other than those in the "reserved" and "unsafe" sets (see
# RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.
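
A rough sketch of such a comparison follows. The helper names are
hypothetical, and as a conservative assumption it treats only the
RFC 3986 "unreserved" characters as equivalent to their escapes,
rather than the exact RFC 2396 sets the text refers to. All other
escapes are compared as they stand:

```python
import re
import string

# Assumption: RFC 3986 "unreserved" characters, used as a conservative
# stand-in for "not reserved and not unsafe".
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def normalize_escapes(uri):
    """Replace %HH by the character only when that character is unreserved."""
    def fix(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else match.group(0)
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, uri)

def same_uri(a, b):
    """Octet-by-octet comparison after normalizing unreserved escapes."""
    return normalize_escapes(a).encode("ascii") == normalize_escapes(b).encode("ascii")
```

With this, same_uri("http://example.com/%7Esmith",
"http://example.com/~smith") is true, while "http://example.com/%C0"
and "http://example.com/%C3%80" compare as different URIs: the
comparison never needs to know what characters those octets stand for.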

Now, RFC 2396 already says that URIs are sequences of characters,
not sequences of octets, yet RFC 2616 fails to recognize that issue
and refuses to specify a character set for its scheme (which
RFC 2396 says it could).

The conventional wisdom is that the choice of URI encoding for HTTP
is a server-side decision; for that reason, IRIs were introduced.

Regards,
Martin