urllib.unquote and unicode
"Martin v. Löwis"
martin at v.loewis.de
Tue Dec 19 15:50:06 EST 2006
Duncan Booth wrote:
> The way that uri encoding is supposed to work is that first the input
> string in unicode is encoded to UTF-8 and then each byte which is not in
> the permitted range for characters is encoded as % followed by two hex
> characters.
Can you back up this claim ("is supposed to work") by reference to
a specification (ideally, chapter and verse)?
In URIs, the encoding of non-ASCII characters is entirely unspecified,
as is whether % escapes denote characters in the first place.
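To illustrate the ambiguity, here is a minimal sketch, assuming Python 3's urllib.parse rather than the Python 2 urllib.unquote named in the subject line. The variable names are made up for the example. The same escape sequence yields different text depending on which encoding the reader assumes:

```python
from urllib.parse import unquote_to_bytes

# Percent-escapes decode unambiguously only to octets.
escaped = "%C3%80"
raw = unquote_to_bytes(escaped)

# Turning those octets into characters requires choosing an encoding,
# and nothing in the escaped string itself says which one to choose.
as_utf8 = raw.decode("utf-8")      # one character: U+00C0
as_latin1 = raw.decode("latin-1")  # two characters: U+00C3, U+0080

print(repr(raw), repr(as_utf8), repr(as_latin1))
```

Both decodings succeed without error, so the octets alone cannot tell a reader which interpretation was intended.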
> Unfortunately RFC3986 isn't entirely clear-cut on this issue:
>
>> When a new URI scheme defines a component that represents textual
>> data consisting of characters from the Universal Character Set [UCS],
>> the data should first be encoded as octets according to the UTF-8
>> character encoding [STD63]; then only those octets that do not
>> correspond to characters in the unreserved set should be percent-
>> encoded. For example, the character A would be represented as "A",
>> the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
>> as "%C3%80", and the character KATAKANA LETTER A would be represented
>> as "%E3%82%A2".
This is irrelevant; it talks about new URI schemes only.
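For reference, the three examples in the quoted RFC 3986 passage can be reproduced with a short sketch, again assuming Python 3's urllib.parse. The helper name percent_encode_utf8 is hypothetical. It shows only what the RFC recommends for new schemes, not what any existing scheme requires:

```python
from urllib.parse import quote

def percent_encode_utf8(text):
    """Encode text as UTF-8 octets, then percent-encode every octet
    that is not an unreserved character (hypothetical helper)."""
    return quote(text.encode("utf-8"), safe="")

for ch in ("A", "\u00c0", "\u30a2"):
    # LATIN CAPITAL LETTER A, A WITH GRAVE, KATAKANA LETTER A
    print(repr(ch), "->", percent_encode_utf8(ch))
```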
> I think it leaves open the possibility that existing URI schemes which do
> not support unicode characters can use other encodings, but given that the
> original posting started by decoding a unicode string I think that utf-8
> should definitely be assumed in this case.
No, the http scheme is defined by RFC 2616 instead. It doesn't really
talk about encodings, but hints at an interpretation in section 3.2.3:
# When comparing two URIs to decide if they match or not, a client
# SHOULD use a case-sensitive octet-by-octet comparison of the entire
# URIs, [...]
# Characters other than those in the "reserved" and "unsafe" sets (see
# RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.
Now, RFC 2396 already says that URIs are sequences of characters,
not sequences of octets, yet RFC 2616 fails to recognize that issue
and refuses to specify a character set for its scheme (which
RFC 2396 says it could do).
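A rough sketch of the octet-level view in the RFC 2616 passage, assuming Python 3's urllib.parse and a hypothetical helper same_octets. It deliberately ignores the exception for "reserved" and "unsafe" characters, so it is an illustration, not a conforming URI comparison:

```python
from urllib.parse import unquote_to_bytes

def same_octets(a, b):
    """Compare two URI paths after expanding every %HH escape to its
    octet (hypothetical helper; ignores the reserved/unsafe exception)."""
    return unquote_to_bytes(a) == unquote_to_bytes(b)

# "~" and "%7E" denote the same octet, so these paths match.
tilde_match = same_octets("/~smith/home.html", "/%7Esmith/home.html")

# UTF-8 and Latin-1 spellings of the same letter are different octets,
# so an octet-by-octet comparison treats them as different URIs.
charset_match = same_octets("/%C3%80", "/%C0")

print(tilde_match, charset_match)
```

Because the comparison works on octets, it never has to decide which character set the escapes were written in.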
The conventional wisdom is that the choice of URI encoding for HTTP
is a server-side decision; for that reason, IRIs were introduced.
Regards,
Martin