[Python-Dev] urllib.quote and unquote - Unicode issues
janssen at parc.com
Sat Jul 12 23:07:09 CEST 2008
> Basically, urllib.quote and unquote seem not to have been updated since
> Python 2.5, and because of this they implicitly perform Latin-1 encoding and
> decoding (with respect to percent-encoded characters). I think they should
> default to UTF-8 for a number of reasons, including that's what other
> software such as web browsers use.
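To make the difference concrete, here is a small sketch in present-day Python 3, whose urllib.parse module postdates this thread and is used here only as an illustration. It shows how the same character quotes differently under the two encodings, and what happens when UTF-8 escapes are decoded as Latin-1.

```python
from urllib.parse import quote, unquote

e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# The same character yields different percent-escapes per encoding.
as_latin1 = quote(e_acute, encoding="latin-1")
as_utf8 = quote(e_acute, encoding="utf-8")
print(as_latin1)  # %E9
print(as_utf8)    # %C3%A9

# Decoding UTF-8 escapes as Latin-1 silently produces two wrong characters.
mojibake = unquote(as_utf8, encoding="latin-1")
print(len(mojibake))  # 2
```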
The standard here is RFC 3986, from Jan 2005, which says,
``When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be
percent-encoded.''
The "unreserved set" consists of the following ASCII characters:
``Characters that are allowed in a URI but do not have a reserved
purpose are called unreserved. These include uppercase and lowercase
letters, decimal digits, hyphen, period, underscore, and tilde.

    unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"''
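The two quoted rules can be combined into a minimal sketch: encode the text as UTF-8, then percent-encode every octet that is not in the unreserved set. The names UNRESERVED and percent_encode are hypothetical, chosen for illustration. This is not the urllib implementation.

```python
import string

# RFC 3986: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
# Stored as a set of byte values so octets can be tested directly.
UNRESERVED = frozenset(
    (string.ascii_letters + string.digits + "-._~").encode("ascii")
)

def percent_encode(text):
    """Hypothetical helper: UTF-8 encode, then escape non-unreserved octets."""
    octets = text.encode("utf-8")
    return "".join(
        chr(b) if b in UNRESERVED else "%{:02X}".format(b)
        for b in octets
    )

print(percent_encode("a b"))     # a%20b
print(percent_encode("\u30a2"))  # %E3%82%A2 (KATAKANA LETTER A)
```

Note that this sketch escapes everything outside the unreserved set, including "/", which a practical quote function would usually let the caller mark as safe.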
There are a few other wrinkles; it's worth reading section 2.5 of the RFC.
I'd say, treat the incoming data as either Unicode (if it's a Unicode
string), or some unknown superset of ASCII (which includes both
Latin-1 and UTF-8) if it's a byte-string (and thus in some unknown
encoding), and apply the appropriate transformation.
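One possible reading of that suggestion, sketched in Python 3 terms (str for Unicode strings, bytes for byte-strings): Unicode input is encoded as UTF-8 first, while byte-string input is passed through octet by octet, with no attempt to guess its encoding. The name quote_any is hypothetical, and the choice to escape non-ASCII bytes as-is is an assumption about what "the appropriate transformation" means.

```python
import string

# Unreserved characters from RFC 3986, as a set of byte values.
UNRESERVED = frozenset(
    (string.ascii_letters + string.digits + "-._~").encode("ascii")
)

def quote_any(data):
    """Hypothetical sketch: quote either a Unicode string or a byte-string."""
    if isinstance(data, str):
        # Unicode text: the RFC says to encode as UTF-8 first.
        octets = data.encode("utf-8")
    elif isinstance(data, (bytes, bytearray)):
        # Unknown superset of ASCII: keep the octets exactly as given.
        octets = bytes(data)
    else:
        raise TypeError("expected str or bytes, got %r" % type(data))
    return "".join(
        chr(b) if b in UNRESERVED else "%{:02X}".format(b)
        for b in octets
    )

print(quote_any("\u00e9"))  # %C3%A9 (Unicode input, UTF-8 encoded)
print(quote_any(b"\xe9"))   # %E9    (raw byte, encoding unknown)
```

Pure-ASCII input gives the same result either way, which is what makes the "unknown superset of ASCII" treatment safe for byte-strings.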