[Python-Dev] urllib.quote and unquote - Unicode issues

Bill Janssen janssen at parc.com
Sat Jul 12 23:07:09 CEST 2008


> Basically, urllib.quote and unquote seem not to have been updated since
> Python 2.5, and because of this they implicitly perform Latin-1 encoding and
> decoding (with respect to percent-encoded characters). I think they should
> default to UTF-8 for a number of reasons, including that's what other
> software such as web browsers use.

The standard here is RFC 3986, from Jan 2005, which says,

  ``When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be
percent-encoded.''

The "unreserved set" consists of the following ASCII characters:

  ``Characters that are allowed in a URI but do not have a reserved
purpose are called unreserved.  These include uppercase and lowercase
letters, decimal digits, hyphen, period, underscore, and tilde.

   unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
''
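
As a rough sketch of that rule in Python 3 (the helper name percent_encode is made up for illustration and is not part of urllib): encode the text as UTF-8 first, then escape every octet that is not in the unreserved set.

```python
import string

# Unreserved characters per RFC 3986 section 2.3:
# ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(
    (string.ascii_letters + string.digits + "-._~").encode("ascii")
)


def percent_encode(text):
    """Hypothetical helper: UTF-8 encode, then escape non-unreserved octets."""
    octets = text.encode("utf-8")
    # Iterating over bytes yields ints; unreserved octets pass through,
    # everything else becomes %XX with uppercase hex digits.
    return "".join(
        chr(octet) if octet in UNRESERVED else "%{:02X}".format(octet)
        for octet in octets
    )
```

Under this sketch, percent_encode("é") gives "%C3%A9" (the two UTF-8 octets), not the single "%E9" that a Latin-1 encoding would produce. Note that it escapes reserved characters such as "/" too, which a real quote function would normally let the caller mark as safe.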

There are a few other wrinkles; section 2.5 of the RFC is worth reading
carefully.

I'd say: if the incoming data is a Unicode string, treat it as
Unicode; if it's a byte-string, and thus in some unknown encoding,
treat it as some unknown superset of ASCII (which covers both Latin-1
and UTF-8). Then apply the appropriate transformation in each case.
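
One possible reading of that, sketched in Python 3 (quote_any is a hypothetical name, and the handling of byte-strings is my assumption about what "appropriate" means here): Unicode strings are encoded to UTF-8 first, while byte-strings are escaped octet by octet without guessing at their encoding.

```python
import string

# Unreserved ASCII characters from RFC 3986 section 2.3.
UNRESERVED = frozenset(
    (string.ascii_letters + string.digits + "-._~").encode("ascii")
)


def quote_any(data):
    """Hypothetical helper: percent-encode a str or a bytes object."""
    if isinstance(data, str):
        # Unicode text: the RFC says to go through UTF-8.
        octets = data.encode("utf-8")
    else:
        # Byte-string in an unknown ASCII superset: keep the octets as given.
        octets = bytes(data)
    # Only unreserved ASCII octets survive unescaped; every other octet,
    # including all non-ASCII ones, is written as %XX.
    return "".join(
        chr(octet) if octet in UNRESERVED else "%{:02X}".format(octet)
        for octet in octets
    )
```

So quote_any("é") yields "%C3%A9", whereas quote_any(b"\xe9") yields "%E9", because the caller's Latin-1 octet is passed through rather than reinterpreted.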

Bill


