[Baypiggies] urllib.urlencode and encoding

Keith Dart ♂ keith at dartworks.biz
Thu Apr 19 02:08:13 CEST 2007

Tung Wai Yip wrote the following on 2007-04-18 at 16:51 PDT:
> urllib.urlencode() cannot encode a unicode string by itself. RFC 2396 does
> not take Unicode into consideration, so there is no rule on what to do with
> Unicode in a URI. It is up to the application to decide on the encoding,
> e.g. UTF-8 first, URL encoding next. Others might very well choose to use
> UTF-16 instead.


Nope, RFC 2396 has been obsoleted. See RFC 3986:

Network Working Group                                     T. Berners-Lee
Request for Comments: 3986                                       W3C/MIT
STD: 66                                                      R. Fielding
Updates: 1738                                               Day Software
Obsoletes: 2732, *2396*, 1808                                

Section 2.5:

   When a new URI scheme defines a component that represents textual
   data consisting of characters from the Universal Character Set [UCS],
   the data should first be encoded as octets according to the UTF-8
   character encoding [STD63]; then only those octets that do not
   correspond to characters in the unreserved set should be percent-
   encoded.  For example, the character A would be represented as "A",
   the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
   as "%C3%80", and the character KATAKANA LETTER A would be represented
   as "%E3%82%A2".
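
As a rough sketch of that two-step rule (UTF-8 first, then percent-encoding),
something like the following should reproduce the three examples from the RFC.
Note the assumptions: this uses the Python 3 urllib.parse module rather than
the Python 2 urllib.urlencode discussed in this thread, and the helper name
percent_encode_utf8 is made up purely for illustration.

```python
from urllib.parse import quote, urlencode


def percent_encode_utf8(text):
    """Hypothetical helper: apply the RFC 3986 section 2.5 recipe to a string."""
    # Step 1: encode the characters as UTF-8 octets.
    octets = text.encode("utf-8")
    # Step 2: percent-encode the octets that are not unreserved characters.
    # safe="" means "/" is escaped too (quote() leaves it alone by default).
    return quote(octets, safe="")


print(percent_encode_utf8("A"))       # the character A
print(percent_encode_utf8("\u00c0"))  # LATIN CAPITAL LETTER A WITH GRAVE
print(percent_encode_utf8("\u30a2"))  # KATAKANA LETTER A

# In Python 3, urlencode() defaults to UTF-8 for str values.
print(urlencode({"q": "\u00c0"}))
```

In Python 2 the equivalent would presumably be to encode each unicode value
to a UTF-8 byte string yourself before handing it to urllib.urlencode().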

-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   Keith Dart <keith at dartworks.biz>
   public key: ID: 19017044
