[Baypiggies] urllib.urlencode and encoding

Tung Wai Yip tungwaiyip at yahoo.com
Thu Apr 19 19:38:16 CEST 2007


> On Apr 18, 2007, at 5:08 PM, Keith Dart ♂ wrote:
>> >    When a new URI scheme defines a component that represents textual
>> >    data consisting of characters from the Universal Character Set [UCS],
>> >    the data should first be encoded as octets according to the UTF-8
>> >    character encoding [STD63]; then only those octets that do not
>> >    correspond to characters in the unreserved set should be percent-
>> >    encoded.  For example, the character A would be represented as "A",
>> >    the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
>> >    as "%C3%80", and the character KATAKANA LETTER A would be represented
>> >    as "%E3%82%A2".
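
The three examples in the quoted RFC text can be reproduced with a short
sketch. This assumes a current Python 3, where the quoting functions live in
urllib.parse rather than in the urllib module this thread is about; the
helper name percent_encode_utf8 is made up for illustration.

```python
from urllib.parse import quote


def percent_encode_utf8(text):
    """Encode text as UTF-8 octets, then percent-encode the octets
    that are not in the unreserved set (hypothetical helper)."""
    # safe="" so that "/" is not exempted from percent-encoding.
    return quote(text, safe="", encoding="utf-8")


# The three characters named in the RFC excerpt.
for ch in ("A", "\u00c0", "\u30a2"):
    print(repr(ch), "->", percent_encode_utf8(ch))
```

The unreserved letter A passes through unchanged, while the other two
characters become the multi-octet sequences given in the RFC.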

Thanks, Keith, for the heads-up. One problem I regularly have is tracking
down the lineage of RFCs. When I find RFC X, I am often not aware of an
RFC Y that supersedes it. It doesn't help that, historically, many documents
point to RFC X, while RFC X itself has no link to RFC Y. Try following the
link at the bottom of the urlparse module documentation. It does not lead
to RFC 3986.

   http://docs.python.org/lib/module-urlparse.html


On Wed, 18 Apr 2007 21:15:34 -0700, David Reid <dreid at dreid.org> wrote:
> The key piece of information here is "When a new URI scheme": the RFC
> (AFAICT) makes no mention of what to do about old schemes, such as
> HTTP.  In fact the HTML4 spec makes its own claims about %-encoding
> as a result of form submission:
>
> http://www.w3.org/TR/html4/interact/forms.html
>
>      accept-charset = charset list [CI]
>          This attribute specifies the list of character encodings for
> input data that is accepted by the server processing this form. The
> value is a space- and/or comma-delimited list of charset values. The
> client must interpret this list as an exclusive-or list, i.e., the
> server is able to accept any single character encoding per entity
> received.
> The default value for this attribute is the reserved string
> "UNKNOWN". User agents may interpret this value as the character
> encoding that was used to transmit the document containing this FORM
> element.
>
> So I think it's still incorrect for urllib to make any such
> assumptions as to the data being UTF-8. (Though I hope it won't be in
> the future.)
>
> - -David
> http://dreid.org

I think RFC 3986 says a character should be encoded in UTF-8 only if it is
from the UCS. But it is also legitimate to use another character encoding,
for example as in the HTML4 spec David has pointed out. Say you are writing
a screen scraper for a Japanese website: you should use the character
encoding the website expects, which is not necessarily UTF-8.
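
To make that concrete, here is a minimal sketch of encoding the same form
field for two different target encodings. It assumes Python 3's
urllib.parse.urlencode, which takes an encoding argument; the field name
"q", the helper encode_form, and the choice of Shift_JIS as the site's
encoding are all made up for illustration.

```python
from urllib.parse import urlencode, unquote


def encode_form(fields, charset):
    """Percent-encode form fields using the charset the target
    site expects (hypothetical helper, not part of urllib)."""
    return urlencode(fields, encoding=charset)


fields = {"q": "\u30a2"}  # KATAKANA LETTER A

as_utf8 = encode_form(fields, "utf-8")
as_sjis = encode_form(fields, "shift_jis")
print(as_utf8)
print(as_sjis)

# Decoding only works if the receiver assumes the same charset
# that the sender used.
value = as_sjis.split("=", 1)[1]
print(unquote(value, encoding="shift_jis") == "\u30a2")
```

The two query strings differ even though the text is the same, so the
choice of charset has to come from whoever calls the function, not from
the library.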

Wai Yip

