[issue5468] urlencode does not handle "bytes", and could easily handle alternate encodings

Thu Mar 26 23:27:23 CET 2009

Dan Mahn <dan.mahn at digidescorp.com> added the comment:

Hello.  Thanks for the feedback.

With regard to RFC 2396, I see this:

http://www.ietf.org/rfc/rfc2396.txt

====
There is a second translation for some resources: the sequence of
    octets defined by a component of the URI is subsequently used to
    represent a sequence of characters. A 'charset' defines this mapping.
    There are many charsets in use in Internet protocols. For example,
    UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences
    of characters in the repertoire of ISO 10646.
====

To me, that text does not indicate that URLs are always encoded in 
UTF-8.  It indicates that URL information may be encoded in character 
sets ('charset') other than ASCII, and when it is, the values must be 
sent as escaped values.  Here, I note the specific words "many charsets 
in use" and "For example", before the reference to UTF-8.

I have also done a few tests, and have found that in practice, browsers 
do not always encode URLs as UTF-8.  The encoding seems to depend on 
which part of the URL is being encoded.  For instance, my Firefox will 
encode the path portion of a URL as UTF-8, but encode the query string 
as Latin-1.
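
A minimal sketch of why the charset matters, assuming the quote() 
function from urllib.parse in current Python 3 and a sample string 
chosen only for illustration (it is not taken from the tests or from 
the browser experiment above).  The same characters become different 
octets under different charsets, so the escaped form differs as well:

```python
from urllib.parse import quote

# Hypothetical sample text; any non-ASCII character shows the effect.
text = "café"

# The same characters map to different octets under different charsets,
# so the percent-escaped result differs too.
as_utf8 = quote(text, encoding="utf-8")
as_latin1 = quote(text, encoding="latin-1")

print(as_utf8)    # caf%C3%A9
print(as_latin1)  # caf%E9
```

A server that receives only the escaped form cannot tell which charset 
was used unless the application defines it.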

I think the general idea is that URL data must be encoded into ASCII, 
but the data being encoded may be in some "charset", which may be 
application-defined.  In the most general sense, I would argue that the 
data could simply be binary data.  (Actually, Latin-1 pretty much uses 
all the codes from 0 to 255, so it's very much like plain binary data 
anyway.)
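
As a rough illustration of the "plain binary data" point, the sketch 
below assumes quote_from_bytes() and unquote_to_bytes() from 
urllib.parse in current Python 3.  It uses the byte string 
b'\xa0\x24' from the quoted comment below; the variable names are 
made up for the example.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Latin-1 assigns a code point to every byte value from 0 to 255,
# so any byte string round-trips through it unchanged.
all_bytes = bytes(range(256))
round_trips = all_bytes.decode("latin-1").encode("latin-1") == all_bytes

# This byte string is not valid UTF-8 ...
raw = b"\xa0\x24"
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# ... yet it can still be percent-escaped octet by octet,
# and unescaping recovers the original bytes.
escaped = quote_from_bytes(raw)
restored = unquote_to_bytes(escaped)

print(round_trips, is_valid_utf8, escaped)  # True False %A0%24
```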

I hope that clarifies what I am reading in RFC 2396.

In addition, quote_plus() already handles all the cases I placed into 
urlencode().  I suppose the actual test cases may be debatable, but I 
did specifically choose tests with data which would be recognized as 
something other than UTF-8.
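
For reference, a short sketch of how these cases behave with 
urllib.parse in current Python 3.  This shows the present-day 
library, which is not necessarily identical to the patch under 
discussion, and the "name" key and its value are assumptions added 
for illustration.

```python
from urllib.parse import quote_plus, urlencode

# Byte strings are escaped octet by octet, with no charset assumed.
quoted = quote_plus(b"\xa0\x24")
pair = urlencode({b"\xa0\x24": b"\xc1\x24"})

# For str values, the charset can be chosen explicitly.
latin1_query = urlencode({"name": "café"}, encoding="latin-1")

print(quoted)        # %A0%24
print(pair)          # %A0%24=%C1%24
print(latin1_query)  # name=caf%E9
```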

Jeremy Hylton wrote:
> Jeremy Hylton <jeremy at alum.mit.edu> added the comment:
> 
> I'm not sure I understand the part of the code that deals with binary
> strings.  I agree the current behavior is odd.  RFC 2396 says that
> non-ascii characters must be encoded as utf-8 and then percent escaped.
>  In the test case you started with, you encoded b'\xa0\x24'.  It doesn't
> seem like this should be allowed, since it is not valid utf-8.
> 
> ----------
> nosy: +jhylton
> 
> _______________________________________
> Python tracker <report at bugs.python.org>
> <http://bugs.python.org/issue5468>
> _______________________________________

----------
title: urlencode does not handle "bytes", and could easily handle alternate encodings -> urlencode does not handle "bytes",	and could easily handle alternate encodings

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue5468>
_______________________________________