[issue5468] urlencode does not handle "bytes", and could easily handle alternate encodings
Dan Mahn
report at bugs.python.org
Thu Mar 26 23:27:23 CET 2009
Dan Mahn <dan.mahn at digidescorp.com> added the comment:
Hello. Thanks for the feedback.
With regard to RFC 2396, I see this passage:
http://www.ietf.org/rfc/rfc2396.txt
====
There is a second translation for some resources: the sequence of
octets defined by a component of the URI is subsequently used to
represent a sequence of characters. A 'charset' defines this mapping.
There are many charsets in use in Internet protocols. For example,
UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences
of characters in the repertoire of ISO 10646.
====
To me, that text does not indicate that URLs are always encoded in
UTF-8. It indicates that URL information may be encoded in character
sets ('charset') other than ASCII, and when it is, the values must be
sent as escaped values. Here, I note the specific words "many charsets
in use" and "For example", before the reference to UTF-8.
I have also done a few tests, and have found that in practice, browsers
do not always encode URLs as UTF-8. The encoding actually seems to depend
on which part of the URL is being encoded. For instance, my Firefox will
encode the path portion of a URL as UTF-8, but encode the query string
as Latin-1.
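As a minimal sketch of that difference (the sample string is made up for
illustration, and this only mimics the two encodings with Python 3's
urllib.parse.quote(), it does not reproduce what any browser does):

```python
from urllib.parse import quote

text = "café"  # hypothetical sample value, not from the tests above

# UTF-8 turns 'é' into two octets (C3 A9) before percent-escaping.
path_style = quote(text, encoding="utf-8")

# Latin-1 turns 'é' into a single octet (E9) before percent-escaping.
query_style = quote(text, encoding="latin-1")

print(path_style)   # caf%C3%A9
print(query_style)  # caf%E9
```

The same characters give two different escaped forms, so the receiver
has to know which charset was used.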
I think the general idea is that URL data must be escaped into ASCII,
but the data being escaped may be in some "charset", which may be
application-defined. And in the most general sense, I would argue that
the data could simply be binary data. (Actually, Latin-1 assigns a
character to every code from 0 to 255, so it's very much like plain
binary data anyway.)
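A small sketch of the Latin-1 point, using only Python 3 built-ins. The
variable names are illustrative; the two-byte sample is the one from the
test case under discussion:

```python
# Every byte value 0..255 decodes under Latin-1 and encodes back unchanged.
raw = bytes(range(256))
as_text = raw.decode("latin-1")
round_trip = as_text.encode("latin-1")
print(round_trip == raw)  # True

# The same is not true of UTF-8: 0xA0 cannot start a UTF-8 sequence.
sample = b"\xa0\x24"
try:
    sample.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)  # False

# Under Latin-1 the sample is simply two characters: U+00A0 and '$'.
latin1_text = sample.decode("latin-1")
```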
I hope that clarifies what I am reading in RFC 2396.
In addition, quote_plus() already handles all the cases I placed into
urlencode(). I suppose the actual test cases may be debatable, but I
did specifically choose tests with data which would be recognized as
something other than UTF-8.
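For illustration, a sketch of the quote_plus() behavior referred to
here, run against Python 3's urllib.parse. The urlencode() call at the
end assumes the `encoding` argument found in current Python 3 releases,
not the patch attached to this issue, and its sample data is made up:

```python
from urllib.parse import quote_plus, urlencode

# Raw bytes are percent-escaped octet by octet, with no charset involved.
escaped_bytes = quote_plus(b"\xa0\x24")

# A str is first encoded with the requested charset, then escaped.
escaped_text = quote_plus("\xa0$", encoding="latin-1")

print(escaped_bytes)  # %A0%24
print(escaped_text)   # %A0%24

# Assumption: urlencode() accepts `encoding`, as in current Python 3.
query = urlencode({"name": "café au lait"}, encoding="latin-1")
print(query)  # name=caf%E9+au+lait
```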
Jeremy Hylton wrote:
> Jeremy Hylton <jeremy at alum.mit.edu> added the comment:
>
> I'm not sure I understand the part of the code that deals with binary
> strings. I agree the current behavior is odd. RFC 2396 says that
> non-ascii characters must be encoded as utf-8 and then percent escaped.
> In the test case you started with, you encoded b'\xa0\x24'. It doesn't
> seem like this should be allowed, since it is not valid utf-8.
>
> ----------
> nosy: +jhylton
>
> _______________________________________
> Python tracker <report at bugs.python.org>
> <http://bugs.python.org/issue5468>
> _______________________________________