[Python-Dev] urllib.quote and unquote - Unicode issues

Thu Jul 31 05:49:17 CEST 2008

> Con: URI encoding does not encode characters.

OK, for all the people who say URI encoding does not encode characters: yes
it does. This is not an encoding for binary data, it's an encoding for
character data, but it's unspecified how the strings map to octets before
being percent-encoded. From RFC 3986, section
1.2.1<http://tools.ietf.org/html/rfc3986#section-1.2.1>
:

Percent-encoded octets (Section 2.1) may be used within a URI to represent
> characters outside the range of the US-ASCII coded character set if this
> representation is allowed by the scheme or by the protocol element in which
> the URI is referenced.  Such a definition should specify the character
> encoding used to map those characters to octets prior to being
> percent-encoded for the URI.

So the string->string proposal is actually correct behaviour. I'm all in
favour of a bytes->string version as well, just not with the names "quote"
and "unquote".

I'll prepare a new patch shortly which has bytes->string and string->bytes
versions of the functions as well. (quote will accept either type, while
unquote will output a str, there will be a new function unquote_to_bytes
which outputs a bytes - is everyone happy with that?)

Guido says:

> Actually, we'd need to look at the various other APIs in Py3k before we can
> decide whether these should be considered taking or returning bytes or text.
> It looks like all other APIs in the Py3k version of urllib treat URLs as
> text.

Yes, as I said in the bug tracker, I've groveled over the entire stdlib to
see how my patch affects the behaviour of dependent code. Aside from a few
minor bits which assumed octets (and did their own encoding/decoding) (which
I fixed), all the code assumes strings and is very happy to go on assuming
this, as long as the URIs are encoded with UTF-8, which they almost
certainly are.

Guido says:

> I think the only change is to remove the encoding arguments and ...

You really want me to remove the encoding= named argument? And hard-code
UTF-8 into these functions? It seems like we may as well have the optional
encoding argument, as it does no harm and could be of significant benefit.
I'll post a patch with the unquote_to_bytes function, but leave the encoding
arguments in until this point is clarified.

Matt
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20080731/209fd01b/attachment.htm>