[Python-Dev] urllib.quote and unquote - Unicode issues
Guido van Rossum
guido at python.org
Wed Jul 30 20:00:09 CEST 2008
On Wed, Jul 30, 2008 at 10:33 AM, Bill Janssen <janssen at parc.com> wrote:
>> It looks like all other APIs in the Py3k version of
>> urllib treat URLs as text.
>
> The URL is text, a string of ASCII characters. We're just talking
> about urllib.quote() and urllib.unquote(), which are there to support
> the text-ization of binary values, and the de-text-ization.
>
>> I think that would break too much code, without a good way to
>> automatically fix it.
>
> You'd rather break Python? Somehow I don't think so.
Let's stop the rhetoric, or I'll have to beat you over the head with
the Zen of Python. :-)
urllib is not meant as a reference implementation of any RFC; it is
meant as a practical tool for Python users writing web apps (servers
and clients).
> Here's the signature I'm proposing:
>
> quote() -- takes string or bytes, and produces string.
>
> If input is a string, looks to optional "encoding" parameter to
> determine character set encoding to use to transform it to bytes before
> quoting it. If "encoding" is not specified, defaults to UTF-8.
No contest here, since it supports the common string->string use case.
E.g. quote('a%b') returns 'a%25b'.
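
For illustration, here is a small sketch of the string->string case
described above. It assumes the function as it later shipped in Python 3
under the name urllib.parse.quote (the thread itself only says
urllib.quote), with an optional "encoding" argument that defaults to
UTF-8. The variable names are made up for the example.

```python
from urllib.parse import quote

# str in, str out: the '%' itself gets escaped.
text_quoted = quote('a%b')

# Non-ASCII text is encoded to bytes first (UTF-8 unless told otherwise),
# then each byte is percent-escaped.
utf8_quoted = quote('\u00e9')                        # e-acute via UTF-8
latin1_quoted = quote('\u00e9', encoding='latin-1')  # same char via Latin-1

# bytes in, str out: the bytes are escaped as-is, no encoding step.
bytes_quoted = quote(b'\xe9')

print(text_quoted, utf8_quoted, latin1_quoted, bytes_quoted)
```

The Latin-1 and raw-bytes calls should give the same result, since both
end up escaping the single byte 0xE9.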
> unquote() -- takes string, produces bytes or string
>
> If optional "encoding" parameter is specified, decodes bytes with
> that encoding and returns string. Otherwise, returns bytes.
The default of returning bytes will break almost all uses. Most code
will use the unquoted result as a text string, not as bytes -- e.g. a
server has to unquote the values it receives from a form (whether POST
or GET), but almost always the unquoted values are text, e.g.
someone's name or address, or a draft email message.
(Aside: I dislike functions that have a different return type based on
the value of a parameter.)
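
As a rough sketch of the two concerns above (text as the common case, and
one return type per function), the following assumes the API that later
shipped in Python 3: urllib.parse.unquote always returns a string, and a
separate urllib.parse.unquote_to_bytes returns bytes. Neither name is
settled in this message; the sample form value is invented.

```python
from urllib.parse import unquote, unquote_to_bytes

# A form value as a server might receive it (hypothetical input).
form_value = 'Jos%C3%A9'

# Common case: the caller wants text. Decoding defaults to UTF-8.
name = unquote(form_value)

# Callers who need the raw octets use a separate function, so the
# return type never depends on an argument's value.
raw = unquote_to_bytes(form_value)

# A different character set can be requested explicitly.
latin1_name = unquote('Jos%E9', encoding='latin-1')

print(name, raw, latin1_name)
```

With this split, a server handling names or addresses gets strings by
default, and the bytes-returning variant stays available without
overloading the return type.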
--
--Guido van Rossum (home page: http://www.python.org/~guido/)