[Python-Dev] urllib.quote and unquote - Unicode issues

Wed Jul 30 20:00:09 CEST 2008

On Wed, Jul 30, 2008 at 10:33 AM, Bill Janssen <janssen at parc.com> wrote:
>> It looks like all other APIs in the Py3k version of
>> urllib treat URLs as text.
>
> The URL is text, a string of ASCII characters.  We're just talking
> about urllib.quote() and urllib.unquote(), which are there to support
> the text-ization of binary values, and the de-text-ization.
>
>> I think that would break too much code, without a good way to
>> automatically fix it.
>
> You'd rather break Python?  Somehow I don't think so.

Let's stop the rhetoric, or I'll have to beat you over the head with
the Zen of Python. :-)

urllib is not meant as a reference implementation of any RFC; it is
meant as a practical tool for Python users writing web apps (servers
and clients).

> Here's the signature I'm proposing:
>
>  quote() -- takes string or bytes, and produces string.
>
>     If input is a string, looks to optional "encoding" parameter to
>     determine character set encoding to use to transform it to bytes before
>     quoting it.  If "encoding" is not specified, defaults to UTF-8.

No contest here, since it supports the common string->string use case.
E.g. quote('a%b') returns 'a%25b'.
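
To make that concrete, here is a minimal sketch of the string->string and
bytes->string cases. It assumes a Python 3 interpreter in which the function
is available as urllib.parse.quote with an "encoding" keyword; the sample
strings are made up for illustration.

```python
from urllib.parse import quote

# str in, str out: the common case from the text above.
ascii_case = quote('a%b')                      # 'a%25b'

# Non-ASCII text is encoded to bytes first (UTF-8 by default), then quoted.
utf8_case = quote('caf\u00e9')                 # 'caf%C3%A9'

# An explicit encoding changes which bytes get percent-escaped.
latin1_case = quote('caf\u00e9', encoding='latin-1')   # 'caf%E9'

# bytes in, str out: the bytes are quoted as given, no encoding step.
bytes_case = quote(b'caf\xe9')                 # 'caf%E9'
```

Either way the result is a plain ASCII string, so callers that only deal in
text never have to touch bytes.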

>  unquote() -- takes string, produces bytes or string
>
>     If optional "encoding" parameter is specified, decodes bytes with
>     that encoding and returns string.  Otherwise, returns bytes.

The default of returning bytes will break almost all uses. Most code
will use the unquoted result as a text string, not as bytes -- e.g. a
server has to unquote the values it receives from a form (whether POST
or GET), but almost always the unquoted values are text, e.g.
someone's name or address, or a draft email message.
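
A small sketch of that server-side case, assuming a Python 3 interpreter
where urllib.parse provides both unquote() (returning text) and
unquote_to_bytes() (returning bytes). The form value is hypothetical.

```python
from urllib.parse import unquote, unquote_to_bytes

# Hypothetical percent-encoded form value (UTF-8 bytes, then quoted).
raw_value = 'J%C3%BCrgen%20M%C3%BCller'

# Text result: ready to use as a name, address, message body, etc.
name = unquote(raw_value)

# Bytes result: the caller has to decode before doing anything textual.
name_bytes = unquote_to_bytes(raw_value)
decoded_by_hand = name_bytes.decode('utf-8')
```

With a bytes default, every such call site would need the extra decode step
shown in the last line.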

(Aside: I dislike functions that have a different return type based on
the value of a parameter.)
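
To illustrate the aside: the function below is a hypothetical stand-in that
mimics the proposed unquote() signature. It is not a real urllib function;
it is built on urllib.parse.unquote_to_bytes and assumes Python 3.

```python
from urllib.parse import unquote_to_bytes

def proposed_unquote(s, encoding=None):
    """Hypothetical: return bytes, or str if an encoding is given."""
    raw = unquote_to_bytes(s)
    if encoding is None:
        return raw                  # bytes
    return raw.decode(encoding)     # str

as_bytes = proposed_unquote('a%25b')                   # b'a%b'
as_text = proposed_unquote('a%25b', encoding='utf-8')  # 'a%b'

# The same call site yields different types depending on an argument value,
# so code that concatenates the result with text works in one case only.
text_ok = 'value: ' + as_text
try:
    'value: ' + as_bytes
    bytes_ok = True
except TypeError:
    bytes_ok = False
```

Two separately named functions, one per return type, would avoid this.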

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)