Re: [Python-Dev] urllib.quote and unquote - Unicode issues

On Wed, Jul 30, 2008 at 10:33 AM, Bill Janssen <janssen@parc.com> wrote:
It looks like all other APIs in the Py3k version of urllib treat URLs as text.
The URL is text, a string of ASCII characters. We're just talking about urllib.quote() and urllib.unquote(), which are there to support the text-ization of binary values, and the de-text-ization.
I think that would break too much code, without a good way to automatically fix it.
You'd rather break Python? Somehow I don't think so.
Let's stop the rhetoric, or I'll have to beat you over the head with the Zen of Python. :-) urllib is not meant as a reference implementation of any RFC; it is meant as a practical tool for Python users writing web apps (servers and clients).
Here's the signature I'm proposing:
quote() -- takes string or bytes, and produces string.
If input is a string, looks to optional "encoding" parameter to determine character set encoding to use to transform it to bytes before quoting it. If "encoding" is not specified, defaults to UTF-8.
No contest here, since it supports the common string->string use case. E.g. quote('a%b') returns 'a%25b'.
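
For illustration, the quote() behaviour described above can be sketched with urllib.parse.quote as found in current Python 3. That is an assumption made for this example only, since the API was still under discussion in this thread, and the sample strings are made up.

```python
from urllib.parse import quote

# str input: encoded to bytes first (UTF-8 unless "encoding" is given),
# then percent-escaped. The result is always a str.
plain = quote('a%b')
utf8 = quote('caf\u00e9')
latin1 = quote('caf\u00e9', encoding='latin-1')

# bytes input: percent-escaped directly, no character encoding involved.
raw = quote(b'\xff\x00')

print(plain, utf8, latin1, raw)
```

Either input type yields a string, which matches the "takes string or bytes, and produces string" signature.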
unquote() -- takes string, produces bytes or string
If optional "encoding" parameter is specified, decodes bytes with that encoding and returns string. Otherwise, returns bytes.
The default of returning bytes will break almost all uses. Most code will use the unquoted result as a text string, not as bytes -- e.g. a server has to unquote the values it receives from a form (whether POST or GET), but almost always the unquoted values are text, e.g. someone's name or address, or a draft email message. (Aside: I dislike functions that have a different return type based on the value of a parameter.)

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
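
The "unquoted values are text" use case can be sketched as follows, assuming urllib.parse.unquote from current Python 3 (where the default decoding is UTF-8). The form value here is invented for the example.

```python
from urllib.parse import unquote

# A percent-encoded form value, e.g. someone's name (hypothetical input).
name = unquote('Jos%C3%A9')

# The same name sent by a client that used Latin-1 instead of UTF-8.
legacy = unquote('Jos%E9', encoding='latin-1')

# Both calls return text, regardless of the "encoding" argument.
print(type(name).__name__, type(legacy).__name__)
```

The return type stays the same whether or not "encoding" is passed; only the decoding changes.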

(Aside: I dislike functions that have a different return type based on the value of a parameter.)
I wanted to stay out of the whole discussion as it's largely over my head... But I did want to express support for this idea, which I think almost rises to the level of a standard... I see more bugs created in our software because of the above issues than anything else... I have no problem with functions that accept various inputs, but producing various outputs just seems to wreak havoc...

unquote() -- takes string, produces bytes or string
If optional "encoding" parameter is specified, decodes bytes with that encoding and returns string. Otherwise, returns bytes.
The default of returning bytes will break almost all uses. Most code will use the unquoted result as a text string, not as bytes -- e.g. a server has to unquote the values it receives from a form (whether POST or GET), but almost always the unquoted values are text, e.g. someone's name or address, or a draft email message.
I actually do know a lot about the uses of this function... But: OK, OK, I yield. Though I still think this is a bad idea, I'll shut up if we can also add "unquote_as_bytes" which returns a byte sequence instead of a string. I'll just change my code to use that.
(Aside: I dislike functions that have a different return type based on the value of a parameter.)
Fair enough.

Bill
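
A rough sketch of the compromise reached here, with one function per return type. It assumes current Python 3, where the bytes-returning variant is spelled urllib.parse.unquote_to_bytes rather than the "unquote_as_bytes" name suggested above. The escaped value is made up.

```python
from urllib.parse import unquote, unquote_to_bytes

escaped = '%FFabc'  # hypothetical binary payload, not valid UTF-8

# Always bytes: suits callers that treat the value as binary data.
as_bytes = unquote_to_bytes(escaped)

# Always str: bytes that cannot be decoded are replaced with U+FFFD
# under the default errors='replace'.
as_text = unquote(escaped)

print(repr(as_bytes), repr(as_text))
```

Keeping two functions means neither one changes its return type based on a parameter.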
participants (3): Bill Janssen, Guido van Rossum, Jeff Hall