[Python-Dev] urllib.quote and unquote - Unicode issues

Wed Jul 30 19:12:46 CEST 2008

On Wed, Jul 30, 2008 at 9:52 AM, Bill Janssen <janssen at parc.com> wrote:
>> On Wed, Jul 30, 2008 at 8:09 AM, André Malo <nd at perlig.de> wrote:
>> > I'm actually in favour of encoding bytes only back and forth. A useful
>> > extension would be *another* function which wraps quote/unquote and
>> > encodes and decodes characters.
>>
>> I'd reverse this. By all means, add a new pair of functions that is
>> bytes in / bytes out. But keep the existing functions purely string in
>> / string out, hardcoded to UTF-8. People wanting another encoding can
>> use the bytes functions and explicit encode / decode calls.
>
> Actually (as I pointed out before) the existing functions are not
> string-in/string-out.  They are something-in and bytes-out.  They just look
> like string-in/string-out because of the confusion between byte
> strings and Unicode strings in Python 1 and 2.

Actually, we'd need to look at the various other APIs in Py3k before
we can decide whether these should be considered taking or returning
bytes or text. It looks like all other APIs in the Py3k version of
urllib treat URLs as text. I don't think switching these to bytes
would be a good idea; you might as well claim that filenames should be
bytes because that's how the filesystem stores them.

> Look, Matt's suggestion is a degradation of the integrity of the
> stdlib, because it enthrones a broken understanding, a misreading of
> the RFC, in a very prominent place.  I'd prefer not to have Python
> contribute to that breakage.  Keep the functions the way they are now:
> bytes-in and bytes-out.

I think that would break too much code, without a good way to
automatically fix it.
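
To make the two designs concrete, here is a rough sketch of how they
compare. It assumes the function names found in present-day Python 3's
urllib.parse (quote, unquote, quote_from_bytes, unquote_to_bytes), which
may not match what was on the table when this was written; the sample
path is made up for illustration.

```python
from urllib.parse import quote, unquote, quote_from_bytes, unquote_to_bytes

# Hypothetical sample path containing non-ASCII characters.
text = "/café/ü"

# Text in, text out: quote() encodes the string as UTF-8 by default
# before percent-escaping, and unquote() decodes as UTF-8 by default.
quoted_utf8 = quote(text)
print(quoted_utf8)            # /caf%C3%A9/%C3%BC
print(unquote(quoted_utf8))   # /café/ü

# Bytes in, text out (and back): callers who want another encoding do
# the encode/decode step explicitly around the bytes functions.
raw = text.encode("latin-1")
quoted_latin1 = quote_from_bytes(raw)
print(quoted_latin1)          # /caf%E9/%FC
roundtrip = unquote_to_bytes(quoted_latin1).decode("latin-1")
print(roundtrip)              # /café/ü
```

The same text yields different percent-escapes depending on the
encoding, which is why the choice of a default matters for the
text-based functions.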

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)