[Python-Dev] urllib.quote and unquote - Unicode issues

Bill Janssen janssen at parc.com
Mon Jul 14 19:39:42 CEST 2008


>> Clearly the unquote is str->bytes, <snip> You can't pass a Unicode string back
>> as the result of unquote *without* passing in an encoding specifier,
>> because the character set is application-specific.
> So for unquote you're suggesting that it always return a bytes object
> UNLESS an encoding is specified? As in:
> >>> urllib.parse.unquote('h%C3%BCllo')
> b'h\xc3\xbcllo'

Yes, that's correct.  That's what the RFC says we have to do.

> I would object to that on two grounds. Firstly, I wouldn't expect or
> desire a bytes object. The vast majority of uses for unquote will be
> to get a character string out, not bytes. Secondly, there is a
> mountain of code (including about 12 modules in the standard library)
> which call unquote and don't give the user the encoding option, so
> it's best if we pick a default that is what the majority of users will
> expect. I argue that that's UTF-8.

Unfortunately, despite your expectations or desires, the spec doesn't
allow us that luxury.  It's bytes out, and they may even be in a
non-standard (not registered with IANA) encoding.  There's no way to
safely and correctly turn that sequence of bytes into a string.  If
other modules have been mis-using the interface, they are buggy and
should be fixed.  There's a lot of buggy stdlib code in Python around
the older Web standards.
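
To make the ambiguity concrete, here is a minimal sketch. It assumes a
current Python 3, where urllib.parse.unquote_to_bytes is available (it
did not exist when this message was written), and the two quoted
strings are made-up examples rather than anything from a real
application.

```python
from urllib.parse import unquote_to_bytes

# Two hypothetical percent-encoded forms of "hüllo", produced by
# applications that chose different character sets.
utf8_quoted = 'h%C3%BCllo'    # "ü" encoded as UTF-8 (two bytes)
latin1_quoted = 'h%FCllo'     # "ü" encoded as ISO-8859-1 (one byte)

# Unquoting itself is unambiguous: it only yields bytes.
utf8_raw = unquote_to_bytes(utf8_quoted)
latin1_raw = unquote_to_bytes(latin1_quoted)

# Decoding works only when the caller knows the right charset.
utf8_text = utf8_raw.decode('utf-8')
latin1_text = latin1_raw.decode('iso-8859-1')

# Guessing UTF-8 for the ISO-8859-1 bytes fails outright.
try:
    latin1_raw.decode('utf-8')
    guess_failed = False
except UnicodeDecodeError:
    guess_failed = True

# Guessing ISO-8859-1 for the UTF-8 bytes "succeeds" but gives mojibake.
wrong_guess = utf8_raw.decode('iso-8859-1')
```

Both inputs stand for the same text, yet nothing in the quoted form
says which charset applies; a wrong guess either raises or silently
produces the wrong characters.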

I think it would be great to have another function, unquote_to_string,
which took an extra "encoding" parameter, and returned a string.  It
would also be OK to add a keyword parameter to "unquote", I think,
which provides an encoding, and causes unquote to return a string.
But the standard behavior has to be to return bytes.
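
The two options above might look roughly like the following sketch. It
is only an illustration of the proposal: the function bodies, the
"errors" parameter, and its default are assumptions, and both
functions are hypothetical definitions built on the real
urllib.parse.unquote_to_bytes from current Python 3. They are not the
standard library's own unquote.

```python
from urllib.parse import unquote_to_bytes


def unquote(s, encoding=None, errors='strict'):
    """Hypothetical unquote: bytes by default, str only on request."""
    raw = unquote_to_bytes(s)
    if encoding is None:
        # No charset given, so the only safe answer is the raw bytes.
        return raw
    return raw.decode(encoding, errors)


def unquote_to_string(s, encoding, errors='strict'):
    """Hypothetical helper: the encoding is mandatory, the result is str."""
    return unquote_to_bytes(s).decode(encoding, errors)


as_bytes = unquote('h%C3%BCllo')
as_text = unquote('h%C3%BCllo', encoding='utf-8')
as_latin1_text = unquote_to_string('h%FCllo', 'iso-8859-1')
```

Either way, the caller has to name the character set explicitly
before getting a string back.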

> I'd prefer having a separate unquote_raw function which is
> str->bytes, and the unquote function performs the same role as it
> always has, which is str->str.

Actually, it was originally bytes->bytes, because there was no notion
of Unicode strings when it was added.  It perhaps got misunderstood
during the addition of Unicode support to Python; many people have had
trouble wrapping their heads around all this, myself included.

Bill

