[Python-Dev] urllib.quote and unquote - Unicode issues

Mon Jul 14 05:22:23 CEST 2008

On Mon, Jul 14, 2008 at 4:54 AM, André Malo <nd at perlig.de> wrote:

>
> Ahem. The HTTP standard does ;-)
>

Really? Can you include a quotation please? The HTTP standard talks a lot
about ISO-8859-1 (Latin-1) in terms of actually raw encoded bytes, but not
in terms of URI percent-encoding (a different issue) as far as I can tell.

>
> > Where web forms are concerned, basically setting the form accept-charset
> > or the page charset is the *maximum amount* of control you have over the
> > encoding. As you say, it can be encoded by another page or the user can
> > override their settings. Then what can you do as the server? Nothing ...
>
> Guessing works pretty well in most of the cases.
>

Are you suggesting that urllib.unquote guess the encoding? It could do that
but it would make things rather unpredictable. I think if this was an
application (such as a web browser), then guessing is OK. But this is a
library function. Library functions should not make arbitrary decisions;
they should be well-specified.

Latin-1 is not exactly arbitray. Besides being a charset - it maps
> one-to-one to octet values, hence it's commonly used to encode octets and
> is therefore a better fallback than every other encoding.
>

True. So the only advantage I see to the current implementation is that if
you really want to, you can take the Latin-1-decoded URI (from unquote) and
explicitly encode it as Latin-1 and then decode it again as whatever
encoding you want. But that would be a hack, would it not? I'd prefer if the
library didn't require a hack just to get the extremely common use case
(UTF-8).

>
> > I agree. However if there *was* a proper standard we wouldn't have to
> > argue! "Most proper" and "should do" is the most confident we can be when
> > dealing with this standard, as there is no correct encoding.
>
> Well, the standard says, there are octets to be encoded. I find that proper
> enough.

Yes but unfortunately we aren't talking about octets any more in Python 3,
but characters. If we're going to follow the standard and encode octets,
then we should be accepting (for quote) and returning (for unquote) bytes
objects, not strings. But as that's going to break most existing code and be
extremely confusing, I think it's best we try and solve this problem for
Unicode strings.

> > Does anyone have a suggestion which will be more compatible with the rest
> > of the world than allowing the user to select an encoding, and defaulting
> > to "utf-8"?
>
> Default to latin-1 for decoding and utf-8 for encoding. This might be
> confusing though, so maybe you've asked the wrong question ;)
>

:o that would break so so much existing code, not to mention being horribly
inconsistent and confusing. Having said that, that's almost what the current
behaviour is (quote uses Latin-1 for characters < 256, and UTF-8 for
characters above; unquote uses Latin-1).

Again I bring up the http server example. If you go to a directory, create a
file with a name such as '漢字', and then run this code in Python 3.0 from
that directory:

import http.server
s = http.server.HTTPServer(('',8000),
        http.server.SimpleHTTPRequestHandler)
s.serve_forever()

You'll see the file in the directory listing - its HTML will be <a
href="%E6%BC%A2%E5%AD%97">漢字</a>. But if you click it, you get a 404 because
the server will look for the file named unquote("%E6%BC%A2%E5%AD%97") =
'æ¼¢å\xad\x97'.

If you apply my patch (patch5) *everything* *just* *works*.

On Mon, Jul 14, 2008 at 6:36 AM, Bill Janssen <janssen at parc.com> wrote:

> > Ah there may be some confusion here. We're only dealing with str->str
> > transformations (which in Python 3 means Unicode strings). You can't put
> a
> > bytes in or get a bytes out of either of these functions. I suggested a
> > "quote_raw" and "unquote_raw" function which would let you do this.
>
> Ah, well, that's a problem.  Clearly the unquote is str->bytes, while
> the quote is (bytes OR str)->str.
>

OK so for quote, you're suggesting that we accept either a bytes or a str
object. That sounds quite reasonable (though neither the unpatched or
patched versions accept a bytes at the moment). I'd simply change the code
in quote (from patch5) to do this:

if isinstance(s, str):
    s = s.encode(encoding, errors)
....
res = map(quoter, s)

Now you get this behaviour by default (which may appear confusing but I'd
argue correct given the different semantics of 'h\xfcllo' and b'h\xfcllo'):

>>> urllib.parse.quote(b'h\xfcllo')
'h%FCllo'           # Directly-encoded octets
>>> urllib.parse.quote('h\xfcllo')
'h%C3%BCllo'     # UTF-8 encoded string, then encoded octets

Clearly the unquote is str->bytes, <snip> You can't pass a Unicode string
> back
> as the result of unquote *without* passing in an encoding specifier,
> because the character set is application-specific.
>

So for unquote you're suggesting that it always return a bytes object UNLESS
an encoding is specified? As in:

>>> urllib.parse.unquote('h%C3%BCllo')
b'h\xc3\xbcllo'

I would object to that on two grounds. Firstly, I wouldn't expect or desire
a bytes object. The vast majority of uses for unquote will be to get a
character string out, not bytes. Secondly, there is a mountain of code
(including about 12 modules in the standard library) which call unquote and
don't give the user the encoding option, so it's best if we pick a default
that is what the majority of users will expect. I argue that that's UTF-8.

I'd prefer having a separate unquote_raw function which is str->bytes, and
the unquote function performs the same role as it always have, which is
str->str. But I agree on quote, I think it can be (bytes OR str)->str.

Matt
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20080714/e2c64a4a/attachment-0001.htm>