[Python-Dev] urllib.quote and unquote - Unicode issues

Thu Jul 31 14:25:09 CEST 2008

Bill wrote:

I'm not sure that's sufficient review, though I agree it's necessary.
>
The major consumers of quote/unquote are not in the Python standard
>
library.

I figured that Python 3.0 is designed to fix things, with the breaking
third-party code being an acceptable side-effect of that. So the most
important thing when 3.0 is released is that the stdlib is internally
consistent. All other code is "allowed" to be broken. So I've investigated
all the code necessary.

Having said this, my patch breaks almost no code. Your suggestion breaks a
hell of a lot.

Sure.  All I was asking was that we not break the existing usage of
>
the standard library "unquote" by producing a string by *assuming* a
>
UTF-8 encoded string is what's in those percent-encoded bytes (instead
>
of, say, ISO 2022-JP).  Let the "new" function produce a string:
>
"unquote_as_string".

You're assuming that a Python 2.x "str" is the same thing as a Python 3.0
"bytes". It isn't. (If it was, this transition would be trivial). A Python 2
"str" is a non-Unicode string. It can be printed, concatenated with Unicode
strings, etc etc. It has the semantics of a string. The Python 3.0 "bytes"
is not a string at all.

What you're saying is "the old behaviour was to output a bytes, so the new
behaviour should be consistent". But that isn't true - the old behaviour was
to output a string (a non-Unicode one). People, and code, expect it to
output something with string semantics. So making unquote output a bytes is
just as big a change as making it output a (unicode) str. Python 3.0 doesn't
have a type which is like Python 2's "str" type (which is good - that type
was very messy). So the argument that "Python 2 unquote outputs a bytes, so
we should too" is not legitimate.

If you want to keep pushing this, please install my new patch (patch 6).
Then rename "unquote" to "unquote_to_string" and rename "unquote_to_bytes"
to "unquote", and witness the havoc that ensues. Firstly, you break most
Internet-related modules in the standard library.

10 tests failed:
>
    test_SimpleHTTPServer test_cgi test_email test_http_cookiejar
>
    test_httpservers test_robotparser test_urllib test_urllib2
>
    test_urllib2_localnet test_wsgiref
>

Fixing these isn't a matter of changing test cases (which all but one of my
fixes were). It would require changes to all the modules, to get them to
deal with bytes instead of strings (which would generally mean spraying
.decode("utf-8") all over the place). My code, on the other hand, "tends to
be" compatible with 2.x code.

Here I'm seeing:
BytesWarning: Comparison between bytes and string.
TypeError: expected an object with the buffer interface
http.client.BadStatusLine

For another example, try this:

>>> import http.server
>>> s = http.server.HTTPServer(('',8000),
http.server.SimpleHTTPRequestHandler)
>>> s.serve_forever()

The current (unpatched) build works, but links to files with non-ASCII
filenames (eg. '漢字') break, because of the URL. This is one example of my
patch directly fixing a bug in real code. With my patch applied, the links
work fine *because URL quoting and unquoting are consistent, and work on all
Unicode characters*.

If you change unquote to output a bytes, it breaks completely. You get a
"TypeError: expected an object with the buffer interface" as soon as the
user visits the page.

Matt
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20080731/7bf35eb4/attachment.htm>