Re: [Python-Dev] urllib.quote and unquote - Unicode issues

Bill wrote: I'm not sure that's sufficient review, though I agree it's necessary. The major consumers of quote/unquote are not in the Python standard library.

I figured that Python 3.0 is designed to fix things, with breaking third-party code being an acceptable side effect of that. So the most important thing when 3.0 is released is that the stdlib is internally consistent. All other code is "allowed" to be broken. So I've investigated all the code necessary.

Having said this, my patch breaks almost no code. Your suggestion breaks a hell of a lot.

Bill wrote: Sure. All I was asking was that we not break the existing usage of the standard library "unquote" by producing a string by *assuming* a UTF-8 encoded string is what's in those percent-encoded bytes (instead of, say, ISO 2022-JP). Let the "new" function produce a string: "unquote_as_string".

You're assuming that a Python 2.x "str" is the same thing as a Python 3.0 "bytes". It isn't. (If it were, this transition would be trivial.) A Python 2 "str" is a non-Unicode string. It can be printed, concatenated with Unicode strings, etc. It has the semantics of a string. The Python 3.0 "bytes" is not a string at all.

What you're saying is "the old behaviour was to output a bytes, so the new behaviour should be consistent". But that isn't true - the old behaviour was to output a string (a non-Unicode one). People, and code, expect it to output something with string semantics. So making unquote output a bytes is just as big a change as making it output a (unicode) str. Python 3.0 doesn't have a type which is like Python 2's "str" type (which is good - that type was very messy). So the argument that "Python 2 unquote outputs a bytes, so we should too" is not legitimate.

If you want to keep pushing this, please install my new patch (patch 6). Then rename "unquote" to "unquote_to_string" and rename "unquote_to_bytes" to "unquote", and witness the havoc that ensues. Firstly, you break most Internet-related modules in the standard library. 10 tests failed:
test_SimpleHTTPServer test_cgi test_email test_http_cookiejar
test_httpservers test_robotparser test_urllib test_urllib2
test_urllib2_localnet test_wsgiref
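
To make the str/bytes distinction above concrete, here is a minimal sketch. It assumes the urllib.parse functions as they exist in released Python 3 (unquote, unquote_to_bytes, and the encoding argument), which may not match patch 6 in every detail. The '漢字' sample and the variable names are only illustrative.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# Percent-encode the same text from two different byte encodings.
quoted_utf8 = quote("漢字")                      # quote() encodes str as UTF-8 by default
quoted_jp = quote("漢字".encode("iso2022_jp"))   # quote() also accepts raw bytes

# unquote() returns a str and assumes UTF-8 unless told otherwise.
as_text = unquote(quoted_utf8)

# unquote_to_bytes() returns bytes and assumes no encoding at all.
as_bytes = unquote_to_bytes(quoted_utf8)

print(type(as_text).__name__, type(as_bytes).__name__)

# bytes do not have Python 2 "str" semantics: mixing them with str fails.
try:
    "/files/" + as_bytes
except TypeError as exc:
    print("TypeError:", exc)

# A caller who knows the bytes are not UTF-8 can say so explicitly.
jp_text = unquote(quoted_jp, encoding="iso2022_jp")
print(jp_text == as_text)
```

Under these assumptions, a caller who needs a non-UTF-8 interpretation can either pass the encoding or ask for the raw bytes, while code that simply expects string semantics keeps working with the default.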
Fixing these failures isn't a matter of changing test cases (which all but one of my fixes were). It would require changes to all the modules, to get them to deal with bytes instead of strings (which would generally mean spraying .decode("utf-8") all over the place). My code, on the other hand, "tends to be" compatible with 2.x code.

Here I'm seeing:

BytesWarning: Comparison between bytes and string.
TypeError: expected an object with the buffer interface
http.client.BadStatusLine

For another example, try this:
The current (unpatched) build works, but links to files with non-ASCII filenames (e.g. '漢字') break, because of the URL. This is one example of my patch directly fixing a bug in real code. With my patch applied, the links work fine *because URL quoting and unquoting are consistent, and work on all Unicode characters*.

If you change unquote to output a bytes, it breaks completely. You get a "TypeError: expected an object with the buffer interface" as soon as the user visits the page.

Matt
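
The consistency point about non-ASCII filenames can be sketched as a round trip. This is only an illustration, assuming quote and unquote as found in released Python 3's urllib.parse; href_for and filename_from are hypothetical helpers, not code from the patch or from any standard library module.

```python
from urllib.parse import quote, unquote

def href_for(filename):
    """Hypothetical helper: build a link target for a file name."""
    return "/" + quote(filename)

def filename_from(path):
    """Hypothetical helper: recover the file name from a request path."""
    return unquote(path.lstrip("/"))

names = ["report.txt", "漢字", "naïve file.txt"]
links = [href_for(name) for name in names]
recovered = [filename_from(link) for link in links]

for name, link, back in zip(names, links, recovered):
    link.encode("ascii")   # the quoted link is pure ASCII, so this cannot fail
    print(repr(name), "->", link, "->", repr(back))
```

Because both directions default to the same encoding, every file name comes back as the str it started as, which is what a page of links needs.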