<div dir="ltr">Bill wrote:<br><br><blockquote style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;" class="gmail_quote">
I'm not sure that's sufficient review, though I agree it's necessary.<br></blockquote><blockquote style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;" class="gmail_quote">
The major consumers of quote/unquote are not in the Python standard<br></blockquote><blockquote style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;" class="gmail_quote">
library.</blockquote><br>I
figured that Python 3.0 is designed to fix things, with breakage of
third-party code being an acceptable side-effect of that. So the most
important thing when 3.0 is released is that the stdlib is internally
consistent. All other code is "allowed" to be broken. So I've
investigated all the code necessary.<br><br>Having said this, my patch breaks almost no code. Your suggestion breaks a hell of a lot.<br>
<br><blockquote style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;" class="gmail_quote">Sure. All I was asking was that we not break the existing usage of<br></blockquote><blockquote style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;" class="gmail_quote">
the standard library "unquote" by producing a string by *assuming* a<br></blockquote><blockquote style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;" class="gmail_quote">
UTF-8 encoded string is what's in those percent-encoded bytes (instead<br></blockquote><blockquote style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;" class="gmail_quote">
of, say, ISO 2022-JP). Let the "new" function produce a string:<br></blockquote><blockquote style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;" class="gmail_quote">
"unquote_as_string".</blockquote><br>You're
assuming that a Python 2.x "str" is the same thing as a Python 3.0
"bytes". It isn't. (If it were, this transition would be trivial.) A
Python 2 "str" is a non-Unicode string. It can be printed, concatenated
with Unicode strings, etc. It has the semantics of a string. The
Python 3.0 "bytes" is not a string at all.<br>
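A minimal sketch of that difference, assuming a Python 3 interpreter (the variable names are only for illustration):

```python
# Python 3: bytes and str are distinct types and do not mix implicitly.
raw = b"caf%C3%A9"
text = "caf%C3%A9"

try:
    raw + text           # concatenating bytes with str raises TypeError
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Indexing bytes yields an int, not a one-character string.
first_byte = raw[0]
first_char = text[0]

# Going from bytes to str always needs an explicit decoding step.
decoded = raw.decode("ascii")
```

In Python 2, the equivalent "str" value would have concatenated with a Unicode string without complaint, which is why code written against it expects string semantics.<br>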
<br>What you're saying is "the old behaviour was to output a bytes, so
the new behaviour should be consistent". But that isn't true - the old
behaviour was to output a string (a non-Unicode one). People, and code,
expect it to output something with string semantics. So making unquote
output a bytes is just as big a change as making it output a (unicode)
str. Python 3.0 doesn't have a type which is like Python 2's "str" type
(which is good - that type was very messy). So the argument that
"Python 2 unquote outputs a bytes, so we should too" is not legitimate.<br><br>If
you want to keep pushing this, please install my new patch (patch 6).
Then rename "unquote" to "unquote_to_string" and rename
"unquote_to_bytes" to "unquote", and witness the havoc that ensues.
Firstly, you break most Internet-related modules in the standard
library.<br><br><blockquote style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;" class="gmail_quote">10 tests failed:<br></blockquote><blockquote style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;" class="gmail_quote">
test_SimpleHTTPServer test_cgi test_email test_http_cookiejar<br></blockquote><blockquote style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;" class="gmail_quote"> test_httpservers test_robotparser test_urllib test_urllib2<br>
</blockquote><blockquote style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;" class="gmail_quote"> test_urllib2_localnet test_wsgiref<br></blockquote><br>Fixing
these isn't a matter of changing test cases (which all but one of my
fixes were). It would require changes to all the modules, to get them
to deal with bytes instead of strings (which would generally mean spraying .decode("utf-8") all over the place). My code, on the other hand,
"tends to be" compatible with 2.x code.<br><br>With the rename applied, I'm seeing:<br>BytesWarning: Comparison between bytes and string.<br>TypeError: expected an object with the buffer interface<br>http.client.BadStatusLine<br>
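To make the "spraying" concrete, here is a rough sketch of what a caller would look like in each case. The two helper functions are hypothetical, made up for illustration and not taken from any stdlib module; I'm assuming the patch's "unquote" and "unquote_to_bytes" are importable from urllib.parse:

```python
from urllib.parse import unquote, unquote_to_bytes

def path_segments_from_str(url_path):
    # With a str-returning unquote, 2.x-style string code keeps working.
    return [s for s in unquote(url_path).split("/") if s]

def path_segments_from_bytes(url_path):
    # With a bytes-returning unquote, every caller has to decode first.
    decoded = unquote_to_bytes(url_path).decode("utf-8")
    return [s for s in decoded.split("/") if s]
```

Both give the same answer, but the second form has to be repeated at every call site that wants string semantics.<br>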
<br>For another example, try this:<br>
<br>>>> import http.server<br>
>>> s = http.server.HTTPServer(('',8000), http.server.SimpleHTTPRequestHandler)<br>
>>> s.serve_forever()<br>
<br>
The current (unpatched) build works, but links to files with non-ASCII filenames (e.g. '<span style="font-weight: normal;"><span class="t_nihongo_kanji" lang="ja">漢字</span></span>') break, because the URL is not quoted and unquoted consistently. This is one example of my patch directly fixing a bug in real code. With my patch applied, the links work fine *because URL quoting and unquoting are consistent, and work on all Unicode characters*.<br>
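As a rough illustration of that round trip, here is a sketch assuming the patched functions are exposed as "quote", "unquote" and "unquote_to_bytes" in urllib.parse, with UTF-8 as the default encoding and an optional "encoding" argument for callers who know better:

```python
from urllib.parse import quote, unquote, unquote_to_bytes

name = "漢字"

quoted = quote(name)                 # percent-encodes the UTF-8 bytes
as_text = unquote(quoted)            # decodes back to a str (UTF-8 by default)
as_bytes = unquote_to_bytes(quoted)  # raw bytes, left for the caller to decode

# A caller that knows the bytes are in another encoding can say so explicitly.
other = unquote(quote(name, encoding="iso2022_jp"), encoding="iso2022_jp")
```

The point is that quote and unquote are inverses of each other for any Unicode filename, while the bytes variant remains available for code that genuinely needs it.<br>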
<br>If you change unquote to output a bytes, it breaks completely. You get a "TypeError: expected an object with the buffer interface" as soon as the user visits the page.<br><br>Matt<br></div>