<br><br><div class="gmail_quote">On Mon, Jul 14, 2008 at 4:54 AM, André Malo &lt;<a href="mailto:nd@perlig.de">nd@perlig.de</a>&gt; wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="Ih2E3d"><br>

</div>Ahem. The HTTP standard does ;-)<br>

</blockquote><div><br>Really? Can you include

a quotation please? The HTTP standard talks a lot about ISO-8859-1

(Latin-1) in terms of the actual raw encoded bytes, but not in terms of

URI percent-encoding (a different issue) as far as I can tell.<br>&nbsp;</div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="Ih2E3d">

<br>

&gt; Where web forms are concerned, basically setting the form accept-charset<br>

&gt; or the page charset is the *maximum amount* of control you have over the<br>

&gt; encoding. As you say, it can be encoded by another page or the user can<br>

&gt; override their settings. Then what can you do as the server? Nothing ...<br>

<br>

</div>Guessing works pretty well in most of the cases.<br>

</blockquote><div><br>Are you suggesting that

urllib.unquote guess the encoding? It could do that but it would make

things rather unpredictable. If this were an application (such as a web

browser), guessing would be OK. But this is a library function. Library

functions should not make arbitrary decisions; they should be

well-specified.<br><br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Latin-1 is not exactly arbitrary. Besides being a charset - it maps<br>


one-to-one to octet values, hence it&#39;s commonly used to encode octets and<br>

is therefore a better fallback than every other encoding.<br>

</blockquote><div><br>True. So the only

advantage I see to the current implementation is that if you really

want to, you can take the Latin-1-decoded URI (from unquote) and

explicitly encode it as Latin-1 and then decode it again as whatever

encoding you want. But that would be a hack, would it not? I&#39;d prefer

if the library didn&#39;t require a hack just to get the extremely common

use case (UTF-8).<br>&nbsp;</div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="Ih2E3d"><br>

&gt; I agree. However if there *was* a proper standard we wouldn&#39;t have to<br>

&gt; argue! &quot;Most proper&quot; and &quot;should do&quot; is the most confident we can be when<br>

&gt; dealing with this standard, as there is no correct encoding.<br>

<br>

</div>Well, the standard says, there are octets to be encoded. I find that proper<br>

enough.</blockquote><div><br>Yes

but unfortunately we aren&#39;t talking about octets any more in Python 3,

but characters. If we&#39;re going to follow the standard and encode

octets, then we should be accepting (for quote) and returning (for

unquote) bytes objects, not strings. But as that&#39;s going to break most

existing code and be extremely confusing, I think it&#39;s best we try and

solve this problem for Unicode strings.<br>&nbsp;</div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">&gt; Does anyone have a suggestion which will be more compatible with the rest<br>

<div class="Ih2E3d">

&gt; of the world than allowing the user to select an encoding, and defaulting<br>

&gt; to &quot;utf-8&quot;?<br>

<br>

</div>Default to latin-1 for decoding and utf-8 for encoding. This might be<br>

confusing though, so maybe you&#39;ve asked the wrong question ;)<br></blockquote></div><br>:o

that would break so so much existing code, not to mention being

horribly inconsistent and confusing. Having said that, that&#39;s almost

what the current behaviour is (quote uses Latin-1 for characters &lt;

256, and UTF-8 for characters above; unquote uses Latin-1).<br><br>Again

I bring up the http server example. If you go to a directory, create a

file with a name such as &#39;漢字&#39;, and then run this code in Python 3.0

from that directory:<br><pre>import http.server<br>s = http.server.HTTPServer((&#39;&#39;,8000),<br>        http.server.SimpleHTTPRequestHandler)<br>s.serve_forever()<br></pre>You&#39;ll

see the file in the directory listing - its HTML will be &lt;a

href=&quot;%E6%BC%A2%E5%AD%97&quot;&gt;漢字&lt;/a&gt;. But if you click it, you get

a 404 because the server will look for the file named

unquote(&quot;%E6%BC%A2%E5%AD%97&quot;) = &#39;æ¼¢å\xad\x97&#39;.<br>

<br>

If you apply my patch (patch5) *everything* *just* *works*.<br>
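<br>A minimal sketch of the failure and of the Latin-1 workaround, assuming an unquote that accepts an encoding argument (the keyword name and the variable names below are assumptions for illustration, not necessarily what the patch uses):<br>
<pre>
```python
import urllib.parse

# The href that appears in the directory listing.
quoted = "%E6%BC%A2%E5%AD%97"

# Decoding the octets as Latin-1 gives the mangled name the server looks up.
as_latin1 = urllib.parse.unquote(quoted, encoding="latin-1")

# Decoding the same octets as UTF-8 gives the real file name.
as_utf8 = urllib.parse.unquote(quoted, encoding="utf-8")

# The workaround: encode the Latin-1 result to get the original octets
# back, then decode those octets as UTF-8.
recovered = as_latin1.encode("latin-1").decode("utf-8")

print(repr(as_latin1))
print(recovered == as_utf8)
```
</pre>
The round trip only works because Latin-1 maps every octet to exactly one character, so nothing is lost on the way back.<br>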

<br><br><div class="gmail_quote">On Mon, Jul 14, 2008 at 6:36 AM, Bill Janssen &lt;<a href="mailto:janssen@parc.com">janssen@parc.com</a>&gt; wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="Ih2E3d">&gt; Ah there may be some confusion here. We&#39;re only dealing with str-&gt;str<br>

&gt; transformations (which in Python 3 means Unicode strings). You can&#39;t put a<br>

&gt; bytes in or get a bytes out of either of these functions. I suggested a<br>

&gt; &quot;quote_raw&quot; and &quot;unquote_raw&quot; function which would let you do this.<br>

<br>

</div>Ah, well, that&#39;s a problem. &nbsp;Clearly the unquote is str-&gt;bytes, while<br>

the quote is (bytes OR str)-&gt;str.<br>

</blockquote><div><br>OK so for quote, you&#39;re suggesting that we accept either a bytes or a str object. That sounds quite reasonable (though neither the unpatched nor the patched version accepts bytes at the moment). I&#39;d simply change the code in quote (from patch5) to do this:<br>

<br><font size="2"><span style="font-family: courier new,monospace;">if isinstance(s, str):</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">&nbsp;&nbsp;&nbsp; s = s.encode(encoding, errors)</span><br style="font-family: courier new,monospace;">

<span style="font-family: courier new,monospace;">....</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">res = map(quoter, s)</span></font><br><br>Now you get this behaviour by default (which may appear confusing, but which I&#39;d argue is correct given the different semantics of &#39;h\xfcllo&#39; and b&#39;h\xfcllo&#39;):<br>

<br>&gt;&gt;&gt; urllib.parse.quote(b&#39;h\xfcllo&#39;)<br>&#39;h%FCllo&#39;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # Directly-encoded octets<br>&gt;&gt;&gt; urllib.parse.quote(&#39;h\xfcllo&#39;)<br>&#39;h%C3%BCllo&#39;&nbsp;&nbsp;&nbsp;&nbsp; # UTF-8 encoded string, then encoded octets<br>
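<br>To make the idea concrete, here is a rough self-contained sketch of a quote that takes either type. It is not the patch itself: the name quote_sketch, the ALWAYS_SAFE set, and the default arguments are all assumptions for illustration.<br>
<pre>
```python
import string

# Hypothetical set of octets that never need escaping (an assumption,
# not taken from the patch).
ALWAYS_SAFE = frozenset(
    (string.ascii_letters + string.digits + "_.-").encode("ascii")
)


def quote_sketch(s, safe="/", encoding="utf-8", errors="strict"):
    """Percent-encode s, which may be a str or a bytes object."""
    if isinstance(s, str):
        # Character strings are turned into octets first.
        s = s.encode(encoding, errors)
    safe_octets = ALWAYS_SAFE | frozenset(safe.encode("ascii"))
    # Iterating over a bytes object yields integers in Python 3.
    return "".join(
        chr(octet) if octet in safe_octets else "%{:02X}".format(octet)
        for octet in s
    )


print(quote_sketch(b"h\xfcllo"))
print(quote_sketch("h\xfcllo"))
```
</pre>
A bytes argument skips the encoding step entirely, which is why the two calls give different results for what looks like the same text.<br>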

<br><blockquote style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;" class="gmail_quote">Clearly the unquote is str-&gt;bytes, &lt;snip&gt; You can&#39;t pass a Unicode string back<br>


as the result of unquote *without* passing in an encoding specifier,<br>

because the character set is application-specific.<br></blockquote><br>So for unquote you&#39;re suggesting that it always return a bytes object UNLESS an encoding is specified? As in:<br><br>&gt;&gt;&gt; urllib.parse.unquote(&#39;h%C3%BCllo&#39;)<br>

b&#39;h\xc3\xbcllo&#39;<br><br>I would object to that on two grounds. Firstly, I wouldn&#39;t expect or desire a bytes object. The vast majority of uses for unquote will be to get a character string out, not bytes. Secondly, there is a mountain of code (including about 12 modules in the standard library) that calls unquote without giving the user an encoding option, so it&#39;s best if we pick the default that the majority of users will expect. I argue that that&#39;s UTF-8.<br>

<br>I&#39;d prefer having a separate unquote_raw function which is str-&gt;bytes, while the unquote function keeps the role it has always had, which is str-&gt;str. But I agree on quote: I think it can be (bytes OR str)-&gt;str.<br>
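<br>A rough sketch of how the two functions could fit together. The names unquote_raw_sketch and unquote_sketch, and the errors default, are assumptions for illustration rather than the patch's actual code.<br>
<pre>
```python
import re

# Matches one percent-escape: a percent sign followed by two hex digits.
_ESCAPE = re.compile(rb"%([0-9A-Fa-f]{2})")


def unquote_raw_sketch(s):
    """str to bytes: turn %XX escapes into octets, with no charset decision."""
    # Characters outside the escapes are carried over as UTF-8 octets
    # (an assumption; a valid URI contains only ASCII here anyway).
    raw = s.encode("utf-8")
    return _ESCAPE.sub(
        lambda m: bytes.fromhex(m.group(1).decode("ascii")), raw
    )


def unquote_sketch(s, encoding="utf-8", errors="replace"):
    """str to str: decode the raw octets with a caller-selectable encoding."""
    return unquote_raw_sketch(s).decode(encoding, errors)


print(unquote_raw_sketch("h%C3%BCllo"))
print(unquote_sketch("h%C3%BCllo"))
print(unquote_sketch("h%FCllo", encoding="latin-1"))
```
</pre>
Callers that need an application-specific character set can take the bytes from the raw function, while existing callers of unquote keep getting a str.<br>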

<br>Matt<br></div></div>