<br><br><div class="gmail_quote">On Mon, Jul 14, 2008 at 4:54 AM, André Malo <<a href="mailto:nd@perlig.de">nd@perlig.de</a>> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div class="Ih2E3d"><br>
</div>Ahem. The HTTP standard does ;-)<br>
</blockquote><div><br>Really? Can you include
a quotation please? The HTTP standard talks a lot about ISO-8859-1
(Latin-1) in terms of the actual raw encoded bytes, but not in terms of
URI percent-encoding (a different issue) as far as I can tell.<br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="Ih2E3d">
<br>
> Where web forms are concerned, basically setting the form accept-charset<br>
> or the page charset is the *maximum amount* of control you have over the<br>
> encoding. As you say, it can be encoded by another page or the user can<br>
> override their settings. Then what can you do as the server? Nothing ...<br>
<br>
</div>Guessing works pretty well in most of the cases.<br>
</blockquote><div><br>Are you suggesting that
urllib.unquote guess the encoding? It could do that but it would make
things rather unpredictable. If this were an application (such as a web
browser), guessing would be OK. But this is a library function. Library
functions should not make arbitrary decisions; they should be
well-specified.<br><br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Latin-1 is not exactly arbitrary. Besides being a charset - it maps<br>
one-to-one to octet values, hence it's commonly used to encode octets and<br>
is therefore a better fallback than every other encoding.<br>
</blockquote><div><br>True. So the only
advantage I see to the current implementation is that if you really
want to, you can take the Latin-1-decoded URI (from unquote) and
explicitly encode it as Latin-1 and then decode it again as whatever
encoding you want. But that would be a hack, would it not? I'd prefer
if the library didn't require a hack just to handle the extremely common
use case (UTF-8).<br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="Ih2E3d"><br>
> I agree. However if there *was* a proper standard we wouldn't have to<br>
> argue! "Most proper" and "should do" is the most confident we can be when<br>
> dealing with this standard, as there is no correct encoding.<br>
<br>
</div>Well, the standard says, there are octets to be encoded. I find that proper<br>
enough.</blockquote><div><br>Yes
but unfortunately we aren't talking about octets any more in Python 3,
but characters. If we're going to follow the standard and encode
octets, then we should be accepting (for quote) and returning (for
unquote) bytes objects, not strings. But as that's going to break most
existing code and be extremely confusing, I think it's best we try and
solve this problem for Unicode strings.<br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">> Does anyone have a suggestion which will be more compatible with the rest<br>
<div class="Ih2E3d">
> of the world than allowing the user to select an encoding, and defaulting<br>
> to "utf-8"?<br>
<br>
</div>Default to latin-1 for decoding and utf-8 for encoding. This might be<br>
confusing though, so maybe you've asked the wrong question ;)<br></blockquote></div><br>:o
that would break so so much existing code, not to mention being
horribly inconsistent and confusing. Having said that, that's almost
what the current behaviour is (quote uses Latin-1 for characters <
256, and UTF-8 for characters above; unquote uses Latin-1).<br><br>Again
I bring up the http server example. If you go to a directory, create a
file with a name such as '漢字', and then run this code in Python 3.0
from that directory:<br><pre>import http.server<br>s = http.server.HTTPServer(('',8000),<br> http.server.SimpleHTTPRequestHandler)<br>s.serve_forever()<br></pre>You'll
see the file in the directory listing - its HTML will be <a
href="%E6%BC%A2%E5%AD%97">漢字</a>. But if you click it, you get
a 404 because the server will look for the file named
unquote("%E6%BC%A2%E5%AD%97") = 'æ¼¢å\xad\x97'.<br>
<br>
If you apply my patch (patch5) *everything* *just* *works*.<br>
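<br>
A minimal sketch of the mismatch, assuming a Python 3 release in which urllib.parse.unquote accepts an encoding argument (the variable names are only for illustration). Decoding the percent-encoded octets as Latin-1 gives the name the server ends up looking for; decoding them as UTF-8 gives the real filename. The Latin-1 round trip mentioned earlier also recovers it, at the cost of an extra step:<br>
<pre>
```python
from urllib.parse import unquote

href = "%E6%BC%A2%E5%AD%97"  # the link from the directory listing

# Latin-1 maps each octet to one character, so the filename is mangled.
as_latin1 = unquote(href, encoding="latin-1")

# UTF-8 recovers the original filename.
as_utf8 = unquote(href, encoding="utf-8")

# The "hack": re-encode the Latin-1 result to octets, then decode as UTF-8.
recovered = as_latin1.encode("latin-1").decode("utf-8")

print(ascii(as_latin1))
print(as_utf8 == recovered)
```
</pre>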
<br><br><div class="gmail_quote">On Mon, Jul 14, 2008 at 6:36 AM, Bill Janssen <<a href="mailto:janssen@parc.com">janssen@parc.com</a>> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div class="Ih2E3d">> Ah there may be some confusion here. We're only dealing with str->str<br>
> transformations (which in Python 3 means Unicode strings). You can't put a<br>
> bytes in or get a bytes out of either of these functions. I suggested a<br>
> "quote_raw" and "unquote_raw" function which would let you do this.<br>
<br>
</div>Ah, well, that's a problem. Clearly the unquote is str->bytes, while<br>
the quote is (bytes OR str)->str.<br>
<font color="#888888"></font></blockquote><div><br>OK, so for quote you're suggesting that we accept either a bytes or a str object. That sounds quite reasonable (though neither the unpatched nor the patched version accepts bytes at the moment). I'd simply change the code in quote (from patch5) to do this:<br>
<br><font size="2"><span style="font-family: courier new,monospace;">if isinstance(s, str):</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> s = s.encode(encoding, errors)</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">....</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">res = map(quoter, s)</span></font><br><br>Now you get this behaviour by default (which may appear confusing but I'd argue correct given the different semantics of 'h\xfcllo' and b'h\xfcllo'):<br>
<br>>>> urllib.parse.quote(b'h\xfcllo')<br>'h%FCllo' # Directly-encoded octets<br>>>> urllib.parse.quote('h\xfcllo')<br>'h%C3%BCllo' # UTF-8 encoded string, then encoded octets<br>
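<br>
The isinstance check can be sketched as a self-contained function. This is not the patch5 code: quote_sketch and its ALWAYS_SAFE set are hypothetical, chosen only to show that bytes are percent-encoded directly while a str is first encoded (UTF-8 by default here):<br>
<pre>
```python
# Hypothetical stand-in for the patched quote(); not the real urllib code.
ALWAYS_SAFE = (b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
               b"abcdefghijklmnopqrstuvwxyz"
               b"0123456789_.-")


def quote_sketch(s, safe="/", encoding="utf-8", errors="strict"):
    # A str is turned into octets first; bytes are used as they are.
    if isinstance(s, str):
        s = s.encode(encoding, errors)
    keep = ALWAYS_SAFE + safe.encode("ascii")
    # Iterating over bytes yields ints, one per octet.
    return "".join(chr(b) if b in keep else "%{:02X}".format(b) for b in s)


print(quote_sketch(b"h\xfcllo"))  # h%FCllo
print(quote_sketch("h\xfcllo"))   # h%C3%BCllo
```
</pre>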
<br><blockquote style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;" class="gmail_quote">Clearly the unquote is str->bytes, <snip> You can't pass a Unicode string back<br>
as the result of unquote *without* passing in an encoding specifier,<br>
because the character set is application-specific.<br></blockquote><br>So for unquote you're suggesting that it always return a bytes object UNLESS an encoding is specified? As in:<br><br>>>> urllib.parse.unquote('h%C3%BCllo')<br>
b'h\xc3\xbcllo'<br><br>I would object to that on two grounds. Firstly, I wouldn't expect or want a bytes object. The vast majority of uses of unquote will be to get a character string out, not bytes. Secondly, there is a mountain of code (including about 12 modules in the standard library) that calls unquote and doesn't give the user the encoding option, so it's best if we pick a default that matches what the majority of users will expect. I argue that that's UTF-8.<br>
<br>I'd prefer having a separate unquote_raw function which is str->bytes, while the unquote function keeps the role it has always had, which is str->str. But I agree on quote: I think it can be (bytes OR str)->str.<br>
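<br>
For comparison, a short sketch of that split, assuming a released Python 3 in which the bytes-returning function exists under the name urllib.parse.unquote_to_bytes (not the unquote_raw name suggested here) and unquote returns a str with a caller-selectable encoding:<br>
<pre>
```python
from urllib.parse import unquote, unquote_to_bytes

# Returns bytes: no charset is assumed.
raw = unquote_to_bytes("h%C3%BCllo")

# Returns str: decoded as UTF-8 unless the caller says otherwise.
text = unquote("h%C3%BCllo")

# The caller can still pick another charset explicitly.
legacy = unquote("h%FCllo", encoding="latin-1")

print(raw)
print(text == legacy)
```
</pre>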
<br>Matt<br></div></div>