On Wed, Jun 23, 2010 at 10:30 AM, Tres Seaver <span dir="ltr"><<a href="mailto:tseaver@palladion.com">tseaver@palladion.com</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<div class="im">
Stephen J. Turnbull wrote:<br>
<br>
> We do need str-based implementations of modules like urllib.<br>
<br>
</div><br>Why would that be? URLs aren't text, and never will be. The fact that<br>
to the eye they may seem to be text-ish doesn't make them text. This<br>
*is* a case where "don't make me think" is a losing proposition:<br>
programmers who work with URLs in any non-opaque way as text are<br>
eventually going to be bitten by this issue no matter how hard we wave<br>
our hands.<br></blockquote></div><br>HTML is text, and URLs are embedded in that text, so it's easy to end up with a URL that is text. With a little testing, though, I notice that the text alone can't tell you what the right URL really is (at least not the intended URL when unsafe characters are embedded in HTML).<br>
<br>To test I created two pages, one in Latin-1 and another in UTF-8, and put in the link:<br><br> ./test.html?param=Réunion<br><br>On the Latin-1 page it created a link to test.html?param=R%E9union and on the UTF-8 page it created a link to test.html?param=R%C3%A9union (the second link displays in the URL bar as test.html?param=Réunion but copies with percent encoding). If you link to ./Réunion.html, though, both pages create UTF-8 links. And both pages also link <a href="http://xn--runion-bva.com">http://Réunion.com</a> to <a href="http://xn--runion-bva.com/">http://xn--runion-bva.com/</a>.<br><br>So really neither bytes nor text works completely: query strings receive the encoding of the page, which would be handled transparently if you worked on the page's bytes, while path and domain are consistently encoded with UTF-8 and punycode respectively and so are best handled as text. And of course if the page uses a non-ASCII-compatible encoding you really must decode it before the URL is sensible.<br>
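<br>The three behaviours observed above can be reproduced roughly in Python 3. This is only a sketch of what the browser appeared to do, not a description of any browser's actual implementation; the helper name percent_encode_query_value is made up for illustration, and the host conversion relies on the built-in 'idna' codec (IDNA 2003).<br>

```python
from urllib.parse import quote


def percent_encode_query_value(value, page_encoding):
    # Hypothetical helper: percent-encode a query value using the page's
    # charset, which is what the test pages appeared to do.
    return quote(value, safe='', encoding=page_encoding)


latin1_query = percent_encode_query_value('Réunion', 'latin-1')
utf8_query = percent_encode_query_value('Réunion', 'utf-8')

# Paths were always UTF-8 regardless of the page encoding.
path = quote('/Réunion.html', safe='/', encoding='utf-8')

# Host names go through IDNA (nameprep + punycode), label by label.
host = 'Réunion.com'.encode('idna')

print(latin1_query)  # R%E9union
print(utf8_query)    # R%C3%A9union
print(path)          # /R%C3%A9union.html
print(host)          # b'xn--runion-bva.com'
```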
<br>Another issue here is that there's no "encoding" for turning a URL into bytes if the URL is not already ASCII. (Totally as an aside, as I remind myself of new module names I notice it's not easy to google specifically for Python 3 docs, e.g. "python 3 urlsplit" gives me 2.6 docs.)<br><br>A proper way to encode a URL would be:<br>
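<br>from urllib.parse import urlsplit<br>
<br>
def encode_http_url(url, page_encoding='ASCII', errors='strict'):<br>
&nbsp;&nbsp;&nbsp;&nbsp;scheme, netloc, path, query, fragment = urlsplit(url)<br>
&nbsp;&nbsp;&nbsp;&nbsp;scheme = scheme.encode('ASCII', errors)<br>
&nbsp;&nbsp;&nbsp;&nbsp;auth = port = None<br>
&nbsp;&nbsp;&nbsp;&nbsp;if '@' in netloc:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;auth, netloc = netloc.split('@', 1)<br>
&nbsp;&nbsp;&nbsp;&nbsp;if ':' in netloc:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;netloc, port = netloc.split(':', 1)<br>
&nbsp;&nbsp;&nbsp;&nbsp;# The 'idna' codec applies ToASCII to each dot-separated label;<br>
&nbsp;&nbsp;&nbsp;&nbsp;# encodings.idna.ToASCII itself only handles a single label.<br>
&nbsp;&nbsp;&nbsp;&nbsp;netloc = netloc.encode('idna')<br>
&nbsp;&nbsp;&nbsp;&nbsp;if port:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;netloc = netloc + b':' + port.encode('ASCII', errors)<br>
&nbsp;&nbsp;&nbsp;&nbsp;if auth:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;netloc = auth.encode('UTF-8', errors) + b'@' + netloc<br>
&nbsp;&nbsp;&nbsp;&nbsp;# Path and fragment are UTF-8; the query takes the page's encoding.<br>
&nbsp;&nbsp;&nbsp;&nbsp;path = path.encode('UTF-8', errors)<br>
&nbsp;&nbsp;&nbsp;&nbsp;query = query.encode(page_encoding, errors)<br>
&nbsp;&nbsp;&nbsp;&nbsp;fragment = fragment.encode('UTF-8', errors)<br>
&nbsp;&nbsp;&nbsp;&nbsp;return urlunsplit_bytes((scheme, netloc, path, query, fragment))<br>
<br>Here urlunsplit_bytes is a version of urlunsplit that handles bytes (urlunsplit does not). It's helpful for me at least to look at the urlunsplit code specifically:<br>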
<br>def urlunsplit(components):<br>
&nbsp;&nbsp;&nbsp;&nbsp;scheme, netloc, url, query, fragment = components<br>
&nbsp;&nbsp;&nbsp;&nbsp;if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if url and url[:1] != '/': url = '/' + url<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;url = '//' + (netloc or '') + url<br>
&nbsp;&nbsp;&nbsp;&nbsp;if scheme:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;url = scheme + ':' + url<br>
&nbsp;&nbsp;&nbsp;&nbsp;if query:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;url = url + '?' + query<br>
&nbsp;&nbsp;&nbsp;&nbsp;if fragment:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;url = url + '#' + fragment<br>
&nbsp;&nbsp;&nbsp;&nbsp;return url<br><br>In this case it really would be best to have Python 2's system where things are coerced to ASCII implicitly. Or, more specifically, if all those string literals in that routine could be implicitly converted to bytes using ASCII. Conceptually I think this is reasonable, as for URLs (at least with HTTP, but in practice I think this applies to all URLs) the ASCII bytes really do have meaning. That is, '/' (*in the context of urlunsplit*) really is \x2f specifically. Or another example, making a GET request really means sending the bytes \x47\x45\x54 and there is no other set of bytes that has that meaning. The WebSockets specification for instance defines things like "colon": <a href="http://tools.ietf.org/html/draft-hixie-thewebsocketprotocol-76#page-5">http://tools.ietf.org/html/draft-hixie-thewebsocketprotocol-76#page-5</a> -- in an earlier version they even used bytes to describe HTTP (<a href="http://tools.ietf.org/html/draft-hixie-thewebsocketprotocol-54#page-13">http://tools.ietf.org/html/draft-hixie-thewebsocketprotocol-54#page-13</a>), though this annoyed many people.<br>
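<br>To make the point concrete, here is roughly what urlunsplit_bytes could look like if every literal in that routine is written out as an ASCII bytes literal by hand. This is only a sketch: urlunsplit_bytes is not a standard library function, and USES_NETLOC is a made-up stand-in for the module's uses_netloc list, cut down to a few schemes.<br>

```python
# Hypothetical stand-in for urllib.parse.uses_netloc, as bytes.
USES_NETLOC = {b'http', b'https', b'ftp', b'ws', b'wss'}


def urlunsplit_bytes(components):
    # Same logic as urlunsplit, but every literal is an explicit bytes
    # literal, so all five components must be bytes as well.
    scheme, netloc, url, query, fragment = components
    if netloc or (scheme and scheme in USES_NETLOC and url[:2] != b'//'):
        if url and url[:1] != b'/':
            url = b'/' + url
        url = b'//' + (netloc or b'') + url
    if scheme:
        url = scheme + b':' + url
    if query:
        url = url + b'?' + query
    if fragment:
        url = url + b'#' + fragment
    return url


# The delimiters and the request method are specific byte values.
assert b'/' == b'\x2f'
assert b'GET' == b'\x47\x45\x54'

# A Latin-1 encoded query passes through untouched.
result = urlunsplit_bytes(
    (b'http', b'xn--runion-bva.com', b'/test.html', b'param=R\xe9union', b''))
print(result)  # b'http://xn--runion-bva.com/test.html?param=R\xe9union'
```

The only difference from the str version is the b prefix on each literal, which is exactly the conversion that could be implicit.<br>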
<br>-- <br>Ian Bicking | <a href="http://blog.ianbicking.org">http://blog.ianbicking.org</a><br>