On Thu, Jun 24, 2010 at 3:59 PM, Guido van Rossum <guido@python.org> wrote:
> The protocol specs typically go out of their way to specify what byte
> values they use for syntactically significant positions (e.g. ':' in
> headers, or '/' in URLs), while hand-waving about the meaning of "what
> goes in between" since it is all typically treated as "not of
> syntactic significance". So you can write a parser that looks at bytes
> exclusively, and looks for a bunch of ASCII punctuation characters
> (e.g. '<', '>', '/', '&'), and doesn't know or care whether the stuff
> in between is encoded in Latin-15, MacRoman or UTF-8 -- it never looks
> "inside" stretches of characters between the special characters and
> just copies them. (Sometimes there may be *some* sections that are
> required to be ASCII and there equivalence of a-z and A-Z is well
> defined.)

Yes, these are the specific characters that I think we can handle specially. For instance, the list of all string literals used by urlsplit and urlunsplit:

'//'
'/'
':'
'?'
'#'
''
'http'
A list of all valid scheme characters (a-z etc)
Some lists for scheme-specific parsing (which all contain valid scheme characters)

All of these are constrained to ASCII, and must be constrained to ASCII, and everything else in a URL is treated as basically opaque.
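For concreteness, this is roughly what that looks like with the current str-only functions -- the delimiters above do all the work, and everything between them is copied through untouched:

    >>> from urllib.parse import urlsplit, urlunsplit
    >>> urlsplit('http://example.com/foo?bar=baz#frag')
    SplitResult(scheme='http', netloc='example.com', path='/foo', query='bar=baz', fragment='frag')
    >>> urlunsplit(('http', 'example.com', '/foo', 'bar=baz', 'frag'))
    'http://example.com/foo?bar=baz#frag'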
So if we turned these characters into byte-or-str objects I think we'd basically be true to the intent of the specs, and in a practical sense we'd be able to make these functions polymorphic. I suspect this same pattern will be present in most places where people want polymorphic behavior.
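Just to make that concrete, here's a very rough sketch (nothing like a finished design; the name "special" is only what I use below) of how one of these byte-or-str objects could dispatch:

    # Very rough sketch, just to illustrate the dispatch -- not a real design.
    # A byte-or-str literal: ASCII-only, combines with bytes or with str,
    # but never silently crosses between the two.
    class special:
        def __init__(self, text):
            text.encode('ascii')        # validation only: the literal itself must be pure ASCII
            self.text = text

        def __add__(self, other):
            if isinstance(other, special):
                return special(self.text + other.text)
            if isinstance(other, bytes):
                return self.text.encode('ascii') + other
            if isinstance(other, str):
                return self.text + other
            return NotImplemented       # anything else -> TypeError

        def __radd__(self, other):
            if isinstance(other, bytes):
                return other + self.text.encode('ascii')
            if isinstance(other, str):
                return other + self.text
            return NotImplemented

So special('/') + '/foo' gives the str '//foo', and b'/foo' + special('?') gives the bytes b'/foo?'; the result of either mix is plain bytes or plain str and never coerces any further.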
For now we could do something incomplete and just avoid using operators we can't overload (is it possible to at least make them produce a readable exception?)

I think we'll avoid a lot of the confusion that was present with Python 2 by not making the coercions transitive. For instance, here's something that would work in Python 2:
    urlunsplit(('http', 'example.com', '/foo', u'bar=baz', ''))

And you'd get out a unicode string, except that would break the first time that query string (u'bar=baz') was not ASCII (but not until then!)
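To spell out the kind of Python 2 behavior I mean (illustrative session only; exactly where it blows up depends on what you do with the result):

    >>> 'http://example.com/foo' + '?' + u'bar=baz'
    u'http://example.com/foo?bar=baz'
    >>> url = 'http://example.com/foo' + '?' + u'bar=caf\xe9'   # still "works" here
    >>> url.encode('ascii')   # ...until something downstream needs ASCII bytes
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 30: ordinal not in range(128)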
Here's the urlunsplit code:

    def urlunsplit(components):
        scheme, netloc, url, query, fragment = components
        if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):
            if url and url[:1] != '/': url = '/' + url
            url = '//' + (netloc or '') + url
        if scheme:
            url = scheme + ':' + url
        if query:
            url = url + '?' + query
        if fragment:
            url = url + '#' + fragment
        return url

If all those literals were this new special kind of string, then if you call:

    urlunsplit((b'http', b'example.com', b'/foo', 'bar=baz', b''))
You'd end up constructing the URL b'http://example.com/foo' and then running:

    url = url + special('?') + query

And that would fail because b'http://example.com/foo' + special('?') would be b'http://example.com/foo?' and you cannot add that to the str 'bar=baz'. So we'd be avoiding the Python 2 craziness.
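With the rough special sketch from earlier in this message, that plays out as:

    >>> url = b'http://example.com/foo'
    >>> url = url + special('?')        # bytes + special stays bytes
    >>> url
    b'http://example.com/foo?'
    >>> url + 'bar=baz'                 # bytes + str: no implicit coercion
    Traceback (most recent call last):
      ...
    TypeError: can't concat str to bytes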
-- 
Ian Bicking | http://blog.ianbicking.org