On Thu, Jun 24, 2010 at 3:59 PM, Guido van Rossum <guido@python.org> wrote:
> The protocol specs typically go out of their way to specify what byte
> values they use for syntactically significant positions (e.g. ':' in
> headers, or '/' in URLs), while hand-waving about the meaning of "what
> goes in between" since it is all typically treated as "not of
> syntactic significance". So you can write a parser that looks at bytes
> exclusively, and looks for a bunch of ASCII punctuation characters
> (e.g. '<', '>', '/', '&'), and doesn't know or care whether the stuff
> in between is encoded in Latin-15, MacRoman or UTF-8 -- it never looks
> "inside" stretches of characters between the special characters and
> just copies them. (Sometimes there may be *some* sections that are
> required to be ASCII and there equivalence of a-z and A-Z is well
> defined.)

Yes, these are the specific characters that I think we can handle specially. For instance, the list of all string literals used by urlsplit and urlunsplit:

'//'
'/'
':'
'?'
'#'
''
'http'
A list of all valid scheme characters (a-z etc)
Some lists for scheme-specific parsing (which all contain valid scheme characters)

All of these are constrained to ASCII, and must be constrained to ASCII, and everything else in a URL is treated as basically opaque.
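For concreteness, this is roughly what that looks like with the current str-only functions -- the delimiters above do all the work, and everything between them is copied through untouched:

    >>> from urllib.parse import urlsplit, urlunsplit
    >>> urlsplit('http://example.com/foo?bar=baz#frag')
    SplitResult(scheme='http', netloc='example.com', path='/foo', query='bar=baz', fragment='frag')
    >>> urlunsplit(('http', 'example.com', '/foo', 'bar=baz', 'frag'))
    'http://example.com/foo?bar=baz#frag'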
So if we turned these characters into byte-or-str objects I think we'd basically be true to the intent of the specs, and in a practical sense we'd be able to make these functions polymorphic. I suspect this same pattern will be present in most places where people want polymorphic behavior.
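Just to make that concrete, here's a very rough sketch (nothing like a finished design; the name "special" is only what I use below) of how one of these byte-or-str objects could dispatch:

    # Very rough sketch, just to illustrate the dispatch -- not a real design.
    # A byte-or-str literal: ASCII-only, combines with bytes or with str,
    # but never silently crosses between the two.
    class special:
        def __init__(self, text):
            text.encode('ascii')        # validation only: the literal itself must be pure ASCII
            self.text = text

        def __add__(self, other):
            if isinstance(other, special):
                return special(self.text + other.text)
            if isinstance(other, bytes):
                return self.text.encode('ascii') + other
            if isinstance(other, str):
                return self.text + other
            return NotImplemented       # anything else -> TypeError

        def __radd__(self, other):
            if isinstance(other, bytes):
                return other + self.text.encode('ascii')
            if isinstance(other, str):
                return other + self.text
            return NotImplemented

So special('/') + '/foo' gives the str '//foo', and b'/foo' + special('?') gives the bytes b'/foo?'; the result of either mix is plain bytes or plain str and never coerces any further.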
For now we could do something incomplete and just avoid using operators we can't overload (is it possible to at least make them produce a readable exception?)

I think we'll avoid a lot of the confusion that was present with Python 2 by not making the coercions transitive. For instance, here's something that would work in Python 2:
    urlunsplit(('http', 'example.com', '/foo', u'bar=baz', ''))

And you'd get out a unicode string, except that would break the first time that query string (u'bar=baz') was not ASCII (but not until then!)
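To spell out the kind of Python 2 behavior I mean (illustrative session only; exactly where it blows up depends on what you do with the result):

    >>> 'http://example.com/foo' + '?' + u'bar=baz'
    u'http://example.com/foo?bar=baz'
    >>> url = 'http://example.com/foo' + '?' + u'bar=caf\xe9'   # still "works" here
    >>> url.encode('ascii')   # ...until something downstream needs ASCII bytes
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 30: ordinal not in range(128)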
Here's the urlunsplit code:

    def urlunsplit(components):
        scheme, netloc, url, query, fragment = components
        if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):
            if url and url[:1] != '/': url = '/' + url
            url = '//' + (netloc or '') + url
        if scheme:
            url = scheme + ':' + url
        if query:
            url = url + '?' + query
        if fragment:
            url = url + '#' + fragment
        return url

If all those literals were this new special kind of string, then if you call:

    urlunsplit((b'http', b'example.com', b'/foo', 'bar=baz', b''))
You'd end up constructing the URL b'http://example.com/foo' and then running:

    url = url + special('?') + query

And that would fail because b'http://example.com/foo' + special('?') would be b'http://example.com/foo?' and you cannot add that to the str 'bar=baz'. So we'd be avoiding the Python 2 craziness.
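With the rough special sketch from earlier in this message, that plays out as:

    >>> url = b'http://example.com/foo'
    >>> url = url + special('?')        # bytes + special stays bytes
    >>> url
    b'http://example.com/foo?'
    >>> url + 'bar=baz'                 # bytes + str: no implicit coercion
    Traceback (most recent call last):
      ...
    TypeError: can't concat str to bytes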
-- 
Ian Bicking | http://blog.ianbicking.org