[Python-Dev] thoughts on the bytes/string discussion

Ian Bicking ianb at colorstudy.com
Thu Jun 24 23:44:12 CEST 2010


On Thu, Jun 24, 2010 at 3:59 PM, Guido van Rossum <guido at python.org> wrote:

> The protocol specs typically go out of their way to specify what byte
> values they use for syntactically significant positions (e.g. ':' in
> headers, or '/' in URLs), while hand-waving about the meaning of "what
> goes in between" since it is all typically treated as "not of
> syntactic significance". So you can write a parser that looks at bytes
> exclusively, and looks for a bunch of ASCII punctuation characters
> (e.g. '<', '>', '/', '&'), and doesn't know or care whether the stuff
> in between is encoded in Latin-15, MacRoman or UTF-8 -- it never looks
> "inside" stretches of characters between the special characters and
> just copies them. (Sometimes there may be *some* sections that are
> required to be ASCII and there equivalence of a-z and A-Z is well
> defined.)
>

Yes, these are the specific characters that I think we can handle
specially.  For instance, here is the list of all string literals used by
urlsplit and urlunsplit:
'//'
'/'
':'
'?'
'#'
''
'http'
A list of all valid scheme characters (a-z etc)
Some lists for scheme-specific parsing (which all contain valid scheme
characters)

All of these are constrained to ASCII, and must be constrained to ASCII, and
everything else in a URL is treated as basically opaque.
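
To illustrate the "opaque" part, here is a minimal sketch (split_query is a
made-up helper for this example, not anything in urlparse) that splits a bytes
URL on an ASCII delimiter and never decodes what lies in between.  It assumes
an ASCII-compatible encoding, so that '?' is always the single byte 0x3F:

```python
def split_query(url):
    """Split a bytes URL at the first b'?' without decoding anything.

    Hypothetical helper for illustration only; not part of urlparse.
    """
    head, _sep, query = url.partition(b'?')
    return head, query


# The same text encoded two different ways; the parser does not care which.
text = 'http://example.com/caf\u00e9?q=1'
utf8_url = text.encode('utf-8')
latin1_url = text.encode('latin-1')

utf8_parts = split_query(utf8_url)
latin1_parts = split_query(latin1_url)
```

The non-ASCII bytes in the path are copied through untouched in both cases;
only the position of the ASCII '?' matters.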

So if we turned these characters into byte-or-str objects I think we'd
basically be true to the intent of the specs, and in a practical sense we'd
be able to make these functions polymorphic.  I suspect this same pattern
will be present most places where people want polymorphic behavior.

For now we could do something incomplete and just avoid using operators we
can't overload (is it possible to at least make them produce a readable
exception?).

I think we'll avoid a lot of the confusion that was present with Python 2 by
not making the coercions transitive.  For instance, here's something that
would work in Python 2:

  urlunsplit(('http', 'example.com', '/foo', u'bar=baz', ''))

And you'd get out a unicode string, except that would break the first time
that query string (u'bar=baz') was not ASCII (but not until then!)

Here's the urlunsplit code:

def urlunsplit(components):
    scheme, netloc, url, query, fragment = components
    if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):
        if url and url[:1] != '/': url = '/' + url
        url = '//' + (netloc or '') + url
    if scheme:
        url = scheme + ':' + url
    if query:
        url = url + '?' + query
    if fragment:
        url = url + '#' + fragment
    return url

If all those literals were this new special kind of string, and you called:

  urlunsplit((b'http', b'example.com', b'/foo', 'bar=baz', b''))

You'd end up constructing the URL b'http://example.com/foo' and then
running:

    url = url + special('?') + query

And that would fail because b'http://example.com/foo' + special('?') would
be b'http://example.com/foo?' and you cannot add that to the str 'bar=baz'.
So we'd be avoiding the Python 2 craziness.
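
A rough sketch of what such a type might look like, to make the idea
concrete.  Everything here is an assumption for illustration: the class body,
the _like helper, and the choice to support only + are mine, and this is not
a proposal for the actual implementation.  The literal holds ASCII text and
takes on the type of whatever it is added to, so the result is always plain
bytes or plain str and no further coercion happens:

```python
class special:
    """Hypothetical ASCII-only literal that adapts to bytes or str."""

    def __init__(self, text):
        text.encode('ascii')  # raises UnicodeEncodeError for non-ASCII text
        self._text = text

    def _like(self, other):
        # Return this literal in the same type as `other`, or None.
        if isinstance(other, bytes):
            return self._text.encode('ascii')
        if isinstance(other, str):
            return self._text
        return None

    def __add__(self, other):
        mine = self._like(other)
        return NotImplemented if mine is None else mine + other

    def __radd__(self, other):
        mine = self._like(other)
        return NotImplemented if mine is None else other + mine


# The result is ordinary bytes, so the special-ness does not propagate.
url = b'http://example.com/foo' + special('?')

try:
    url + 'bar=baz'
except TypeError:
    mixed_failed = True   # bytes + str is rejected, as described above
else:
    mixed_failed = False
```

Because bytes and str only define concatenation as sequences, Python tries
the right-hand operand's __radd__ first, which is what lets the literal adapt
to the left-hand operand here.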

-- 
Ian Bicking  |  http://blog.ianbicking.org