[Python-ideas] RFC: bytestring as a str representation [was: a new bytestring type?]

Wed Jan 8 13:11:40 CET 2014

>>>>> INADA Naoki writes:

 > Let me share my experience of suffering because bytes doesn't have
 > %-format.
 > MySQL-python is one of the major DB-API 2.0 drivers for MySQL.
 > MySQL-python uses the 'format' paramstyle.

 > MySQL protocol is basically encoded text, but it may contain arbitrary
 > (escaped) binary.
 > Here is a simplified example that constructs real SQL from an SQL
 > format string and arguments. (Works only on Python 2.7)

'>' quote markers are omitted for clarity, and comments are deleted.

    def escape_string(s):
        return s.replace("'", "''")

    def convert(x):
        if isinstance(x, unicode):
            x = x.encode('utf-8')
        if isinstance(x, str):
            x = "'" + escape_string(x) + "'"
        else:
            x = str(x)
        return x

    def build_query(query, *args):
        if isinstance(query, unicode):
            query = query.encode('utf-8')
        return query % tuple(map(convert, args))

    textdata = b"hello"
    bindata = b"abc\xff\x00"
    query = "UPDATE table SET textcol=%s bincol=%s"

    print build_query(query, textdata, bindata)

 > I can't port this to Python 3.

Why not?  The obvious translation is

    # This is Python 3!!
    def escape_string(s):
        return s.replace("'", "''")

    def convert(x):
        if isinstance(x, bytes):
            x = escape_string(x.decode('ascii', errors='surrogateescape'))
            x = "'" + x + "'"
        else:
            x = str(x)
        return x

    def build_query(query, *args):
        query = query % tuple(map(convert, args))
        return query.encode('utf-8', errors='surrogateescape')

    textdata = "hello"
    bindata = b"abc\xff\x00"
    query = "UPDATE table SET textcol=%s bincol=%s"

    print(build_query(query, textdata, bindata))

The main issue I can think of that you might have with this is that
bindata has to be converted to and from a 16-bit representation, which
takes up unnecessary space and is relatively slow.  But it seems to me
that these are second-order costs compared to the other work an
adapter needs to do.  What am I missing?
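
To make the round trip concrete, here is a minimal sketch (Python 3
only) of what the surrogateescape error handler does to bindata.  The
query text and the variable names as_text and wire are illustrative
assumptions, not part of any real driver.

```python
# Python 3 sketch: bytes -> str -> bytes via surrogateescape.
bindata = b"abc\xff\x00"

# Non-ASCII bytes are smuggled into the str as lone surrogates
# (byte 0xff becomes U+DCFF); ASCII bytes decode unchanged.
as_text = bindata.decode("ascii", errors="surrogateescape")

# Ordinary str %-formatting works on the decoded value.
query = "UPDATE table SET bincol='%s'" % as_text

# Encoding with the same error handler turns each lone surrogate
# back into the original byte, so the binary payload is preserved.
wire = query.encode("utf-8", errors="surrogateescape")
```

The point of the sketch is only that the original bytes come back
unchanged after formatting; quoting and escaping are left out.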

If you have to handle many MB of binary data, or of text data with
non-ASCII characters, then with the proposed 'ascii-compatible'
representation,

    def convert(x):
        if isinstance(x, str):
            x = x.encode('utf-8').decode('ascii-compatible')
        elif isinstance(x, bytes):
            x = escape_string(x.decode('ascii-compatible'))
            x = "'" + x + "'"
        else:
            x = str(x)  # like 42
        return x

    def build_query(query, *args):
        query = convert(query) % tuple(map(convert, args))
        return query.encode('utf-8', errors='surrogateescape')

ensures that the '%' format operator is always dealing with 8-bit
representations only.  There might be a conversion from 16-bit to
8-bit for str, but there will be no conversions from 8-bit to 16-bit
representations.  I don't know if that makes '%' itself faster, but
it might.
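
The space cost of the 16-bit representation can be observed roughly as
follows.  This is a sketch that assumes CPython 3.3 or later, where
str storage is chosen per string; sys.getsizeof results are
implementation details.  The proposed 'ascii-compatible' codec does
not exist, so 'latin-1' is used here purely as a stand-in for "one
byte per character" and is not a substitute for the proposal.

```python
import sys

n = 100000
raw = b"\xff" * n  # binary data made only of non-ASCII bytes

# surrogateescape maps each byte to a lone surrogate (U+DCFF), which
# on CPython needs a 2-bytes-per-character str.
escaped = raw.decode("ascii", errors="surrogateescape")

# Stand-in for an 8-bit representation: latin-1 maps each byte to a
# code point below 256, which CPython stores in 1 byte per character.
narrow = raw.decode("latin-1")

size_escaped = sys.getsizeof(escaped)
size_narrow = sys.getsizeof(narrow)
print(size_escaped, size_narrow)
```

Both strings have the same length; only the per-character storage
differs.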