[Python-ideas] RFC: bytestring as a str representation [was: a new bytestring type?]

Wed Jan 8 16:46:14 CET 2014

>>>>> INADA Naoki writes:
 > On Wed, Jan 8, 2014 at 7:34 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:
 >> INADA Naoki <songofacandy at gmail.com> wrote:

 > Some encoding doesn't ensure roundtrip.

In that case, in Python 2 you're depending on all "text" being encoded
in the same encoding.  And even then you may be in trouble:

    def convert(x):
        if isinstance(x, unicode):
            x = x.encode(round_trip_not_guaranteed)

could cause your query to fail when it should succeed.  'x' is
user-supplied data, so you have no control over that.

 > I may be able to use ascii for decoding when mysql uses an ascii-compatible
 > encoding.

You can *always* use 'ascii', 'latin1', or 'utf-8' with
'surrogateescape' for decoding, and roundtrip is guaranteed.
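
To make the round-trip claim concrete, here is a minimal sketch that
pushes every possible byte value through each of the three codecs.  The
helper name round_trips is mine, purely for illustration; it assumes the
same codec and error handler are used in both directions.

```python
# Sketch: arbitrary bytes survive decode -> encode when the same codec
# and the 'surrogateescape' error handler are used both ways.
blob = bytes(range(256))  # every possible byte value, 0x00 through 0xff


def round_trips(data, encoding):
    """Return True if data comes back unchanged (hypothetical helper)."""
    text = data.decode(encoding, errors='surrogateescape')
    return text.encode(encoding, errors='surrogateescape') == data


results = {enc: round_trips(blob, enc)
           for enc in ('ascii', 'latin1', 'utf-8')}
print(results)

# The blob from the query example: the undecodable byte 0xff becomes
# the lone surrogate U+DCFF, everything else stays ordinary ASCII.
escaped = b"abc\xff\x00".decode('ascii', errors='surrogateescape')
```

With 'latin1' the error handler is never actually invoked, since every
byte already maps to a code point.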

 > But I think decode/encode with surrogateescape is not only slow,

Evidence?  Especially as compared with the connection overhead of the
DBMS?
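
If somebody wants to collect that evidence, something along these lines
would be a starting point.  This is only a sketch: the 1 KiB blob and
the iteration count are arbitrary assumptions of mine, and it measures
the codec round trip alone, not anything on the DBMS side.

```python
import timeit

# Hypothetical 1 KiB binary parameter; half of its bytes are non-ASCII
# and therefore go through the error handler.
blob = bytes(range(256)) * 4


def roundtrip(data=blob):
    """Decode and re-encode the blob via 'surrogateescape'."""
    text = data.decode('ascii', errors='surrogateescape')
    return text.encode('ascii', errors='surrogateescape')


def measure(number=1000):
    """Average seconds per round trip over `number` calls."""
    return timeit.timeit(roundtrip, number=number) / number


per_call = measure()
print("surrogateescape round trip: %.1f microseconds per call"
      % (per_call * 1e6))
```

The figure to compare it with is the time of one query round trip on
the connection in question.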

 > but also dangerous when using encoding except ascii or utf8.

Or latin1.

But here's your code as translated to Python 3.3, assuming a
connection encoding of Shift JIS:

    # unchanged source, but this is Python 3 str == Unicode
    def escape_string(s):
        return s.replace("'", "''")

    def convert(x):
        if isinstance(x, str):                # Correct type unicode->str
            x = "'" + escape_string(x) + "'"
        elif isinstance(x, bytes):            # Correct type str->bytes
            # SAFE: ASCII is a Unicode subset, RT guaranteed.
            x = x.decode('ascii', errors='surrogateescape')
            x = "'" + escape_string(x) + "'"
        else:
            x = str(x)
        return x

    def build_query(query, *args):
        if isinstance(query, bytes):
            # want str for the format operator
            query = query.decode('sjis')
        query = query % tuple(map(convert, args))
        # CORRECT: for ASCII-compatible encodings, including Shift
        # JIS and Big 5, since the binary blob doesn't contain any
        # non-ASCII characters and the non-character bytes 128-255
        # will be restored properly by the error handler.
        return query.encode('sjis', errors='surrogateescape')

    textdata = b"hello"            # or "hello"
    bindata = b"abc\xff\x00"
    query = "UPDATE table SET textcol=%s, bincol=%s"

    print(build_query(query, textdata, bindata))

The only problem with correctness will occur if the MySQL connection
uses a non-ASCII-compatible encoding (UTF-16, fixed-width EUC) in the
query string, because the ASCII bytes in the blob will be "widened" by
"encode".

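A tiny illustration of the "widening", using UTF-16-LE as a stand-in
for a non-ASCII-compatible connection encoding (my choice for the
example, not something taken from the MySQL side):

```python
# The ASCII bytes of a blob are decoded as ordinary characters ...
blob = b"abc"
text = blob.decode('ascii', errors='surrogateescape')

# ... and a wide encoding then emits two bytes per character, so the
# blob that reaches the server is no longer the blob we were given.
widened = text.encode('utf-16-le')
print(blob, widened)
```

Here widened is six bytes long (each original byte followed by a NUL),
where the blob was three.
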
Widechar encodings could actually be handled with a "binary" codec
that recognizes *no* characters and always surrogate-encodes every
byte.  But that's pretty obviously going to be unacceptable.
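
For the record, such a codec might look roughly like the following.
This is a hypothetical sketch, not an existing codec: the names
binary_decode and binary_encode are invented here, and unlike the real
'surrogateescape' handler (which only uses U+DC80..U+DCFF for bytes
128-255) it maps *every* byte b to U+DC00 + b.

```python
def binary_decode(data):
    """Map every byte to a lone surrogate; no byte is read as a character."""
    return ''.join(chr(0xDC00 + b) for b in data)


def binary_encode(text):
    """Inverse of binary_decode; rejects anything that is not an escaped byte."""
    out = bytearray()
    for ch in text:
        offset = ord(ch) - 0xDC00
        if not 0 <= offset <= 0xFF:
            raise ValueError("not an escaped byte: %r" % ch)
        out.append(offset)
    return bytes(out)


blob = b"it's\xff\x00"
opaque = binary_decode(blob)
restored = binary_encode(opaque)
```

The round trip is exact, but the str contains no recognizable
characters at all: the apostrophe in the blob is not "'" any more, so
text operations such as escape_string() would no longer see it.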

I guess bytes.format() is pretty well unstoppable at this point.