<div dir="ltr"><div><div><div>You're right.<br>As I said previous mail, I had not considered about using surrogateescape.<br><br></div>But surrogateescpae is not silverbullet.<br></div><div>Decode with ascii and encode with target encoding is not valid on ascii compatible encoding.<br>


</div><br>In [29]: bindata = b'abc'<br>In [30]: bindata = bindata.decode('ascii', 'surrogateescape')<br>In [31]: text = 'abc'<br>In [32]: query = 'SET textcolumn=%s bincolumn=%s' % ("'" + text + "'", "'" + bindata + "'")<br>


In [33]: query.encode('utf16', 'surrogateescape')<br>Out[33]: b"\xff\xfeS\x00E\x00T\x00 \x00t\x00e\x00x\x00t\x00c\x00o\x00l\x00u\x00m\x00n\x00=\x00'\x00a\x00b\x00c\x00'\x00 \x00b\x00i\x00n\x00c\x00o\x00l\x00u\x00m\x00n\x00=\x00'\x00a\x00b\x00c\x00'\x00"<br>


<br></div>Fortunately, I can't use utf16 as client encoding with MySQL.<br>mysql> SET NAMES utf16;<br>ERROR 1231 (42000): Variable 'character_set_client' can't be set to the value of 'utf16'<br>


<div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Jan 8, 2014 at 9:11 PM, Stephen J. Turnbull <span dir="ltr"><<a href="mailto:stephen@xemacs.org" target="_blank">stephen@xemacs.org</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">>>>>> INADA Naoki writes:<br>

<br>

 > I share my experience that I've suffered by bytes doesn't have %-format.<br>

</div> > `MySQL-python is a most major DB-API 2.0 driver for MySQL.<br>

 > MySQL-python uses 'format' paramstyle.<br>

<div class="im"><br>

 > MySQL protocol is basically encoded text, but it may contain arbitrary<br>

 > (escaped) binary.<br>

 > Here is simplified example constructing real SQL from SQL format and<br>

 > arguments. (Works only on Python 2.7)<br>

<br>

</div>'>' quotes are omitted for clarity and comments deleted.<br>

<div class="im"><br>

    def escape_string(s):<br>

        return s.replace("'", "''")<br>

<br>

    def convert(x):<br>

        if isinstance(x, unicode):<br>

            x = x.encode('utf-8')<br>

</div><div class="im">        if isinstance(x, str):<br>

            x = "'" + escape_string(x) + "'"<br>

</div>        else:<br>

            x = str(x)<br>

<div class="im">        return x<br>

<br>

    def build_query(query, *args):<br>

        if isinstance(query, unicode):<br>

            query = query.encode('utf-8')<br>

        return query % tuple(map(convert, args))<br>

<br>

    textdata = b"hello"<br>

    bindata = b"abc\xff\x00"<br>

    query = "UPDATE table SET textcol=%s bincol=%s"<br>

<br>

    print build_query(query, textdata, bindata)<br>

<br>

 > I can't port this to Python 3.<br>

<br>

</div>Why not?  The obvious translation is<br>

<br>

    # This is Python 3!!<br>

<div class="im">    def escape_string(s):<br>

        return s.replace("'", "''")<br>

<br>

    def convert(x):<br>

</div>        if isinstance(x, bytes):<br>

            x = escape_string(x.decode('ascii', errors='surrogateescape'))<br>

            x = "'" + x + "'"<br>

        else:<br>

            x = str(x)<br>

<div class="im">        return x<br>

<br>

    def build_query(query, *args):<br>

</div>        query = query % tuple(map(convert, args))<br>

        return query.encode('utf-8', errors='surrogateescape')<br>

<br>

    textdata = "hello"<br>

<div class="im">    bindata = b"abc\xff\x00"<br>

    query = "UPDATE table SET textcol=%s bincol=%s"<br>

<br>

    print build_query(query, textdata, bindata)<br>

<br>

</div>The main issue I can think you might have with this is that there will<br>

need to be conversions to and from 16-bit representations, which take<br>

up unnecessary space for bindata, and are relatively slow for bindata.<br>

But it seems to me that these are second-order costs compared to the<br>

other work an adapter needs to do.  What am I missing?<br>

<br>

With the proposed 'ascii-compatible' representation, if you have to<br>

handle many MB of binary or textdata with non-ASCII characters,<br>

<br>

    def convert(x):<br>

        if isinstance(x, str):<br>

            x = x.encode('utf-8').decode('ascii-compatible')<br>

        elif isinstance(x, bytes):<br>

            x = escape_string(x.decode('ascii-compatible'))<br>

            x = "'" + x + "'"<br>

<div class="im">        else:<br>

            x = str(x)  # like 42<br>

        return x<br>

<br>

    def build_query(query, *args):<br>

</div>        query = convert(query) % tuple(map(convert, args))<br>

        return query.encode('utf-8', errors='surrogateescape')<br>

<br>

ensures that the '%' format operator is always dealing with 8-bit<br>

representations only.  There might be a conversion from 16-bit to<br>

8-bit for str, but there will be no conversions from 8-bit to 16-bit<br>

representations.  I don't know if that makes '%' itself faster, but<br>

it might.<br>

<br>

</blockquote></div><br><br clear="all"><br>-- <br>INADA Naoki  <<a href="mailto:songofacandy@gmail.com">songofacandy@gmail.com</a>>

</div>