<div dir="ltr"><div><div><div>You're right.<br>As I said previous mail, I had not considered about using surrogateescape.<br><br></div>But surrogateescpae is not silverbullet.<br></div><div>Decode with ascii and encode with target encoding is not valid on ascii compatible encoding.<br>
</div><br>In [29]: bindata = b'abc'<br>In [30]: bindata = bindata.decode('ascii', 'surrogateescape')<br>In [31]: text = 'abc'<br>In [32]: query = 'SET textcolumn=%s bincolumn=%s' % ("'" + text + "'", "'" + bindata + "'")<br>
In [33]: query.encode('utf16', 'surrogateescape')<br>Out[33]: b"\xff\xfeS\x00E\x00T\x00 \x00t\x00e\x00x\x00t\x00c\x00o\x00l\x00u\x00m\x00n\x00=\x00'\x00a\x00b\x00c\x00'\x00 \x00b\x00i\x00n\x00c\x00o\x00l\x00u\x00m\x00n\x00=\x00'\x00a\x00b\x00c\x00'\x00"<br>
<br></div>Fortunately, I can't use utf16 as client encoding with MySQL.<br>mysql> SET NAMES utf16;<br>ERROR 1231 (42000): Variable 'character_set_client' can't be set to the value of 'utf16'<br>
<div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Jan 8, 2014 at 9:11 PM, Stephen J. Turnbull <span dir="ltr"><<a href="mailto:stephen@xemacs.org" target="_blank">stephen@xemacs.org</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">>>>>> INADA Naoki writes:<br>
<br>
> I share my experience that I've suffered by bytes doesn't have %-format.<br>
</div> > `MySQL-python is a most major DB-API 2.0 driver for MySQL.<br>
> MySQL-python uses 'format' paramstyle.<br>
<div class="im"><br>
> MySQL protocol is basically encoded text, but it may contain arbitrary<br>
> (escaped) binary.<br>
> Here is simplified example constructing real SQL from SQL format and<br>
> arguments. (Works only on Python 2.7)<br>
<br>
</div>'>' quotes are omitted for clarity and comments deleted.<br>
<div class="im"><br>
def escape_string(s):<br>
return s.replace("'", "''")<br>
<br>
def convert(x):<br>
if isinstance(x, unicode):<br>
x = x.encode('utf-8')<br>
</div><div class="im"> if isinstance(x, str):<br>
x = "'" + escape_string(x) + "'"<br>
</div> else:<br>
x = str(x)<br>
<div class="im"> return x<br>
<br>
def build_query(query, *args):<br>
if isinstance(query, unicode):<br>
query = query.encode('utf-8')<br>
return query % tuple(map(convert, args))<br>
<br>
textdata = b"hello"<br>
bindata = b"abc\xff\x00"<br>
query = "UPDATE table SET textcol=%s bincol=%s"<br>
<br>
print build_query(query, textdata, bindata)<br>
<br>
> I can't port this to Python 3.<br>
<br>
</div>Why not? The obvious translation is<br>
<br>
# This is Python 3!!<br>
<div class="im"> def escape_string(s):<br>
return s.replace("'", "''")<br>
<br>
def convert(x):<br>
</div> if isinstance(x, bytes):<br>
x = escape_string(x.decode('ascii', errors='surrogateescape'))<br>
x = "'" + x + "'"<br>
else:<br>
x = str(x)<br>
<div class="im"> return x<br>
<br>
def build_query(query, *args):<br>
</div> query = query % tuple(map(convert, args))<br>
return query.encode('utf-8', errors='surrogateescape')<br>
<br>
textdata = "hello"<br>
<div class="im"> bindata = b"abc\xff\x00"<br>
query = "UPDATE table SET textcol=%s bincol=%s"<br>
<br>
print build_query(query, textdata, bindata)<br>
<br>
</div>The main issue I can think you might have with this is that there will<br>
need to be conversions to and from 16-bit representations, which take<br>
up unnecessary space for bindata, and are relatively slow for bindata.<br>
But it seems to me that these are second-order costs compared to the<br>
other work an adapter needs to do. What am I missing?<br>
<br>
With the proposed 'ascii-compatible' representation, if you have to<br>
handle many MB of binary or textdata with non-ASCII characters,<br>
<br>
def convert(x):<br>
if isinstance(x, str):<br>
x = x.encode('utf-8').decode('ascii-compatible')<br>
elif isinstance(x, bytes):<br>
x = escape_string(x.decode('ascii-compatible'))<br>
x = "'" + x + "'"<br>
<div class="im"> else:<br>
x = str(x) # like 42<br>
return x<br>
<br>
def build_query(query, *args):<br>
</div> query = convert(query) % tuple(map(convert, args))<br>
return query.encode('utf-8', errors='surrogateescape')<br>
<br>
ensures that the '%' format operator is always dealing with 8-bit<br>
representations only. There might be a conversion from 16-bit to<br>
8-bit for str, but there will be no conversions from 8-bit to 16-bit<br>
representations. I don't know if that makes '%' itself faster, but<br>
it might.<br>
<br>
</blockquote></div><br><br clear="all"><br>-- <br>INADA Naoki <<a href="mailto:songofacandy@gmail.com">songofacandy@gmail.com</a>>
</div>