[Python-ideas] RFC: bytestring as a str representation [was: a new bytestring type?]
Stephen J. Turnbull
stephen at xemacs.org
Wed Jan 8 13:11:40 CET 2014
>>>>> INADA Naoki writes:
> I'd like to share my experience: I've suffered because bytes doesn't have
> %-formatting.
> MySQL-python is the major DB-API 2.0 driver for MySQL.
> MySQL-python uses the 'format' paramstyle.
> The MySQL protocol is basically encoded text, but it may contain arbitrary
> (escaped) binary.
> Here is a simplified example that constructs real SQL from an SQL format
> and arguments. (Works only on Python 2.7)
'>' quotes are omitted for clarity and comments deleted.
    def escape_string(s):
        return s.replace("'", "''")

    def convert(x):
        if isinstance(x, unicode):
            x = x.encode('utf-8')
        if isinstance(x, str):
            x = "'" + escape_string(x) + "'"
        else:
            x = str(x)
        return x

    def build_query(query, *args):
        if isinstance(query, unicode):
            query = query.encode('utf-8')
        return query % tuple(map(convert, args))

    textdata = b"hello"
    bindata = b"abc\xff\x00"
    query = "UPDATE table SET textcol=%s bincol=%s"
    print build_query(query, textdata, bindata)
> I can't port this to Python 3.
Why not? The obvious translation is
    # This is Python 3!!
    def escape_string(s):
        return s.replace("'", "''")

    def convert(x):
        if isinstance(x, bytes):
            # Non-ASCII bytes become lone surrogates (U+DC80..U+DCFF).
            x = escape_string(x.decode('ascii', errors='surrogateescape'))
            x = "'" + x + "'"
        else:
            x = str(x)
        return x

    def build_query(query, *args):
        query = query % tuple(map(convert, args))
        # The surrogates are turned back into the original bytes here.
        return query.encode('utf-8', errors='surrogateescape')

    textdata = "hello"
    bindata = b"abc\xff\x00"
    query = "UPDATE table SET textcol=%s bincol=%s"
    print(build_query(query, textdata, bindata))
The main issue I can think of that you might have with this is that
there will need to be conversions to and from 16-bit representations,
which take up unnecessary space for bindata and are relatively slow.
But it seems to me that these are second-order costs compared to the
other work an adapter needs to do. What am I missing?
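
As a rough check of both points, here is a minimal sketch. It assumes
CPython 3.3 or later, where str storage width depends on the widest
character in the string; `bytes_per_char` is a hypothetical helper
introduced only for this illustration, and the sizes it reports are an
implementation detail rather than a language guarantee.

```python
import sys

raw = b"abc\xff\x00"

# Decode: each byte >= 0x80 becomes a lone surrogate in U+DC80..U+DCFF.
text = raw.decode('ascii', errors='surrogateescape')

# Encode: the same error handler maps those surrogates back to bytes.
round_tripped = text.encode('utf-8', errors='surrogateescape')

def bytes_per_char(unit, n=1000):
    """Marginal storage per character for a str built from `unit`.

    Hypothetical helper; relies on CPython's sys.getsizeof.
    """
    return (sys.getsizeof(unit * 2 * n) - sys.getsizeof(unit * n)) // n

print(ascii(text))
print(round_tripped == raw)
print(bytes_per_char("a"), bytes_per_char("\udcff"))
```

The last line compares a pure-ASCII str with one holding an escaped
byte, which is where the extra space for bindata comes from.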
With the proposed 'ascii-compatible' representation, if you have to
handle many MB of binary or textdata with non-ASCII characters,
    def convert(x):
        if isinstance(x, str):
            x = x.encode('utf-8').decode('ascii-compatible')
        elif isinstance(x, bytes):
            x = escape_string(x.decode('ascii-compatible'))
            x = "'" + x + "'"
        else:
            x = str(x)    # like 42
        return x

    def build_query(query, *args):
        query = convert(query) % tuple(map(convert, args))
        return query.encode('utf-8', errors='surrogateescape')
ensures that the '%' format operator is always dealing with 8-bit
representations only. There might be a conversion from 16-bit to
8-bit for str, but there will be no conversions from 8-bit to 16-bit
representations. I don't know if that makes '%' itself faster, but
it might.
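
The 'ascii-compatible' codec is only a proposal, so the code above
cannot be run as written. A runnable approximation, under the loud
assumption that the existing 'latin-1' codec is an acceptable stand-in
(it maps every byte to the code point of the same value, so nothing
exceeds U+00FF), might look like the sketch below. `STAND_IN` is a name
made up for this example, and the final encode uses the stand-in codec
rather than UTF-8 with surrogateescape, because no surrogates are ever
produced.

```python
# Assumption: 'latin-1' substitutes for the proposed 'ascii-compatible' codec.
STAND_IN = 'latin-1'

def escape_string(s):
    return s.replace("'", "''")

def convert(x):
    if isinstance(x, str):
        # Carry text as its UTF-8 bytes, one code point per byte.
        x = x.encode('utf-8').decode(STAND_IN)
    elif isinstance(x, bytes):
        x = escape_string(x.decode(STAND_IN))
        x = "'" + x + "'"
    else:
        x = str(x)    # like 42
    return x

def build_query(query, *args):
    query = convert(query) % tuple(map(convert, args))
    # Every code point is below U+0100, so this recovers the bytes exactly.
    return query.encode(STAND_IN)

textdata = "hello"
bindata = b"abc\xff\x00"
query = "UPDATE table SET textcol=%s bincol=%s"
print(build_query(query, textdata, bindata))
```

As in the code it imitates, str arguments pass through `convert`
unquoted, since the same function is also applied to the query itself.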