[Python-ideas] RFC: bytestring as a str representation [was: a new bytestring type?]
Stephen J. Turnbull
stephen at xemacs.org
Wed Jan 8 16:46:14 CET 2014
>>>>> INADA Naoki writes:
> On Wed, Jan 8, 2014 at 7:34 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:
>> INADA Naoki <songofacandy at gmail.com> wrote:
> Some encoding doesn't ensure roundtrip.
In that case, in Python 2 you're depending on all "text" to be encoded
in the same encoding. And even so you may be in trouble:
    def convert(x):
        if isinstance(x, unicode):
            x = x.encode(round_trip_not_guaranteed)
could cause your query to fail when it should succeed. 'x' is
user-supplied data, so you have no control over that.
> I may be able to ascii for decoding when mysql uses ascii compatible
> encoding.
You can *always* use 'ascii', 'latin1', or 'utf-8' with
'surrogateescape' for decoding, and roundtrip is guaranteed.
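A quick way to check that claim, as a minimal sketch. The helper name `roundtrips` and the sample blob are made up for illustration. Note that with 'latin1' the error handler never fires at all, because every byte value already decodes to a character.

```python
def roundtrips(blob: bytes, codec: str) -> bool:
    """Decode with surrogateescape, encode back, compare with the input."""
    text = blob.decode(codec, errors="surrogateescape")
    return text.encode(codec, errors="surrogateescape") == blob


# Every possible byte value, including those invalid in ASCII and UTF-8.
blob = bytes(range(256))

for codec in ("ascii", "latin-1", "utf-8"):
    print(codec, roundtrips(blob, codec))
```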
> But I think decode/encode with surrogateescape is not only slow,
Evidence? Especially as compared with the connection overhead of the
DBMS?
> but also dangerous when using encoding except ascii or utf8.
Or latin1.
But here's your code as translated to Python 3.3, assuming a
connection encoding of Shift JIS:
    # unchanged source, but this is Python 3 str == Unicode
    def escape_string(s):
        return s.replace("'", "''")

    def convert(x):
        if isinstance(x, str):      # Correct type unicode->str
            x = "'" + escape_string(x) + "'"
        elif isinstance(x, bytes):  # Correct type str->bytes
            # SAFE: ASCII is a Unicode subset, RT guaranteed.
            x = x.decode('ascii', errors='surrogateescape')
            x = "'" + escape_string(x) + "'"
        else:
            x = str(x)
        return x

    def build_query(query, *args):
        if isinstance(query, bytes):
            # want str for the format operator
            query = query.decode('sjis')
        query = query % tuple(map(convert, args))
        # CORRECT: for ASCII-compatible encodings, including Shift
        # JIS and Big 5, since the binary blob doesn't contain any
        # non-ASCII characters and the non-character bytes 128-255
        # will be restored properly by the error handler.
        return query.encode('sjis', errors='surrogateescape')

    textdata = b"hello"  # or "hello"
    bindata = b"abc\xff\x00"
    query = "UPDATE table SET textcol=%s, bincol=%s"
    print(build_query(query, textdata, bindata))
The only problem with correctness will occur if the MySQL connection
uses a non-ASCII-compatible encoding (UTF-16, fixed-width EUC) in the
query string, because the ASCII bytes in the blob will be "widened" by
"encode".
Widechar encodings could actually be handled with a "binary" codec
that recognizes *no* characters and always surrogate-encodes every
byte. But that's pretty obviously going to be unacceptable.
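To make the last two paragraphs concrete, here is a rough sketch. `binary_decode` and `encode_query` are hypothetical helpers, not anything in the stdlib, and UTF-16-LE merely stands in for "some widechar connection encoding". The first part shows the widening; the second shows what a "binary" codec would have to do, namely escape every byte (ASCII included) and bypass the text encoder for the escaped ones.

```python
SURROGATE_BASE = 0xDC00  # surrogateescape maps byte b (128-255) to U+DC00 + b


def binary_decode(data: bytes) -> str:
    """Hypothetical 'binary' codec: escape *every* byte, ASCII included."""
    return "".join(chr(SURROGATE_BASE + b) for b in data)


def encode_query(text: str, encoding: str) -> bytes:
    """Encode real characters with `encoding`; restore escaped bytes as-is."""
    out = bytearray()
    for ch in text:
        offset = ord(ch) - SURROGATE_BASE
        if 0 <= offset <= 0xFF:
            out.append(offset)           # escaped byte: emit it unchanged
        else:
            out += ch.encode(encoding)   # ordinary text: use the real codec
    return bytes(out)


blob = b"abc\xff\x00"

# The problem: the ASCII part of the blob decodes to ordinary characters,
# so a wide encoding turns each of those bytes into two.
widened = blob[:3].decode("ascii", errors="surrogateescape").encode("utf-16-le")
print(widened)  # b'a\x00b\x00c\x00', no longer b'abc'

# The workaround: nothing in the blob is treated as a character.
wire = encode_query("x=" + binary_decode(blob), "utf-16-le")
print(wire == "x=".encode("utf-16-le") + blob)
```

One cost is visible right away: escape_string() can no longer see an apostrophe inside the blob, since it has become U+DC27, so quoting would have to be done some other way.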
I guess bytes.format() is pretty well unstoppable at this point.