[Python-ideas] RFC: bytestring as a str representation [was: a new bytestring type?]

Wed Jan 8 14:10:42 CET 2014

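You're right.
As I said in my previous mail, I had not considered using surrogateescape.

But surrogateescape is not a silver bullet.
Decoding with ascii and then encoding with the target encoding is valid only
when the target encoding is ASCII-compatible. With utf16, for example, the
binary data gets re-encoded along with the text: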

In [29]: bindata = b'abc'
In [30]: bindata = bindata.decode('ascii', 'surrogateescape')
In [31]: text = 'abc'
In [32]: query = 'SET textcolumn=%s bincolumn=%s' % ("'" + text + "'", "'" + bindata + "'")
In [33]: query.encode('utf16', 'surrogateescape')
Out[33]: b"\xff\xfeS\x00E\x00T\x00 \x00t\x00e\x00x\x00t\x00c\x00o\x00l\x00u\x00m\x00n\x00=\x00'\x00a\x00b\x00c\x00'\x00 \x00b\x00i\x00n\x00c\x00o\x00l\x00u\x00m\x00n\x00=\x00'\x00a\x00b\x00c\x00'\x00"
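
To make the contrast explicit, here is a minimal sketch (Python 3, variable names are only illustrative) comparing an ASCII-compatible target encoding with utf-16. With utf-8 the smuggled bytes come back unchanged; with utf-16 even pure-ASCII binary data is widened to two bytes per byte, so the column would no longer hold the original bytes.

```python
# Sketch only: names such as `raw` and `smuggled` are illustrative.
raw = b"abc\xff\x00"

# Smuggle the bytes into a str; the non-ASCII byte 0xff becomes a lone surrogate.
smuggled = raw.decode("ascii", "surrogateescape")

# An ASCII-compatible codec restores the original bytes exactly.
via_utf8 = smuggled.encode("utf-8", "surrogateescape")

# A non-ASCII-compatible codec rewrites even the plain ASCII bytes.
ascii_part = raw[:3].decode("ascii", "surrogateescape")
via_utf16 = ascii_part.encode("utf-16-le", "surrogateescape")

print(via_utf8 == raw)  # True
print(via_utf16)        # b'a\x00b\x00c\x00'
```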

Fortunately, I can't use utf16 as client encoding with MySQL.
mysql> SET NAMES utf16;
ERROR 1231 (42000): Variable 'character_set_client' can't be set to the value of 'utf16'
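
For a driver that cannot rely on the server rejecting such encodings, a rough guard might look like the sketch below. Note that is_ascii_compatible is a hypothetical helper for illustration, not part of MySQL-python or the standard library, and it only checks that the 128 ASCII characters encode to themselves.

```python
def is_ascii_compatible(encoding):
    """Return True if every ASCII character encodes to its own single byte.

    Hypothetical helper, for illustration only.
    """
    ascii_bytes = bytes(range(128))
    try:
        return ascii_bytes.decode("ascii").encode(encoding) == ascii_bytes
    except (UnicodeEncodeError, LookupError):
        # Unknown codec, or a codec that cannot represent some ASCII character.
        return False


for name in ("utf-8", "latin-1", "utf-16"):
    print(name, is_ascii_compatible(name))
```

A driver could refuse the decode-with-ascii trick whenever this returns False for the client encoding.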

On Wed, Jan 8, 2014 at 9:11 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:

> >>>>> INADA Naoki writes:
>
>  > Let me share my experience of suffering because bytes doesn't have %-format.
>  > MySQL-python is the most widely used DB-API 2.0 driver for MySQL.
>  > MySQL-python uses 'format' paramstyle.
>
>  > MySQL protocol is basically encoded text, but it may contain arbitrary
>  > (escaped) binary.
>  > Here is a simplified example that constructs real SQL from an SQL format
>  > string and arguments. (Works only on Python 2.7)
>
> '>' quotes are omitted for clarity and comments deleted.
>
>     def escape_string(s):
>         return s.replace("'", "''")
>
>     def convert(x):
>         if isinstance(x, unicode):
>             x = x.encode('utf-8')
>         if isinstance(x, str):
>             x = "'" + escape_string(x) + "'"
>         else:
>             x = str(x)
>         return x
>
>     def build_query(query, *args):
>         if isinstance(query, unicode):
>             query = query.encode('utf-8')
>         return query % tuple(map(convert, args))
>
>     textdata = b"hello"
>     bindata = b"abc\xff\x00"
>     query = "UPDATE table SET textcol=%s bincol=%s"
>
>     print build_query(query, textdata, bindata)
>
>  > I can't port this to Python 3.
>
> Why not?  The obvious translation is
>
>     # This is Python 3!!
>     def escape_string(s):
>         return s.replace("'", "''")
>
>     def convert(x):
>         if isinstance(x, bytes):
>             x = escape_string(x.decode('ascii', errors='surrogateescape'))
>             x = "'" + x + "'"
>         else:
>             x = str(x)
>         return x
>
>     def build_query(query, *args):
>         query = query % tuple(map(convert, args))
>         return query.encode('utf-8', errors='surrogateescape')
>
>     textdata = "hello"
>     bindata = b"abc\xff\x00"
>     query = "UPDATE table SET textcol=%s bincol=%s"
>
>     print(build_query(query, textdata, bindata))
>
> The main issue I can think of that you might have with this is that there will
> need to be conversions to and from 16-bit representations, which take
> up unnecessary space for bindata, and are relatively slow for bindata.
> But it seems to me that these are second-order costs compared to the
> other work an adapter needs to do.  What am I missing?
>
> With the proposed 'ascii-compatible' representation, if you have to
> handle many MB of binary or textdata with non-ASCII characters,
>
>     def convert(x):
>         if isinstance(x, str):
>             x = x.encode('utf-8').decode('ascii-compatible')
>         elif isinstance(x, bytes):
>             x = escape_string(x.decode('ascii-compatible'))
>             x = "'" + x + "'"
>         else:
>             x = str(x)  # like 42
>         return x
>
>     def build_query(query, *args):
>         query = convert(query) % tuple(map(convert, args))
>         return query.encode('utf-8', errors='surrogateescape')
>
> ensures that the '%' format operator is always dealing with 8-bit
> representations only.  There might be a conversion from 16-bit to
> 8-bit for str, but there will be no conversions from 8-bit to 16-bit
> representations.  I don't know if that makes '%' itself faster, but
> it might.
>
>

-- 
INADA Naoki  <songofacandy at gmail.com>