[DB-SIG] Should Binary accept unicode string?
Mike Bayer
mike_mp at zzzcomputing.com
Fri Jan 15 12:21:33 EST 2016
On 01/15/2016 11:15 AM, M.-A. Lemburg wrote:
> On 15.01.2016 16:52, Mike Bayer wrote:
>>
>>
>> On 01/15/2016 09:47 AM, M.-A. Lemburg wrote:
>>> On 12.01.2016 13:59, INADA Naoki wrote:
>>>> Hi, all.
>>>>
>>>> I found DB-API 2.0 defines Binary() as Binary(string).
>>>> https://www.python.org/dev/peps/pep-0249/#binary
>>>>
>>>> What does the string mean?
>>>> On Python 2, should Binary accept unicode?
>>>> On Python 3, should Binary accept str?
>>>
>>> The Binary() wrapper is intended to provide extra information
>>> for the database module and marks the intent of the user to have
>>> the input parameter be bound to the binding parameter as
>>> binary rather than text (e.g. VARBINARY rather than VARCHAR).
>>>
>>> For Python 2, you'd probably use something like Binary=buffer.
>>> On Python 3, Binary=bytes or Binary=bytearray seem like natural
>>> choices.
>>>
>>> The choice of possible input parameters for Binary() is
>>> really up to the database module author.
>>
>> I still don't understand this philosophy of pep-249. Allowing DBAPIs
>> to arbitrarily decide how strict / loose they want to be for
>> user-defined data passed to even very well known datatypes has a
>> negative impact on portability. It means that code I write for one
>> DBAPI will fail on another. Is it your view that databases and DBAPIs
>> are so fundamentally different, even for basic things like
>> unicodes/bytes, that attempting to provide for portability is hopeless?
>> Why even have a pep-249 if I should expect that I have to rewrite my
>> whole application when switching DBAPIs anyway?
>>
>> Obviously, full portability between DBAPIs and databases is never going
>> to be possible. But for easy things where a pro-portability decision is
>> clearly very feasible, like, "do / don't accept a unicode object for a
>> bytes type", why can't a decision be made?
>
> I think you are misunderstanding the purpose of the Binary() helper:
> This was added as a portable way to tell the database interface
> to bind data as binary to the parameter, nothing more.
I fully understand its purpose. I'm referring to the scope of what we
mean by "bind data".
>
> Since some database modules rely on the Python type of the
> input parameters to tell whether to bind as binary,
> character, numeric, etc., but did not have the distinction between
> binary and text data in Python 2, as we now do in Python 3,
> the Binary() wrapper was added to make the distinction clear.
>
> The types Binary() allows as input are not part of the DB-API,
> just like we don't make any comments about the allowed input
> types for any other parameter type the database interface
> may support.
Yes, and it is this philosophy I am commenting on - specifically for
the case where it is very obvious that the Python data in question
does not align with the database type being referred to without a
conversion taking place, and where that conversion is non-obvious
(e.g. a "guess" must be made).
This is not the same as when we have, say, a Python datetime.datetime()
object being mapped to a TIMESTAMP field - there's a natural conversion
that can take place here, assuming the types line up as far as the
presence or absence of fields like microseconds or timezone. The fields
of datetime objects can line up exactly against a backend database type
without the need for implicit decision-making about format.
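As a plain-Python sketch of what "natural conversion" means here - no
DBAPI involved, just the stdlib:

import datetime

# every field of the Python object corresponds directly to a field
# of the SQL TIMESTAMP type; rendering it is mechanical, no guessing
ts = datetime.datetime(2016, 1, 15, 12, 21, 33, 500000)
print(ts.isoformat(' '))   # 2016-01-15 12:21:33.500000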
Whereas if we tried to map an integer value to a TIMESTAMP field, there
is no natural conversion - the integer can mean all kinds of things,
like the epoch in seconds, days, etc. If the database does not accept
this data directly and the driver unilaterally decides to treat it as
the epoch in days since 1970, that is an arbitrary choice, and it would
be surprising behavior.
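A quick sketch of the ambiguity, again in plain Python:

import datetime

epoch = datetime.datetime(1970, 1, 1)
value = 16815   # one integer, at least two plausible readings

print(epoch + datetime.timedelta(seconds=value))  # 1970-01-01 04:40:15
print(epoch + datetime.timedelta(days=value))     # 2016-01-15 00:00:00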
>
> This adds flexibility and makes it possible to create
> interface modules which support a great deal more than
> just a few standard Python data types.
that is absolutely true; however, I don't see how you can deny that
portability is negatively impacted when on one DBAPI I can say:
cursor.execute(
    "insert into table (q) values (?)", [Binary(u'some unicode')]
)
and on another, it fails, and I instead have to type:
cursor.execute(
    "insert into table (q) values (?)",
    [Binary(u'some unicode'.encode('utf-8'))]
)
The latter bit of code can in fact work on *both* systems; the first
bit cannot. By disallowing the first style, the developer is
encouraged to write portable code. By allowing it, developers are led
into writing code that is not portable; a runnable version of the
portable style follows.
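Here I'm using sqlite3 purely as a stand-in, since it ships with
Python - any driver exposing Binary() has the same shape:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("create table t (q blob)")

# encode explicitly in application code, so the driver only ever
# receives bytes; this behaves identically on any DBAPI
data = u'some unicode'.encode('utf-8')
conn.execute("insert into t (q) values (?)", [sqlite3.Binary(data)])

print(conn.execute("select q from t").fetchone()[0])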
This is portability in a nutshell, and I'm sure you understand this.
That the emphasis pep-249 puts on "flexibility" so often comes at the
cost of "portability", even in very obvious cases like this one, is
the philosophy I find questionable, being that this is Python and not
Perl.
>
> Back to the choices I mentioned for Binary():
>
> In Python 2, buffer() does allow unicode objects
> on input, and what you get as a result corresponds to the binary
> representation of the unicode object as used by Python.
The buffer() type produces Python's internal representation of the
unicode data; it does not, for example, try to first encode the unicode
object based on the platform charset. For the use case people expect
when converting unicode objects to bytes, the behavior of buffer() is
completely surprising:
>>> list(buffer('f'))
['f']
>>> list(buffer(u'f'))
['f', '\x00', '\x00', '\x00']
The latter is the utf-32 representation used internally by my Python
interpreter, which is a "wide" unicode build. Other Python
interpreters, like the narrow build on my Mac, return a utf-16
representation:
>>> list(buffer(u'f'))
['f', '\x00']
neither the utf-16 nor the utf-32 format is commonly used with database
applications as a transport format; utf-8 is vastly more common. It
should not be controversial that exposing the internal Python Unicode
format, which isn't even portable across individual Python builds, is
entirely inappropriate for any kind of data exchange between systems.
That the buffer() construct was removed in Python 3 and replaced with
memoryview(), which does not accept strings at all, is part of the
bigger story: Python 2's unicode handling, including the behavior of
buffer(), came to be regarded as a design mistake.
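For comparison, here is what the replacement does on a recent Python 3
(the exact error message varies by version):

>>> memoryview(u'f')
Traceback (most recent call last):
  ...
TypeError: memoryview: a bytes-like object is required, not 'str'
>>> list(memoryview(u'f'.encode('utf-8')))
[102]

The encoding decision has to be made explicitly, in application code,
where it belongs.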
>
> In Python 3, bytes() requires you to be more specific: you
> need to provide an encoding. The result is an encoded binary
> version of the text input.
>
> Both are reasonable choices for a Binary() wrapper and
> the result is easy to detect as "bind me as binary data"
> for the database module.
>
> It is not uncommon to convert text data to
> binary data for storage, esp. when dealing with larger
> blobs you just want to manage and not work on,
that's not controversial. But consider the driver having to make one
of the following guesses:
1. unicode passed, use buffer() to return the Python interpreter's
internal representation of it (or utf-32 always, or utf-16 always, or.. ?)
2. unicode passed, try to encode with utf-8 because that's "probably"
what someone wants
3. unicode passed, try to encode with the particular charset that's
configured for this driver (if the driver supports configured charsets
at all)
I would argue this scheme works against portability, because it is
impossible to avoid choosing an arbitrary conversion scheme. This
arbitrariness is a design mistake that Python 3 repaired.
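To make the divergence concrete, here is the same one-character string
under each of the three guesses (plain Python 2; latin-1 standing in
for "whatever charset the driver is configured with"):

>>> u = u'\xe9'
>>> u.encode('utf-32-le')   # guess #1, on a wide unicode build
'\xe9\x00\x00\x00'
>>> u.encode('utf-8')       # guess #2
'\xc3\xa9'
>>> u.encode('latin-1')     # guess #3
'\xe9'

Three different byte sequences land in the database for the same
application data, depending on which guess the driver made.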
> or when
> you want to preserve it in exactly the same form you pass
> it to the database (without any implicit normalizations,
> surrogate conversions, warnings, etc.).
The "form" of Python Unicode data outside of the interpreter is
essentially undefined. This would mean one is storing data in their
database that not only can't be read portably between major versions of
Python, it can't be read portably between *builds* of the same Python
version. Developers would vastly prefer their application raise an
error rather than implicitly pass the internal Unicode format of data to
their database, since this is not a use case anyone has.