[DB-SIG] Should Binary accept unicode string?

Mike Bayer mike_mp at zzzcomputing.com
Sat Jan 16 22:51:54 EST 2016



On 01/16/2016 02:30 PM, M.-A. Lemburg wrote:
> On 16.01.2016 16:27, Mike Bayer wrote:
>>
>>
>> On 01/15/2016 05:14 PM, Vernon D. Cole wrote:
>>> Mike:
>>>
>>>   Thank you for your long explanation. I got lost somewhere in the
>>> middle there, though.
>>>
>>> If you are suggesting that better documentation be added to PEP-249,
>>> then perhaps you could include a suggestion as to a (brief) note which
>>> could be appended.
>>
>>
>> Well, I wasn't even going that far.  I'm trying to get a handle on what
>> pep-249's position is as far as portability of datatypes.   It has
>> always struck me as very weak.
>>
>>
>>
>>>
>>> If you are suggesting that the PEP be expanded to provide a service
>>> not now generally available, then perhaps you ought to start the
>>> long-talked-about-but-never-tried task of writing a DBAPI level 3 PEP.
>>
>> Well if a DBAPI driver would like to accept a Python unicode object to a
>> Binary() and produce bytes, some conversion is needed, and there are
>> many possible conversions that could take place - there is every
>> possible encoding, and typically at least four potential candidates
>> among those available.
>>
>> It's my position that the Binary() type should *not* offer to
>> automatically choose such a conversion and should only accept Python
>> types (or 3rd party extension types, sure) that are explicitly 1-1
>> mappable to a stream of bytes without a "conversion decision" being
>> made.  The type of conversion should not be guessed among a choice of
>> several / hundreds within the Binary type.
>>
>> So definitely, not proposing any new service other than "disallow
>> ambiguous input".
> 
> I still don't understand why you want to restrict Binary()
> from performing automatic conversions on the input types.

it would be consistent with the philosophy of Python 3 itself, that
unicode and "bytes" are two different things without an implicit
conversion.   It's also a place where, without clear guidance in the pep,
some DBAPIs are going to do the conversion and others not, leading to
non-portable code.
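
A minimal sketch of the "disallow ambiguous input" idea, assuming Python 3. The name strict_binary is hypothetical; it is not part of pep-249 or of any particular driver. The dictionary at the end only illustrates why a guess would be needed: the same text yields different bytes under different encodings.

```python
def strict_binary(value):
    """Hypothetical Binary() constructor that refuses to guess an encoding."""
    if isinstance(value, str):
        # A str has no single byte representation, so reject it outright.
        raise TypeError(
            "Binary() requires bytes-like input, not str; "
            "encode the string explicitly first"
        )
    # memoryview() accepts only objects exposing the buffer protocol
    # (bytes, bytearray, memoryview, array.array, ...) and raises
    # TypeError for anything else, so no conversion decision is made.
    return memoryview(value).tobytes()


text = "caf\u00e9"
# Three plausible encodings, three different byte streams for the same text.
candidates = {enc: text.encode(enc) for enc in ("utf-8", "latin-1", "utf-16-le")}
```

With this approach the caller writes strict_binary(text.encode("utf-8")) and the choice of encoding is visible in application code.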

Basically, if pep-249 said, "The Binary() object should accept Python
unicode objects and should encode them to bytes using an encoding
indicated by the .encoding attribute on Connection", that would be
better than the current situation, in which it says nothing at all.

It's the "it says nothing at all" part here that's more troubling to me,
rather than whether or not a Unicode should be encoded when passed to a
Binary().   If the spec had a recommendation for where this encoding
should come from, then that allows portable code to be written.
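
A rough sketch of what such a recommendation could look like in practice, assuming Python 3. The Connection class and the binary_for function are hypothetical stand-ins for illustration only; pep-249 defines Binary() as a module-level constructor with no connection argument and defines no .encoding attribute.

```python
class Connection:
    """Hypothetical stand-in for a DBAPI connection carrying an .encoding."""

    def __init__(self, encoding="utf-8"):
        self.encoding = encoding


def binary_for(connection, value):
    """Hypothetical Binary() that encodes str via connection.encoding."""
    if isinstance(value, str):
        # The encoding comes from one documented place, not from a guess.
        return value.encode(connection.encoding)
    # Bytes-like input passes through unchanged.
    return memoryview(value).tobytes()


latin1_conn = Connection(encoding="latin-1")
utf8_conn = Connection()  # defaults to utf-8 in this sketch
```

Here binary_for(latin1_conn, "caf\u00e9") and binary_for(utf8_conn, "caf\u00e9") produce different bytes, but the difference is predictable from the connection's setting, which is what would let portable code be written.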

> 
> Unicode is just one example of where you can implement such
> conversion, e.g. a database module may want to automatically
> convert Unicode to UTF-8. For database backends which don't
> provide Unicode support, this is usually also being done
> for string parameter types.

it is true that even in Python 3, DBAPIs are obviously taking on the
task of figuring out an appropriate encoding to use for strings, which
are unicode objects in Python 3.   It strongly suggests that pep-249 or
its successor would be served by referring to an "encoding" setting in a
standard way.


> 
> mxODBC, for example, allows setting a per connection .encoding
> attribute to define which encoding to use in such cases.
> 
> But again, Unicode is just one example. Binary() may also
> apply automatic conversions for other types, such as images,
> numeric arrays, etc.
> 
> The DB-API standard cannot define which types to autoconvert
> and which not. This is a conscious decision left to the database
> module authors.
> 
> They have to make similar choices for all other parameter
> types as well, e.g. whether to convert datetime values to
> strings, ticks or whether to reject them.

> In many cases, the database backends don't provide parameter
> type information, so the database module has to decide what
> to do. In other cases, the database module may get type information
> from the database and then has to decide what to do with the
> input parameters passed to it from Python.
> 
> Back to Binary(): What we could do is recommend to use e.g.
> buffer() for Python 2 as default implementation and bytes()
> for Python 3.

I think Binary should certainly accept str in Py2K, since that's the
normal place we get "bytes" from in Py2K.  If it wants to accept a
buffer() also, that's fine (because no guess needs to be made); I doubt
anyone will use it though, as it's a deprecated type and they can just use
str.

> 
> The fact that buffer() does accept Unicode objects in Python 2
> is due to the way the buffer interface works in Python 2 (in 2000
> we thought it would be a good idea to allow access to
> the UCS-2 data; later on, when we added UCS-4 support, we
> could not easily remove this feature anymore).

yeah UCS-2 / UCS-4, not too useful in the outside world these days :)

> 
> For Python 3, bytes() won't accept Unicode because the buffer
> interface was changed in Python 3 so that Unicode objects no
> longer expose the binary buffer interface.
> 
> Would that make you happy ? :-)

it's all great!  I only seek to understand the thinking behind DBAPI's
philosophy in areas like these.


> 

