[Python-ideas] Fixing the Python 3 bytes constructor

Sun Mar 30 08:31:37 CEST 2014

On 30 March 2014 16:10, Gregory P. Smith <greg at krypto.org> wrote:
>> Open questions
>> ^^^^^^^^^^^^^^
>>
>> * Should ``bytearray.byte()`` also be added? Or is
>>   ``bytearray(bytes.byte(x))`` sufficient for that case?
>> * Should ``bytes.from_len()`` also be added? Or is sequence repetition
>>   sufficient for that case?
>
> I prefer keeping them consistent across the types myself.
>
>> * Should ``bytearray.from_len()`` use a different name?
>
> This name works for me.
>
>>
>> * Should ``bytes.byte()`` raise ``TypeError`` or ``ValueError`` for binary
>>   sequences with more than one element? The ``TypeError`` currently
>>   proposed is copied (with slightly improved wording) from the behaviour
>>   of ``ord()`` with sequences containing more than one code point, while
>>   ``ValueError`` would be more consistent with the existing handling of
>>   out-of-range integer values.
>> * ``bytes.byte()`` is defined above as accepting length 1 binary sequences
>>   as individual bytes, but this is currently inconsistent with the main
>>   ``bytes`` constructor::
>
>
> I don't like that bytes.byte() would accept anything other than an int. It
> should not accept length 1 binary sequences at all.  I'd prefer to see
> bytes.byte(b"X") raise a TypeError.

Unfortunately, it's not that simple, because accepting both is the
only way I see of rendering the current APIs coherent. The problem is
that the str-derived APIs expect bytes objects, the bytearray mutating
methods expect integers, and in Python 3.3, the substring search APIs
were updated to accept both. This means we currently have:

>>> data = bytes([1, 2, 3, 4])
>>> 3 in data
True
>>> b"\x03" in data
True
>>> data.count(3)
1
>>> data.count(b"\x03")
1
>>> data.replace(3, 4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: expected bytes, bytearray or buffer compatible object
>>> data.replace(b"\x03", b"\x04")
b'\x01\x02\x04\x04'
>>> mutable = bytearray(data)
>>> mutable
bytearray(b'\x01\x02\x03\x04')
>>> mutable.append(b"\x05")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: an integer is required
>>> mutable.append(5)
>>> mutable
bytearray(b'\x01\x02\x03\x04\x05')

Since some APIs work one way and some work the other, the only
backwards compatible path I see to consistency is to always treat a
length 1 byte string as an acceptable input for the APIs that currently
accept an integer, and vice-versa.
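
As a rough illustration of what "accept both" could mean in practice,
here is a minimal sketch in pure Python. The helper names
(as_byte_value, as_subsequence, hybrid_append, hybrid_replace) are
hypothetical and exist only for this example; they are not part of the
PEP or of any existing API, and the exact exception types are one of
the open questions quoted above.

```python
def as_byte_value(x):
    """Hypothetical helper: int or length 1 binary sequence -> int."""
    if isinstance(x, int):
        if not 0 <= x <= 255:
            raise ValueError("byte must be in range(0, 256)")
        return x
    view = memoryview(x)  # TypeError for non-buffer objects such as str
    if view.nbytes != 1:
        raise TypeError("expected an integer or a length 1 binary sequence")
    return view.tobytes()[0]


def as_subsequence(x):
    """Hypothetical helper: int or binary sequence -> bytes."""
    if isinstance(x, int):
        return bytes([as_byte_value(x)])
    return bytes(memoryview(x))


def hybrid_append(target, x):
    # Element API: currently int only, here also length 1 sequences.
    target.append(as_byte_value(x))


def hybrid_replace(data, old, new):
    # Subsequence API: currently sequences only, here also ints.
    return data.replace(as_subsequence(old), as_subsequence(new))


mutable = bytearray(b"\x01\x02\x03\x04")
hybrid_append(mutable, b"\x05")
hybrid_append(mutable, 6)
print(mutable)
print(hybrid_replace(bytes([1, 2, 3, 4]), 3, 4))
```

With helpers along these lines, both spellings that fail in the
session above would succeed, while inputs such as b"ab" passed to the
element API would still be rejected.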

That said, I think this hybrid nature accurately reflects the fact
that indexing and slicing bytes objects in Python 3 return different
types - the individual elements are integers, but the subsequences are
bytes objects, and several of these APIs are either
"element-or-subsequence" APIs (in which case they should accept both),
or else they *should* have been element APIs, but currently expect a
subsequence due to their Python 2 str heritage.
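
The indexing/slicing split is easy to check directly. The short
script below only uses existing behaviour; the variable names are
just for illustration.

```python
data = bytes([1, 2, 3, 4])

element = data[2]        # indexing yields an int
subsequence = data[2:3]  # slicing yields bytes, even at length 1

print(type(element).__name__, element)
print(type(subsequence).__name__, subsequence)

# Converting between the two views has to be done explicitly.
print(bytes([element]) == subsequence)
print(subsequence[0] == element)
```

Both of the final comparisons print True, which is the explicit
conversion a hybrid API would be doing on the caller's behalf.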

If we had the opportunity to redesign these APIs from scratch, we'd
likely make a much clearer distinction between element based APIs
(that would use integers) and subsequence APIs (that would accept
buffer implementing objects). As it is, I think the situation is
inherently ambiguous, and providing hybrid APIs to help deal with that
ambiguity is our best available option.

>> For ``bytearray``, some additional changes are proposed to the current
>> integer based operations to ensure they remain consistent with the
>> proposed constructor changes::
>>
>> * ``append()``: updated to be consistent with ``bytes.byte()``
>> * ``remove()``: updated to be consistent with ``bytes.byte()``
>> * ``+=``: updated to be consistent with ``bytes()`` changes (if any)
>
>
> Where was a change to += behavior mentioned? I don't see that above (or did
> I miss something?).

It was an open question against the constructors - if bytes.byte() is
defined as the PEP suggests, then the case can be made that the bytes()
constructor should also become more permissive about the contents of
the iterables it accepts (allowing length 1 binary sequences as well
as integers). If *that* happens, then extending an existing bytearray
should also become more permissive.

Note that I'm not sold on actually changing that - that's why it's an
open question, rather than something the PEP is currently proposing.
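
To make the open question concrete, here is a sketch of what a more
permissive constructor and extension operation might look like if
that route were taken. permissive_bytes and permissive_extend are
hypothetical names used only for this example; today, bytes([1, b"\x02"])
raises TypeError, and nothing here is being proposed as the actual
behaviour.

```python
def permissive_bytes(items):
    """Hypothetical: build bytes from ints and/or length 1 sequences."""
    out = bytearray()
    for item in items:
        if isinstance(item, int):
            out.append(item)  # ValueError if outside range(256)
        else:
            view = memoryview(item)  # TypeError for non-buffer objects
            if view.nbytes != 1:
                raise TypeError(
                    "expected an integer or a length 1 binary sequence")
            out += view.tobytes()
    return bytes(out)


def permissive_extend(target, items):
    """Hypothetical stand-in for a more permissive bytearray extension."""
    target += permissive_bytes(items)


print(permissive_bytes([1, b"\x02", bytearray(b"\x03"), 4]))

buf = bytearray(b"\x01")
permissive_extend(buf, [2, b"\x03"])
print(buf)
```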

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia