[Python-Dev] PEP 467: Minor API improvements for bytes & bytearray

Fri Aug 15 07:50:25 CEST 2014

I just posted an updated version of PEP 467 after recently finishing
the updates to the Python 3.4+ binary sequence docs to decouple them
from the str docs.

Key points in the proposal:

* deprecate passing integers to bytes() and bytearray()
* add bytes.zeros() and bytearray.zeros() as a replacement
* add bytes.byte() and bytearray.byte() as counterparts to ord() for binary data
* add bytes.iterbytes(), bytearray.iterbytes() and memoryview.iterbytes()

As far as I am aware, that last item poses the only open question,
with the alternative being to add an "iterbytes" builtin with a
definition along the lines of the following:

    def iterbytes(data):
        try:
            getiter = type(data).__iterbytes__
        except AttributeError:
            # No dedicated hook on the type: fall back to the generic mapping
            return map(bytes.byte, data)
        return getiter(data)

Regards,
Nick.

PEP URL: http://www.python.org/dev/peps/pep-0467/

Full PEP text:
=============================
PEP: 467
Title: Minor API improvements for bytes and bytearray
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan <ncoghlan at gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2014-03-30
Python-Version: 3.5
Post-History: 2014-03-30 2014-08-15

Abstract
========

During the initial development of the Python 3 language specification, the
core ``bytes`` type for arbitrary binary data started as the mutable type
that is now referred to as ``bytearray``. Other aspects of operating in
the binary domain in Python have also evolved over the course of the Python
3 series.

This PEP proposes a number of small adjustments to the APIs of the ``bytes``
and ``bytearray`` types to make it easier to operate entirely in the binary
domain.

Background
==========

To simplify the task of writing the Python 3 documentation, the ``bytes``
and ``bytearray`` types were documented primarily in terms of the way they
differed from the Unicode based Python 3 ``str`` type. Even when I
`heavily revised the sequence documentation
<http://hg.python.org/cpython/rev/463f52d20314>`__ in 2012, I retained that
simplifying shortcut.

However, it turns out that this approach to the documentation of these types
had a problem: it didn't adequately introduce users to their hybrid nature,
where they can be manipulated *either* as a "sequence of integers" type,
*or* as ``str``-like types that assume ASCII compatible data.

That oversight has now been corrected, with the binary sequence types now
being documented entirely independently of the ``str`` documentation in
`Python 3.4+ <https://docs.python.org/3/library/stdtypes.html#binary-sequence-types-bytes-bytearray-memoryview>`__.

The confusion isn't just a documentation issue, however, as there are also
some lingering design quirks from an earlier pre-release design where there
was *no* separate ``bytearray`` type, and instead the core ``bytes`` type
was mutable (with no immutable counterpart).

Finally, additional experience with using the existing Python 3 binary
sequence types in real world applications has suggested it would be
beneficial to make it easier to convert integers to length 1 bytes objects.

Proposals
=========

As a "consistency improvement" proposal, this PEP is really a collection of
micro-proposals, each aimed at improving the usability of the binary
data model in Python 3. Proposals are motivated by one of two main factors:

* removing remnants of the original design of ``bytes`` as a mutable type
* allowing users to easily convert integer values to a length 1 ``bytes``
  object

Alternate Constructors
----------------------

The ``bytes`` and ``bytearray`` constructors currently accept an integer
argument, but interpret it to mean a zero-filled object of the given length.
This is a legacy of the original design of ``bytes`` as a mutable type,
rather than a particularly intuitive behaviour for users. It has become
especially confusing now that some other ``bytes`` interfaces treat integers
and the corresponding length 1 bytes instances as equivalent input.
Compare::

    >>> b"\x03" in bytes([1, 2, 3])
    True
    >>> 3 in bytes([1, 2, 3])
    True

    >>> bytes(b"\x03")
    b'\x03'
    >>> bytes(3)
    b'\x00\x00\x00'

This PEP proposes that the current handling of integers in the ``bytes`` and
``bytearray`` constructors be deprecated in Python 3.5 and targeted for
removal in Python 3.7, being replaced by two more explicit alternate
constructors provided as class methods. The initial python-ideas thread
[ideas-thread1]_ that spawned this PEP was specifically aimed at deprecating
this constructor behaviour.

Firstly, a ``byte`` constructor is proposed that converts integers
in the range 0 to 255 (inclusive) to a ``bytes`` object::

    >>> bytes.byte(3)
    b'\x03'
    >>> bytearray.byte(3)
    bytearray(b'\x03')
    >>> bytes.byte(512)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: bytes must be in range(0, 256)

One specific use case for this alternate constructor is to easily convert
the result of indexing operations on ``bytes`` and other binary sequences
from an integer to a ``bytes`` object. The documentation for this API
should note that its counterpart for the reverse conversion is ``ord()``.
The ``ord()`` documentation will also be updated to note that while
``chr()`` is the counterpart for ``str`` input, ``bytes.byte`` and
``bytearray.byte`` are the counterparts for binary input.
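
Until such a constructor exists, roughly the same conversion can be written
today by wrapping the integer in a one-element list. The following is only a
minimal sketch: ``byte_of`` is a hypothetical helper name used for
illustration, not part of the proposal::

    def byte_of(value, cls=bytes):
        """Hypothetical stand-in for the proposed ``cls.byte(value)``."""
        # bytes([n]) and bytearray([n]) already raise ValueError when n is
        # outside range(0, 256), so no explicit range check is needed here.
        return cls([value])

    data = b"\x01\x02\x03"
    assert data[2] == 3                      # indexing yields an integer
    assert byte_of(data[2]) == b"\x03"       # convert it back to bytes
    assert ord(byte_of(data[2])) == data[2]  # ord() is the reverse conversion
    assert byte_of(3, bytearray) == bytearray(b"\x03")

The one-element list is exactly the detour that a dedicated ``byte``
constructor would make unnecessary.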

Secondly, a ``zeros`` constructor is proposed as a direct replacement for the
current constructor behaviour, so that users do not have to resort to the
less intuitive sequence repetition to achieve the same effect::

    >>> bytes.zeros(3)
    b'\x00\x00\x00'
    >>> bytearray.zeros(3)
    bytearray(b'\x00\x00\x00')

The chosen name here is taken from the corresponding initialisation function
in NumPy (although, as these are sequence types rather than N-dimensional
matrices, the constructors take a length as input rather than a shape tuple).
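
For comparison, the sequence repetition spelling that already works today
might look like the following sketch. The ``zeros`` function here is a
hypothetical module-level helper standing in for the proposed class methods::

    def zeros(cls, length):
        """Hypothetical stand-in for the proposed ``cls.zeros(length)``."""
        # Sequence repetition avoids the integer form of the constructor
        # that this PEP proposes to deprecate.
        return cls(b"\x00") * length

    assert zeros(bytes, 3) == b"\x00\x00\x00"
    assert zeros(bytearray, 3) == bytearray(b"\x00\x00\x00")
    assert zeros(bytes, 3) == bytes(3)  # same result as the current constructor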

While ``bytes.byte`` and ``bytearray.zeros`` are expected to be the more
useful duo amongst the new constructors, ``bytes.zeros`` and
``bytearray.byte`` are provided in order to maintain API consistency between
the two types.

Iteration
---------

While iteration over ``bytes`` objects and other binary sequences produces
integers, it is sometimes desirable to iterate over length 1 bytes objects
instead.

To handle this situation more obviously (and more efficiently) than would be
the case with the ``map(bytes.byte, data)`` construct enabled by the above
constructor changes, this PEP proposes the addition of a new ``iterbytes``
method to ``bytes``, ``bytearray`` and ``memoryview``::

    for x in data.iterbytes():
        # x is a length 1 ``bytes`` object, rather than an integer

Third party types and arbitrary containers of integers that lack the new
method can still be handled by combining ``map`` with the new
``bytes.byte()`` alternate constructor proposed above::

    for x in map(bytes.byte, data):
        # x is a length 1 ``bytes`` object, rather than an integer
        # This works with *any* container of integers in the range
        # 0 to 255 inclusive
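
Without either of the proposed additions, one way to get the same effect in
current Python is to slice rather than index. This is a sketch only:
``iter_length1`` is a hypothetical helper name, and it assumes a
one-dimensional, byte-oriented sequence::

    def iter_length1(data):
        """Hypothetical stand-in for the proposed ``data.iterbytes()``."""
        # Slicing a binary sequence returns a sequence rather than an integer;
        # bytes() normalises bytearray and memoryview slices to ``bytes``.
        for i in range(len(data)):
            yield bytes(data[i:i + 1])

    assert list(b"abc") == [97, 98, 99]  # plain iteration yields integers
    assert list(iter_length1(b"abc")) == [b"a", b"b", b"c"]
    assert list(iter_length1(bytearray(b"abc"))) == [b"a", b"b", b"c"]
    assert list(iter_length1(memoryview(b"abc"))) == [b"a", b"b", b"c"]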

Open questions
^^^^^^^^^^^^^^

* The fallback case above suggests that this could perhaps be better handled
  as an ``iterbytes(data)`` *builtin*, that used ``data.__iterbytes__()``
  if defined, but otherwise fell back to ``map(bytes.byte, data)``::

    for x in iterbytes(data):
        # x is a length 1 ``bytes`` object, rather than an integer
        # This works with *any* container of integers in the range
        # 0 to 255 inclusive
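
To make the dispatch concrete, here is a sketch of such a builtin that runs
on current Python. It assumes ``bytes([n])`` as a substitute for the proposed
``bytes.byte(n)``; the ``__iterbytes__`` hook and the ``Packet`` class are
hypothetical and exist only for illustration::

    def iterbytes(data):
        """Sketch of the possible builtin; ``__iterbytes__`` is hypothetical."""
        try:
            # Look the hook up on the type, as for other special methods.
            getiter = type(data).__iterbytes__
        except AttributeError:
            # Fallback: bytes([n]) stands in for the proposed bytes.byte(n).
            return (bytes([n]) for n in data)
        return getiter(data)

    class Packet:
        """Hypothetical third party type that opts in to the protocol."""
        def __init__(self, payload):
            self.payload = payload

        def __iterbytes__(self):
            payload = self.payload
            return (payload[i:i + 1] for i in range(len(payload)))

    assert list(iterbytes([104, 105])) == [b"h", b"i"]     # fallback path
    assert list(iterbytes(Packet(b"hi"))) == [b"h", b"i"]  # protocol path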

References
==========

.. [ideas-thread1]
   https://mail.python.org/pipermail/python-ideas/2014-March/027295.html
.. [empty-buffer-issue] http://bugs.python.org/issue20895
.. [GvR-initial-feedback]
   https://mail.python.org/pipermail/python-ideas/2014-March/027376.html

Copyright
=========

This document has been placed in the public domain.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia