[Python-ideas] Fixing the Python 3 bytes constructor

Sun Mar 30 04:17:08 CEST 2014

On 30 March 2014 07:07, Nick Coghlan <ncoghlan at gmail.com> wrote:
> I already have a draft PEP written that covers the constructor issue,
> iteration and adding acceptance of integer inputs to the remaining
> methods that don't currently handle them. There was some background
> explanation of the text/binary domain split in the Python 2->3
> transition that I wanted Guido's feedback on before posting, but I
> just realised I can cut that out for now, and then add it back after
> Guido has had a chance to review it.
>
> So I'll tidy that up and get the draft posted later today.

Guido pointed out most of the stuff I had asked him to look at wasn't
actually relevant to the PEP, so I just cut most of it entirely.
Suffice to say, after stepping back and reviewing them systematically
for the first time in years, I believe the APIs for the core binary
data types in Python 3 could do with a little sprucing up :)

Web version: http://www.python.org/dev/peps/pep-0467/

======================================
PEP: 467
Title: Improved API consistency for bytes and bytearray
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan <ncoghlan at gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2014-03-30
Python-Version: 3.5
Post-History: 2014-03-30

Abstract
========

During the initial development of the Python 3 language specification, the
core ``bytes`` type for arbitrary binary data started as the mutable type
that is now referred to as ``bytearray``. Other aspects of operating in
the binary domain in Python have also evolved over the course of the Python
3 series.

This PEP proposes a number of small adjustments to the APIs of the ``bytes``
and ``bytearray`` types to make their behaviour more internally consistent
and to make it easier to operate entirely in the binary domain for use cases
that actually involve manipulating binary data directly, rather than
converting it to a more structured form with additional modelling
semantics (such as ``str``) and then converting back to binary format after
processing.

Background
==========

Over the course of Python 3's evolution, a number of adjustments have been
made to the core ``bytes`` and ``bytearray`` types as additional practical
experience was gained with using them in code beyond the Python 3 standard
library and test suite. However, to date, these changes have been made
on a relatively ad hoc tactical basis as specific issues were identified,
rather than as part of a systematic review of the APIs of these types. This
approach has allowed inconsistencies to creep into the API design as to which
input types are accepted by different methods. Additional inconsistencies
linger from an earlier pre-release design where there was *no* separate
``bytearray`` type, and instead the core ``bytes`` type was mutable (with
no immutable counterpart), as well as from the origins of these types in
the text-like behaviour of the Python 2 ``str`` type.

This PEP aims to provide the missing systematic review, with the goal of
ensuring that wherever feasible (given backwards compatibility constraints)
these current inconsistencies are addressed for the Python 3.5 release.

Proposals
=========

As a "consistency improvement" proposal, this PEP is actually about a number
of smaller micro-proposals, each aimed at improving the self-consistency of
the binary data model in Python 3. Proposals are motivated by one of three
factors:

* removing remnants of the original design of ``bytes`` as a mutable type
* more consistently accepting length 1 ``bytes`` objects as input where an
  integer between ``0`` and ``255`` inclusive is expected, and vice-versa
* allowing users to easily convert integer output to a length 1 ``bytes``
  object

Alternate Constructors
----------------------

The ``bytes`` and ``bytearray`` constructors currently accept an integer
argument, but interpret it to mean a zero-filled object of the given length.
This is a legacy of the original design of ``bytes`` as a mutable type,
rather than a particularly intuitive behaviour for users. It has become
especially confusing now that other ``bytes`` interfaces treat integers
and the corresponding length 1 bytes instances as equivalent input.
Compare::

    >>> b"\x03" in bytes([1, 2, 3])
    True
    >>> 3 in bytes([1, 2, 3])
    True

    >>> bytes(b"\x03")
    b'\x03'
    >>> bytes(3)
    b'\x00\x00\x00'

This PEP proposes that the current handling of integers in the ``bytes`` and
``bytearray`` constructors be deprecated in Python 3.5 and removed in Python
3.6, being replaced by two more type-appropriate alternate constructors
provided as class methods. The initial python-ideas thread [ideas-thread1]_
that spawned this PEP was specifically aimed at deprecating this constructor
behaviour.
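As a small illustration of the ambiguity, the following sketch contrasts the existing integer handling with explicit spellings that already work without it. The variable names are purely illustrative.

```python
# Current behaviour: an integer argument is interpreted as a length.
zero_filled = bytes(3)

# Explicit spellings that do not rely on the integer constructor.
single_byte = bytes([3])                 # one byte with the value 3
repeated = b"\x00" * 3                   # three zero bytes via repetition
mutable_zeros = bytearray(b"\x00" * 3)   # mutable equivalent
```

Both ``repeated`` and ``mutable_zeros`` compare equal to the corresponding zero-filled constructor results, while ``single_byte`` is what many users expect ``bytes(3)`` to produce.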

For ``bytes``, a ``byte`` constructor is proposed that converts integers
(as indicated by ``operator.index``) in the appropriate range to a ``bytes``
object, converts objects that support the buffer API to bytes, and also
passes through length 1 byte strings unchanged::

    >>> bytes.byte(3)
    b'\x03'
    >>> bytes.byte(bytearray(bytes([3])))
    b'\x03'
    >>> bytes.byte(memoryview(bytes([3])))
    b'\x03'
    >>> bytes.byte(bytes([3]))
    b'\x03'
    >>> bytes.byte(512)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: bytes must be in range(0, 256)
    >>> bytes.byte(b"ab")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: bytes.byte() expected a byte, but buffer of length 2 found

One specific use case for this alternate constructor is to easily convert
the result of indexing operations on ``bytes`` and other binary sequences
from an integer to a ``bytes`` object. The documentation for this API
should note that its counterpart for the reverse conversion is ``ord()``.
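The intended semantics can be approximated in current Python with a helper function. This is only a sketch under stated assumptions: the function name ``byte`` and the exact error wording are hypothetical stand-ins, not the proposed implementation.

```python
import operator


def byte(value):
    """Hypothetical pure-Python stand-in for the proposed bytes.byte()."""
    try:
        # Integers are anything that supports __index__.
        number = operator.index(value)
    except TypeError:
        # Otherwise treat the argument as a buffer holding exactly one byte.
        data = bytes(memoryview(value))
        if len(data) != 1:
            raise TypeError(
                "byte() expected a byte, but buffer of length %d found"
                % len(data)
            )
        return data
    # bytes() raises ValueError for values outside range(0, 256).
    return bytes([number])
```

With this helper, ``byte(b"abc"[0])`` gives ``b'a'``, and ``ord(byte(97))`` round-trips back to ``97``.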

For ``bytearray``, a ``from_len`` constructor is proposed that preallocates
the buffer filled with a particular value (defaulting to ``0``) as a direct
replacement for the current constructor behaviour, rather than having to use
sequence repetition to achieve the same effect in a less intuitive way::

    >>> bytearray.from_len(3)
    bytearray(b'\x00\x00\x00')
    >>> bytearray.from_len(3, 6)
    bytearray(b'\x06\x06\x06')

This part of the proposal was covered by an existing issue
[empty-buffer-issue]_ and a variety of names have been proposed
(``empty_buffer``, ``zeros``, ``zeroes``, ``allnull``, ``fill``). The
specific name currently proposed was chosen by analogy with
``dict.fromkeys()`` and ``itertools.chain.from_iterable()`` to be completely
explicit that it is an alternate constructor rather than an in-place
mutation, as well as how it differs from the standard constructor.
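For comparison, the sequence repetition approach mentioned above can be wrapped in a helper today. The name ``bytearray_from_len`` is hypothetical and is only meant to mirror the proposed signature.

```python
def bytearray_from_len(length, fill=0):
    """Hypothetical stand-in for the proposed bytearray.from_len()."""
    # bytes([fill]) checks that fill is in range(0, 256);
    # repetition then builds a buffer of the requested length.
    return bytearray(bytes([fill]) * length)
```

Calling ``bytearray_from_len(3)`` and ``bytearray_from_len(3, 6)`` gives the same results as the two ``from_len`` examples shown earlier.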

Open questions
^^^^^^^^^^^^^^

* Should ``bytearray.byte()`` also be added? Or is
  ``bytearray(bytes.byte(x))`` sufficient for that case?
* Should ``bytes.from_len()`` also be added? Or is sequence repetition
  sufficient for that case?
* Should ``bytearray.from_len()`` use a different name?
* Should ``bytes.byte()`` raise ``TypeError`` or ``ValueError`` for binary
  sequences with more than one element? The ``TypeError`` currently proposed
  is copied (with slightly improved wording) from the behaviour of ``ord()``
  with sequences containing more than one code point, while ``ValueError``
  would be more consistent with the existing handling of out-of-range
  integer values.
* ``bytes.byte()`` is defined above as accepting length 1 binary sequences
  as individual bytes, but this is currently inconsistent with the main
  ``bytes`` constructor::

      >>> bytes([b"a", b"b", b"c"])
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      TypeError: 'bytes' object cannot be interpreted as an integer

  Should the ``bytes`` constructor be changed to accept iterables of length 1
  bytes objects in addition to iterables of integers? If so, should it
  allow a mixture of the two in a single iterable?
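For context on the last question, an iterable of length 1 ``bytes`` objects can already be combined without changing the constructor. The sketch below shows two existing spellings; the variable names are illustrative only.

```python
chunks = [b"a", b"b", b"c"]

# Joining works for bytes objects of any length.
joined = b"".join(chunks)

# ord() converts each length 1 bytes object to the integer
# that the bytes constructor expects.
from_ints = bytes(ord(chunk) for chunk in chunks)
```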

Iteration
---------

Iteration over ``bytes`` objects and other binary sequences produces
integers. Rather than proposing a new method that would need to be added
not only to ``bytes``, ``bytearray`` and ``memoryview``, but potentially
to third party types as well, this PEP proposes that iteration to produce
length 1 ``bytes`` objects instead be handled by combining ``map`` with
the new ``bytes.byte()`` alternate constructor proposed above::

    for x in map(bytes.byte, data):
        # x is a length 1 bytes object, rather than an integer.
        # This works with *any* container of integers in the range
        # 0 to 255 inclusive.
        ...
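Until such a constructor exists, the same effect can be had with spellings that already work. The following is a minimal sketch, with illustrative variable names.

```python
data = b"abc"

# Plain iteration yields integers.
as_ints = list(data)

# Length 1 slices yield bytes objects instead.
singles = [data[i:i + 1] for i in range(len(data))]

# Equivalent spelling for any iterable of integers in range(0, 256).
singles_from_ints = [bytes([value]) for value in data]
```

The slicing form only works for sequences, whereas the ``bytes([value])`` form accepts any iterable of suitable integers, which is the property the ``map`` based approach is meant to keep.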

Consistent support for different input types
--------------------------------------------

In Python 3.3, the binary search operations (``in``, ``count()``,
``find()``, ``index()``, ``rfind()`` and ``rindex()``) were updated to
accept integers in the range 0 to 255 (inclusive) as their first argument
(in addition to the existing support for binary sequences).
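A short sketch of that existing equivalence, runnable on Python 3.3 and later:

```python
data = b"\x01\x02\x03\x02"

# Each search accepts either an integer or the matching length 1 bytes object.
contains_both = (2 in data) and (b"\x02" in data)
count_int, count_bytes = data.count(2), data.count(b"\x02")
find_int, find_bytes = data.find(3), data.find(b"\x03")
last_two = data.rindex(2)
```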

This PEP proposes extending that behaviour of accepting integers as being
equivalent to the corresponding length 1 binary sequence to several other
``bytes`` and ``bytearray`` methods that currently expect a ``bytes``
object for certain parameters. In essence, if a value is an acceptable
input to the new ``bytes.byte`` constructor defined above, then it would
be acceptable in the roles defined here (in addition to any other already
supported inputs):

* ``startswith()`` prefix(es)
* ``endswith()`` suffix(es)

* ``center()`` fill character
* ``ljust()`` fill character
* ``rjust()`` fill character

* ``strip()`` character to strip
* ``lstrip()`` character to strip
* ``rstrip()`` character to strip

* ``partition()`` separator argument
* ``rpartition()`` separator argument

* ``split()`` separator argument
* ``rsplit()`` separator argument

* ``replace()`` old value and new value

In addition to the consistency motive, this approach also makes it easier
to work with the indexing behaviour, as the result of an indexing operation
can more easily be fed back into other methods.
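To make the friction concrete, the sketch below feeds an indexing result back into ``startswith()``. In current Python the integer is rejected with ``TypeError``, so it has to be wrapped first. The variable names are illustrative.

```python
data = b"GET /index.html"
first = data[0]  # indexing yields an integer

try:
    data.startswith(first)
    accepts_int = True
except TypeError:
    # Current behaviour: only bytes (or a tuple of bytes) is accepted.
    accepts_int = False

# Spelling that works today: wrap the integer in a length 1 bytes object.
wrapped_ok = data.startswith(bytes([first]))
```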

For ``bytearray``, some additional changes are proposed to the current
integer based operations to ensure they remain consistent with the proposed
constructor changes:

* ``append()``: updated to be consistent with ``bytes.byte()``
* ``remove()``: updated to be consistent with ``bytes.byte()``
* ``+=``: updated to be consistent with ``bytes()`` changes (if any)
* ``extend()``: updated to be consistent with ``bytes()`` changes (if any)
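For reference, a sketch of how these operations treat their inputs today: ``append()`` only takes an integer, while ``+=`` and ``extend()`` take binary sequences or iterables of integers.

```python
buf = bytearray(b"ab")
buf.append(99)            # integers are accepted

try:
    buf.append(b"d")      # length 1 bytes objects currently are not
    append_accepts_bytes = True
except TypeError:
    append_accepts_bytes = False

buf.extend(b"d")          # binary sequences are accepted here
buf += b"e"
```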

Acknowledgement of surprising behaviour of some ``bytearray`` methods
---------------------------------------------------------------------

Several of the ``bytes`` and ``bytearray`` methods have their origins in the
Python 2 ``str`` API. As ``str`` is an immutable type, all of these
operations are defined as returning a *new* instance, rather than operating
in place. This contrasts with methods on other mutable types like ``list``,
where ``list.sort()`` and ``list.reverse()`` operate in-place and return
``None``, rather than creating a new object.

Backwards compatibility constraints make it impractical to change this
behaviour at this point, but it may be appropriate to explicitly call out
this quirk in the documentation for the ``bytearray`` type. It affects the
following methods that could reasonably be expected to operate in-place on
a mutable type:

* ``center()``
* ``ljust()``
* ``rjust()``
* ``strip()``
* ``lstrip()``
* ``rstrip()``
* ``replace()``
* ``lower()``
* ``upper()``
* ``swapcase()``
* ``title()``
* ``capitalize()``
* ``translate()``
* ``expandtabs()``
* ``zfill()``
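A brief sketch of the quirk, along with slice assignment as one way to get an in-place result when that is what is wanted:

```python
buf = bytearray(b"  spam  ")
original_id = id(buf)

stripped = buf.strip()    # returns a new bytearray; buf is unchanged
unchanged = bytes(buf)

# Slice assignment writes the result back into the existing object.
buf[:] = buf.strip().upper()
```

After the slice assignment ``buf`` is still the same object, now holding ``b'SPAM'``.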

Note that the following ``bytearray`` operations *do* operate in place, as
they're part of the mutable sequence API in ``bytearray``, rather than being
inspired by the immutable Python 2 ``str`` API:

* ``+=``
* ``append()``
* ``extend()``
* ``reverse()``
* ``remove()``
* ``pop()``
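By contrast, a sketch showing that each of these operations mutates the existing object, so a second reference to it sees every change:

```python
buf = bytearray(b"abc")
alias = buf               # second reference to the same object

buf += b"d"               # bytearray(b'abcd')
buf.append(101)           # bytearray(b'abcde')
buf.extend(b"fg")         # bytearray(b'abcdefg')
buf.reverse()             # bytearray(b'gfedcba')
buf.remove(ord("g"))      # bytearray(b'fedcba')
last = buf.pop()          # removes and returns 97, the value of b'a'
```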

References
==========

.. [ideas-thread1] https://mail.python.org/pipermail/python-ideas/2014-March/027295.html
.. [empty-buffer-issue] http://bugs.python.org/issue20895

Copyright
=========

This document has been placed in the public domain.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia