[Python-ideas] Fixing the Python 3 bytes constructor

Sun Mar 30 08:10:53 CEST 2014

On Sat, Mar 29, 2014 at 7:17 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> On 30 March 2014 07:07, Nick Coghlan <ncoghlan at gmail.com> wrote:
> > I already have a draft PEP written that covers the constructor issue,
> > iteration and adding acceptance of integer inputs to the remaining
> > methods that don't currently handle them. There was some background
> > explanation of the text/binary domain split in the Python 2->3
> > transition that I wanted Guido's feedback on before posting, but I
> > just realised I can cut that out for now, and then add it back after
> > Guido has had a chance to review it.
> >
> > So I'll tidy that up and get the draft posted later today.
>
> Guido pointed out most of the stuff I had asked him to look at wasn't
> actually relevant to the PEP, so I just cut most of it entirely.
> Suffice to say, after stepping back and reviewing them systematically
> for the first time in years, I believe the APIs for the core binary
> data types in Python 3 could do with a little sprucing up :)
>
> Web version: http://www.python.org/dev/peps/pep-0467/
>
> ======================================
> PEP: 467
> Title: Improved API consistency for bytes and bytearray
> Version: $Revision$
> Last-Modified: $Date$
> Author: Nick Coghlan <ncoghlan at gmail.com>
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 2014-03-30
> Python-Version: 3.5
> Post-History: 2014-03-30
>
>
> Abstract
> ========
>
> During the initial development of the Python 3 language specification, the
> core ``bytes`` type for arbitrary binary data started as the mutable type
> that is now referred to as ``bytearray``. Other aspects of operating in
> the binary domain in Python have also evolved over the course of the Python
> 3 series.
>
> This PEP proposes a number of small adjustments to the APIs of the ``bytes``
> and ``bytearray`` types to make their behaviour more internally consistent
> and to make it easier to operate entirely in the binary domain for use cases
> that actually involve manipulating binary data directly, rather than
> converting it to a more structured form with additional modelling
> semantics (such as ``str``) and then converting back to binary format after
> processing.
>
>
> Background
> ==========
>
> Over the course of Python 3's evolution, a number of adjustments have been
> made to the core ``bytes`` and ``bytearray`` types as additional practical
> experience was gained with using them in code beyond the Python 3 standard
> library and test suite. However, to date, these changes have been made
> on a relatively ad hoc tactical basis as specific issues were identified,
> rather than as part of a systematic review of the APIs of these types. This
> approach has allowed inconsistencies to creep into the API design as to which
> input types are accepted by different methods. Additional inconsistencies
> linger from an earlier pre-release design where there was *no* separate
> ``bytearray`` type, and instead the core ``bytes`` type was mutable (with
> no immutable counterpart), as well as from the origins of these types in
> the text-like behaviour of the Python 2 ``str`` type.
>
> This PEP aims to provide the missing systematic review, with the goal of
> ensuring that wherever feasible (given backwards compatibility constraints)
> these current inconsistencies are addressed for the Python 3.5 release.
>
>
> Proposals
> =========
>
> As a "consistency improvement" proposal, this PEP is actually about a number
> of smaller micro-proposals, each aimed at improving the self-consistency of
> the binary data model in Python 3. Proposals are motivated by one of three
> factors:
>
> * removing remnants of the original design of ``bytes`` as a mutable type
> * more consistently accepting length 1 ``bytes`` objects as input where an
>   integer between ``0`` and ``255`` inclusive is expected, and vice-versa
> * allowing users to easily convert integer output to a length 1 ``bytes``
>   object
>
>
> Alternate Constructors
> ----------------------
>
> The ``bytes`` and ``bytearray`` constructors currently accept an integer
> argument, but interpret it to mean a zero-filled object of the given length.
> This is a legacy of the original design of ``bytes`` as a mutable type,
> rather than a particularly intuitive behaviour for users. It has become
> especially confusing now that other ``bytes`` interfaces treat integers
> and the corresponding length 1 bytes instances as equivalent input.
> Compare::
>
>     >>> b"\x03" in bytes([1, 2, 3])
>     True
>     >>> 3 in bytes([1, 2, 3])
>     True
>
>     >>> bytes(b"\x03")
>     b'\x03'
>     >>> bytes(3)
>     b'\x00\x00\x00'
>
> This PEP proposes that the current handling of integers in the bytes and
> bytearray constructors be deprecated in Python 3.5 and removed in Python
> 3.6, being replaced by two more type-appropriate alternate constructors
> provided as class methods. The initial python-ideas thread [ideas-thread1]_
> that spawned this PEP was specifically aimed at deprecating this constructor
> behaviour.
>
> For ``bytes``, a ``byte`` constructor is proposed that converts integers
> (as indicated by ``operator.index``) in the appropriate range to a ``bytes``
> object, converts objects that support the buffer API to bytes, and also
> passes through length 1 byte strings unchanged::
>
>     >>> bytes.byte(3)
>     b'\x03'
>     >>> bytes.byte(bytearray(bytes([3])))
>     b'\x03'
>     >>> bytes.byte(memoryview(bytes([3])))
>     b'\x03'
>     >>> bytes.byte(bytes([3]))
>     b'\x03'
>     >>> bytes.byte(512)
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>     ValueError: bytes must be in range(0, 256)
>     >>> bytes.byte(b"ab")
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>     TypeError: bytes.byte() expected a byte, but buffer of length 2 found
>
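
For what it's worth, here is a rough pure-Python sketch of how I read the proposed semantics. The module-level byte() helper is hypothetical (nothing like it exists in the stdlib today), and the error messages are simply copied from the examples quoted above, so treat it as an illustration rather than a reference implementation.

```python
import operator


def byte(value):
    """Hypothetical stand-in for the proposed bytes.byte() class method."""
    try:
        # Anything implementing __index__ is treated as an integer.
        number = operator.index(value)
    except TypeError:
        # Not an integer: fall back to the buffer protocol. memoryview()
        # raises TypeError for objects that do not support it (e.g. str).
        data = bytes(memoryview(value))
        if len(data) != 1:
            raise TypeError(
                "byte() expected a byte, but buffer of length %d found"
                % len(data)
            ) from None
        return data
    if not 0 <= number < 256:
        raise ValueError("bytes must be in range(0, 256)")
    return bytes([number])
```

With this reading, byte(3), byte(bytearray(b"\x03")) and byte(b"\x03") all give the same length 1 bytes object, while byte(512) and byte(b"ab") fail with ValueError and TypeError respectively.
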
> One specific use case for this alternate constructor is to easily convert
> the result of indexing operations on ``bytes`` and other binary sequences
> from an integer to a ``bytes`` object. The documentation for this API
> should note that its counterpart for the reverse conversion is ``ord()``.
>
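
For comparison, this is the round trip as it can be spelled with the existing API, without the proposed constructor. Only builtins are used; the variable names are just for illustration.

```python
data = b"spam"

first = data[0]            # indexing a bytes object yields an int
as_bytes = bytes([first])  # wrap the int in a list to get a length 1 bytes
sliced = data[0:1]         # slicing sidesteps the int entirely

# ord() maps a length 1 bytes object back to the integer.
round_trip = ord(as_bytes)
```

The bytes([first]) spelling is the step that the proposed bytes.byte() would replace.
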
> For ``bytearray``, a ``from_len`` constructor is proposed that preallocates
> the buffer filled with a particular value (defaulting to ``0``) as a direct
> replacement for the current constructor behaviour, rather than having to use
> sequence repetition to achieve the same effect in a less intuitive way::
>
>     >>> bytearray.from_len(3)
>     bytearray(b'\x00\x00\x00')
>     >>> bytearray.from_len(3, 6)
>     bytearray(b'\x06\x06\x06')
>
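
A minimal sketch of the sequence repetition spelling the draft refers to, wrapped in a hypothetical from_len() helper (a plain function here, not the proposed class method). It probably differs from a real implementation in edge cases such as negative lengths, where repetition quietly yields an empty buffer.

```python
def from_len(length, fill=0):
    """Hypothetical stand-in for the proposed bytearray.from_len()."""
    # bytearray([fill]) raises ValueError if fill is outside range(0, 256);
    # repetition then expands the single byte to the requested length.
    return bytearray([fill]) * length
```

For the default fill value, from_len(3) compares equal to what bytearray(3) returns today.
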
> This part of the proposal was covered by an existing issue
> [empty-buffer-issue]_ and a variety of names have been proposed
> (``empty_buffer``, ``zeros``, ``zeroes``, ``allnull``, ``fill``). The
> specific name currently proposed was chosen by analogy with
> ``dict.fromkeys()`` and ``itertools.chain.from_iterable()`` to be completely
> explicit that it is an alternate constructor rather than an in-place
> mutation, as well as how it differs from the standard constructor.
>
>
> Open questions
> ^^^^^^^^^^^^^^
>
> * Should ``bytearray.byte()`` also be added? Or is
>   ``bytearray(bytes.byte(x))`` sufficient for that case?
> * Should ``bytes.from_len()`` also be added? Or is sequence repetition
>   sufficient for that case?
>

I prefer keeping them consistent across the types myself.

> * Should ``bytearray.from_len()`` use a different name?
>

This name works for me.

> * Should ``bytes.byte()`` raise ``TypeError`` or ``ValueError`` for binary
>   sequences with more than one element? The ``TypeError`` currently proposed
>   is copied (with slightly improved wording) from the behaviour of ``ord()``
>   with sequences containing more than one code point, while ``ValueError``
>   would be more consistent with the existing handling of out-of-range
>   integer values.
> * ``bytes.byte()`` is defined above as accepting length 1 binary sequences
>   as individual bytes, but this is currently inconsistent with the main
>   ``bytes`` constructor::
>

I don't like that bytes.byte() would accept anything other than an int. It
should not accept length 1 binary sequences at all.  I'd prefer to see
bytes.byte(b"X") raise a TypeError.

>       >>> bytes([b"a", b"b", b"c"])
>       Traceback (most recent call last):
>         File "<stdin>", line 1, in <module>
>       TypeError: 'bytes' object cannot be interpreted as an integer
>
>   Should the ``bytes`` constructor be changed to accept iterables of length 1
>   bytes objects in addition to iterables of integers? If so, should it
>   allow a mixture of the two in a single iterable?
>
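
To make the current situation concrete, here is a small example using only existing behaviour: the constructor rejects an iterable of length 1 bytes objects, and the usual spellings today are b"".join() or converting each element with ord() first. The variable names are illustrative only.

```python
chunks = [b"a", b"b", b"c"]

error = None
try:
    bytes(chunks)
except TypeError as exc:
    # Same failure as in the traceback quoted above.
    error = str(exc)

# Existing ways to build the same result:
joined = b"".join(chunks)
from_ints = bytes(ord(chunk) for chunk in chunks)
```
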
>
> Iteration
> ---------
>
> Iteration over ``bytes`` objects and other binary sequences produces
> integers. Rather than proposing a new method that would need to be added
> not only to ``bytes``, ``bytearray`` and ``memoryview``, but potentially
> to third party types as well, this PEP proposes that iteration to produce
> length 1 ``bytes`` objects instead be handled by combining ``map`` with
> the new ``bytes.byte()`` alternate constructor proposed above::
>
>     for x in map(bytes.byte, data):
>         # x is a length 1 ``bytes`` object, rather than an integer
>         # This works with *any* container of integers in the range
>         # 0 to 255 inclusive
>
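
Until something like bytes.byte() exists, the same iteration can be approximated with a small generator. iter_bytes() is a hypothetical helper name, and it assumes every element is an integer in range(0, 256).

```python
def iter_bytes(data):
    """Yield each element of a binary sequence as a length 1 bytes object."""
    for value in data:
        # bytes([value]) raises ValueError for integers outside range(0, 256).
        yield bytes([value])
```

Like the map() spelling in the draft, this works with bytes, bytearray, a byte-oriented memoryview, or a plain list of small integers.
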
>
> Consistent support for different input types
> --------------------------------------------
>
> In Python 3.3, the binary search operations (``in``, ``count()``,
> ``find()``, ``index()``, ``rfind()`` and ``rindex()``) were updated to
> accept integers in the range 0 to 255 (inclusive) as their first argument
> (in addition to the existing support for binary sequences).
>
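
A quick illustration of the existing Python 3.3+ behaviour described above, using only documented bytes methods:

```python
data = b"abcabc"

# Integers in range(0, 256) are accepted wherever a subsequence is.
assert 97 in data               # ord(b"a") == 97
assert data.count(97) == 2
assert data.find(98) == 1       # first b"b"
assert data.rfind(98) == 4      # last b"b"
assert data.index(99) == 2      # first b"c"
assert data.rindex(99) == 5     # last b"c"

# The equivalent length 1 bytes arguments give the same answers.
assert data.find(b"b") == data.find(98)
```
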
> This PEP proposes extending that behaviour of accepting integers as being
> equivalent to the corresponding length 1 binary sequence to several other
> ``bytes`` and ``bytearray`` methods that currently expect a ``bytes``
> object for certain parameters. In essence, if a value is an acceptable
> input to the new ``bytes.byte`` constructor defined above, then it would
> be acceptable in the roles defined here (in addition to any other already
> supported inputs):
>
> * ``startswith()`` prefix(es)
> * ``endswith()`` suffix(es)
>
> * ``center()`` fill character
> * ``ljust()`` fill character
> * ``rjust()`` fill character
>
> * ``strip()`` character to strip
> * ``lstrip()`` character to strip
> * ``rstrip()`` character to strip
>
> * ``partition()`` separator argument
> * ``rpartition()`` separator argument
>
> * ``split()`` separator argument
> * ``rsplit()`` separator argument
>
> * ``replace()`` old value and new value
>
> In addition to the consistency motive, this approach also makes it easier
> to work with the indexing behaviour, as the result of an indexing operation
> can more easily be fed back in to other methods.
>
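
As a concrete case of the friction being described: in the Python versions I'm aware of, feeding the result of an indexing operation straight back into partition() fails, so the integer has to be converted back by hand. The example below uses only existing behaviour, and the names are illustrative.

```python
data = b"key=value"
sep = data[3]  # 61, i.e. ord(b"=")

try:
    data.partition(sep)
except TypeError:
    # partition() currently requires a bytes-like separator.
    accepts_int = False
else:
    accepts_int = True

# Portable spelling: convert the int back to a length 1 bytes object first.
key, _, value = data.partition(bytes([sep]))
```
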
> For ``bytearray``, some additional changes are proposed to the current
> integer based operations to ensure they remain consistent with the proposed
> constructor changes:
>
> * ``append()``: updated to be consistent with ``bytes.byte()``
> * ``remove()``: updated to be consistent with ``bytes.byte()``
> * ``+=``: updated to be consistent with ``bytes()`` changes (if any)
>

Where was a change to += behavior mentioned? I don't see that above (or did
I miss something?).

> * ``extend()``: updated to be consistent with ``bytes()`` changes (if any)
>
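
For reference, this is how these four operations behave today as far as I can tell: append() accepts only integers, extend() accepts iterables of integers (which includes bytes-like objects), and += accepts bytes-like objects. The snippet relies only on existing bytearray behaviour.

```python
buf = bytearray(b"ab")

buf.append(99)          # a single int
buf.extend([100, 101])  # an iterable of ints
buf += b"fg"            # a bytes-like object

try:
    buf.append(b"h")    # a length 1 bytes object is rejected today
except TypeError:
    appended = False
else:
    appended = True
```
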
>
> Acknowledgement of surprising behaviour of some ``bytearray`` methods
> ---------------------------------------------------------------------
>
> Several of the ``bytes`` and ``bytearray`` methods have their origins in the
> Python 2 ``str`` API. As ``str`` is an immutable type, all of these
> operations are defined as returning a *new* instance, rather than operating
> in place. This contrasts with methods on other mutable types like ``list``,
> where ``list.sort()`` and ``list.reverse()`` operate in-place and return
> ``None``, rather than creating a new object.
>
> Backwards compatibility constraints make it impractical to change this
> behaviour at this point, but it may be appropriate to explicitly call out
> this quirk in the documentation for the ``bytearray`` type. It affects the
> following methods that could reasonably be expected to operate in-place on
> a mutable type:
>
> * ``center()``
> * ``ljust()``
> * ``rjust()``
> * ``strip()``
> * ``lstrip()``
> * ``rstrip()``
> * ``replace()``
> * ``lower()``
> * ``upper()``
> * ``swapcase()``
> * ``title()``
> * ``capitalize()``
> * ``translate()``
> * ``expandtabs()``
> * ``zfill()``
>
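
A short demonstration of the quirk using upper(), plus the slice assignment idiom that gives an in-place update with the existing API:

```python
buf = bytearray(b"spam")
alias = buf

result = buf.upper()   # returns a new bytearray
before = bytes(buf)    # snapshot showing buf itself was not modified

# Assigning through a full slice updates the original object in place,
# so every other reference to it (such as alias) sees the change.
buf[:] = result
```
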
> Note that the following ``bytearray`` operations *do* operate in place, as
> they're part of the mutable sequence API in ``bytearray``, rather than being
> inspired by the immutable Python 2 ``str`` API:
>
> * ``+=``
> * ``append()``
> * ``extend()``
> * ``reverse()``
> * ``remove()``
> * ``pop()``
>
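
And the contrasting case, again with existing behaviour only: each of the operations listed above mutates the same object, and the methods that return nothing useful return None, just as on list.

```python
buf = bytearray(b"abc")
alias = buf

buf += b"d"               # in-place concatenation: b"abcd"
buf.append(101)           # b"abcde"
buf.extend(b"fg")         # b"abcdefg"
returned = buf.reverse()  # reverses in place and returns None
last = buf.pop()          # removes and returns the final int
buf.remove(103)           # removes the first occurrence of ord(b"g")
```
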
>
> References
> ==========
>
> .. [ideas-thread1] https://mail.python.org/pipermail/python-ideas/2014-March/027295.html
> .. [empty-buffer-issue] http://bugs.python.org/issue20895
>
>
> Copyright
> =========
>
> This document has been placed in the public domain.
>
> --
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
>