[Python-ideas] Fixing the Python 3 bytes constructor

Guido van Rossum guido at python.org
Sun Mar 30 18:05:44 CEST 2014

On Sat, Mar 29, 2014 at 7:17 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> On 30 March 2014 07:07, Nick Coghlan <ncoghlan at gmail.com> wrote:
> > I already have a draft PEP written that covers the constructor issue,
> > iteration and adding acceptance of integer inputs to the remaining
> > methods that don't currently handle them. There was some background
> > explanation of the text/binary domain split in the Python 2->3
> > transition that I wanted Guido's feedback on before posting, but I
> > just realised I can cut that out for now, and then add it back after
> > Guido has had a chance to review it.
> >
> > So I'll tidy that up and get the draft posted later today.
> Guido pointed out most of the stuff I had asked him to look at wasn't
> actually relevant to the PEP, so I just cut most of it entirely.
> Suffice to say, after stepping back and reviewing them systematically
> for the first time in years, I believe the APIs for the core binary
> data types in Python 3 could do with a little sprucing up :)

Thanks for cutting it down; it's easier to concentrate on the essentials.

> Web version: http://www.python.org/dev/peps/pep-0467/
> ======================================
> PEP: 467
> Title: Improved API consistency for bytes and bytearray
> Version: $Revision$
> Last-Modified: $Date$
> Author: Nick Coghlan <ncoghlan at gmail.com>
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 2014-03-30
> Python-Version: 3.5
> Post-History: 2014-03-30
> Abstract
> ========
> During the initial development of the Python 3 language specification, the
> core ``bytes`` type for arbitrary binary data started as the mutable type
> that is now referred to as ``bytearray``. Other aspects of operating in
> the binary domain in Python have also evolved over the course of the Python
> 3 series.
> This PEP proposes a number of small adjustments to the APIs of the
> ``bytes`` and ``bytearray`` types to make their behaviour more internally
> consistent and to make it easier to operate entirely in the binary domain
> for use cases that actually involve manipulating binary data directly,
> rather than converting it to a more structured form with additional
> modelling semantics (such as ``str``) and then converting back to binary
> format after processing.

I hope you don't mind I cut the last 60% of this sentence (everything after
"binary domain").

> Background
> ==========
> Over the course of Python 3's evolution, a number of adjustments have been
> made to the core ``bytes`` and ``bytearray`` types as additional practical
> experience was gained with using them in code beyond the Python 3 standard
> library and test suite. However, to date, these changes have been made
> on a relatively ad hoc tactical basis as specific issues were identified,
> rather than as part of a systematic review of the APIs of these types.

I'm not sure you can claim that. We probably have more information based on
experience now than when we did the redesign. (At that time most experience
was based on using str() for binary data.)

> This approach has allowed inconsistencies to creep into the API design as
> to which input types are accepted by different methods. Additional
> inconsistencies linger from an earlier pre-release design where there was
> *no* separate ``bytearray`` type, and instead the core ``bytes`` type was
> mutable (with no immutable counterpart), as well as from the origins of
> these types in the text-like behaviour of the Python 2 ``str`` type.

You make it sound as if modeling bytes() after Python 2's str() was an
accident. It wasn't.

> This PEP aims to provide the missing systematic review, with the goal of
> ensuring that wherever feasible (given backwards compatibility constraints)
> these current inconsistencies are addressed for the Python 3.5 release.

I would like to convince you to aim lower, drop the "systematic review",
and just focus on some changes that are likely to improve users' experience
(which includes porting Python 2 code).

> Proposals
> =========
> As a "consistency improvement" proposal, this PEP is actually about a
> number of smaller micro-proposals, each aimed at improving the
> self-consistency of the binary data model in Python 3. Proposals are
> motivated by one of three factors:
> * removing remnants of the original design of ``bytes`` as a mutable type


> * more consistently accepting length 1 ``bytes`` objects as input where an
>   integer between ``0`` and ``255`` inclusive is expected, and vice-versa

Not sure I like this as a goal. OK, stronger: I don't like this goal.

> * allowing users to easily convert integer output to a length 1 ``bytes``
>   object

I think you meant integer values instead of output? In Python 2 we did this
with the global function chr(), but in Python 3 that creates a str(). (The
history of chr() and ord() as built-in functions is that they long predate
the notion of methods (class- or otherwise), and their naming comes
straight from Pascal.)

Anyway, I don't know that the use case is so common that it needs more than
bytes([i]) or bytearray([i]) -- if there is an argument to be made for
bytes.byte(i) and bytearray.byte(i) it would be that the [i] in the
constructor is somewhat hard to grasp.
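
For illustration, here is a small sketch of the spellings that already exist for turning an integer into a length 1 bytes object. The variable names are only placeholders; nothing here is part of the proposal.

```python
# Existing ways to turn an integer into a length 1 bytes object.
i = 65

via_list = bytes([i])                  # wrap the integer in a one-element list
via_to_bytes = i.to_bytes(1, "big")    # int.to_bytes(), available since 3.2
via_bytearray = bytearray([i])         # same spelling for the mutable type

# chr() is not an option in Python 3: it returns a str, not bytes.
as_text = chr(i)

print(via_list, via_to_bytes, via_bytearray, repr(as_text))
```

The one-element list is what makes ``bytes([i])`` easy to misread as ``bytes(i)``, which means something entirely different.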

> Alternate Constructors
> ----------------------
> The ``bytes`` and ``bytearray`` constructors currently accept an integer
> argument, but interpret it to mean a zero-filled object of the given
> length.

This is one of the two legacies of the original "mutable bytes" design, and
I agree we should strive to replace it -- although I think one round of
deprecation may be too quick. (The other legacy is of course that b[i] is
an int, not a bytes -- it's the worse problem, but I don't think we can fix
it without breaking more than the fix would be worth.)
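
A short sketch of the two legacies as they behave today (the variable names are illustrative only):

```python
data = b"abc"

# Legacy 1: an integer argument means "this many NUL bytes".
zero_filled = bytes(3)
# A one-element iterable means "a single byte with this value".
single = bytes([3])

# Legacy 2: indexing yields an int, while slicing yields bytes.
first_as_int = data[0]
first_as_bytes = data[0:1]

print(zero_filled, single, first_as_int, first_as_bytes)
```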

> This is a legacy of the original design of ``bytes`` as a mutable type,
> rather than a particularly intuitive behaviour for users. It has become
> especially confusing now that other ``bytes`` interfaces treat integers
> and the corresponding length 1 bytes instances as equivalent input.
> Compare::
>     >>> b"\x03" in bytes([1, 2, 3])
>     True
>     >>> 3 in bytes([1, 2, 3])
>     True
>     >>> bytes(b"\x03")
>     b'\x03'
>     >>> bytes(3)
>     b'\x00\x00\x00'
> This PEP proposes that the current handling of integers in the bytes and
> bytearray constructors be deprecated in Python 3.5 and removed in Python
> 3.6, being replaced by two more type appropriate alternate constructors
> provided as class methods. The initial python-ideas thread [ideas-thread1]_
> that spawned this PEP was specifically aimed at deprecating this
> constructor behaviour.
> For ``bytes``, a ``byte`` constructor is proposed that converts integers
> (as indicated by ``operator.index``)

I know why you reference this, but it feels confusing to me. At this point
in the narrative it's better to just say "integer" and explain how it
decides "integer-ness" later.

> in the appropriate range to a ``bytes``
> object, converts objects that support the buffer API to bytes, and also
> passes through length 1 byte strings unchanged::

I think the second half (accepting bytes instances of length 1) is wrong
here and doesn't actually have a practical use case. I'll say more below.

>     >>> bytes.byte(3)
>     b'\x03'
>     >>> bytes.byte(bytearray(bytes([3])))
>     b'\x03'
>     >>> bytes.byte(memoryview(bytes([3])))
>     b'\x03'
>     >>> bytes.byte(bytes([3]))
>     b'\x03'
>     >>> bytes.byte(512)
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>     ValueError: bytes must be in range(0, 256)
>     >>> bytes.byte(b"ab")
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>     TypeError: bytes.byte() expected a byte, but buffer of length 2 found
> One specific use case for this alternate constructor is to easily convert
> the result of indexing operations on ``bytes`` and other binary sequences
> from an integer to a ``bytes`` object. The documentation for this API
> should note that its counterpart for the reverse conversion is ``ord()``.

However, in a pinch, b[0] will do as well, assuming you don't need the
length check implied by ord().
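
To make the difference concrete, a minimal sketch comparing the two spellings on current Python 3 (names are illustrative):

```python
one = b"a"
two = b"ab"

# For a length 1 object the two spellings agree.
same = ord(one) == one[0]

# Indexing silently takes the first byte of a longer object.
first = two[0]

# ord() insists on exactly one byte and raises TypeError otherwise.
try:
    ord(two)
except TypeError as exc:
    error = exc
else:
    error = None

print(same, first, type(error).__name__)
```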

> For ``bytearray``, a ``from_len`` constructor is proposed that preallocates
> the buffer filled with a particular value (defaulting to ``0``) as a direct
> replacement for the current constructor behaviour, rather than having to
> use sequence repetition to achieve the same effect in a less intuitive
> way::
>     >>> bytearray.from_len(3)
>     bytearray(b'\x00\x00\x00')
>     >>> bytearray.from_len(3, 6)
>     bytearray(b'\x06\x06\x06')
> This part of the proposal was covered by an existing issue
> [empty-buffer-issue]_ and a variety of names have been proposed
> (``empty_buffer``, ``zeros``, ``zeroes``, ``allnull``, ``fill``). The
> specific name currently proposed was chosen by analogy with
> ``dict.fromkeys()`` and ``itertools.chain.from_iterable()`` to be
> completely explicit that it is an alternate constructor rather than an
> in-place mutation, as well as how it differs from the standard constructor.

I think you need to brainstorm more on the name; from_len() looks pretty
awkward. And I think it's better to add it to bytes() as well, since the
two classes intentionally try to be as similar as possible.
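
As a rough sketch of what such a constructor would do for both classes, written as a plain function built on sequence repetition. The name ``filled`` and its signature are hypothetical, chosen only for this example; no such method exists on ``bytes`` or ``bytearray``.

```python
def filled(cls, length, value=0):
    """Hypothetical stand-in for the proposed alternate constructor.

    Returns a ``cls`` instance (``bytes`` or ``bytearray``) holding
    ``length`` copies of the byte ``value``, via sequence repetition.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    # cls([value]) raises ValueError when value is outside range(256).
    return cls([value]) * length


print(filled(bytes, 3))
print(filled(bytearray, 3, 6))
```

The same body works for both types, which is part of the argument for offering the constructor on both.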

> Open questions
> ^^^^^^^^^^^^^^
> * Should ``bytearray.byte()`` also be added? Or is
>   ``bytearray(bytes.byte(x))`` sufficient for that case?

It should be added.

> * Should ``bytes.from_len()`` also be added? Or is sequence repetition
>   sufficient for that case?

It should be added.

> * Should ``bytearray.from_len()`` use a different name?


> * Should ``bytes.byte()`` raise ``TypeError`` or ``ValueError`` for binary
>   sequences with more than one element? The ``TypeError`` currently
>   proposed is copied (with slightly improved wording) from the behaviour
>   of ``ord()`` with sequences containing more than one code point, while
>   ``ValueError`` would be more consistent with the existing handling of
>   out-of-range integer values.

It should not accept any bytes arguments. But if somehow you convince me
otherwise, it should be ValueError (and honestly, ord() is wrong there).

> * ``bytes.byte()`` is defined above as accepting length 1 binary sequences
>   as individual bytes, but this is currently inconsistent with the main
>   ``bytes`` constructor::
>       >>> bytes([b"a", b"b", b"c"])
>       Traceback (most recent call last):
>         File "<stdin>", line 1, in <module>
>       TypeError: 'bytes' object cannot be interpreted as an integer
>   Should the ``bytes`` constructor be changed to accept iterables of
>   length 1 bytes objects in addition to iterables of integers? If so,
>   should it allow a mixture of the two in a single iterable?


> Iteration
> ---------
> Iteration over ``bytes`` objects and other binary sequences produces
> integers. Rather than proposing a new method that would need to be added
> not only to ``bytes``, ``bytearray`` and ``memoryview``, but potentially
> to third party types as well, this PEP proposes that iteration to produce
> length 1 ``bytes`` objects instead be handled by combining ``map`` with
> the new ``bytes.byte()`` alternate constructor proposed above::
>     for x in map(bytes.byte, data):
>         # x is a length 1 ``bytes`` object, rather than an integer
>         # This works with *any* container of integers in the range
>         # 0 to 255 inclusive

I can see why you don't like a new method, but this idiom is way too
verbose and unintuitive to ever gain traction. Let's just add a new method
to all three types, 3rd party types will get the message.
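
A minimal sketch of what such a method would produce, written here as a free function. The name ``iter_bytes`` is a placeholder for this example, not a proposed spelling.

```python
def iter_bytes(data):
    """Yield each byte of ``data`` as a length 1 bytes object.

    Hypothetical helper: slicing keeps the result a sequence, and the
    bytes() call normalises bytearray and memoryview slices to bytes.
    """
    for i in range(len(data)):
        yield bytes(data[i:i + 1])


for chunk in iter_bytes(b"abc"):
    print(chunk)
```

This sketch assumes a one-dimensional buffer of single bytes, which covers ``bytes``, ``bytearray`` and a plain ``memoryview`` over either.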

> Consistent support for different input types
> --------------------------------------------
> In Python 3.3, the binary search operations (``in``, ``count()``,
> ``find()``, ``index()``, ``rfind()`` and ``rindex()``) were updated to
> accept integers in the range 0 to 255 (inclusive) as their first argument
> (in addition to the existing support for binary sequences).

I wonder if that wasn't a bit over-zealous. While 'in', count() and index()
are sequence methods (looking for elements) that have an extended meaning
(looking for substrings) for string types, the find() and r*() variants are
only defined for strings.
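
For reference, a short sketch of how these operations behave with integer arguments on Python 3.3 and later (variable names are illustrative):

```python
data = bytes([1, 2, 3, 3])

# Sequence-style lookups: searching for an element.
contains_int = 3 in data
count_int = data.count(3)
index_int = data.index(3)

# String-style lookup: searching for a substring.
contains_sub = b"\x02\x03" in data

# The string-only methods accept an integer as well.
find_int = data.find(3)
rfind_int = data.rfind(3)
find_missing = data.find(9)

print(contains_int, count_int, index_int, contains_sub)
print(find_int, rfind_int, find_missing)
```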

> This PEP proposes extending that behaviour of accepting integers as being
> equivalent to the corresponding length 1 binary sequence to several other
> ``bytes`` and ``bytearray`` methods that currently expect a ``bytes``
> object for certain parameters. In essence, if a value is an acceptable
> input to the new ``bytes.byte`` constructor defined above, then it would
> be acceptable in the roles defined here (in addition to any other already
> supported inputs):
> * ``startswith()`` prefix(es)
> * ``endswith()`` suffix(es)
> * ``center()`` fill character
> * ``ljust()`` fill character
> * ``rjust()`` fill character
> * ``strip()`` character to strip
> * ``lstrip()`` character to strip
> * ``rstrip()`` character to strip
> * ``partition()`` separator argument
> * ``rpartition()`` separator argument
> * ``split()`` separator argument
> * ``rsplit()`` separator argument
> * ``replace()`` old value and new value
> In addition to the consistency motive, this approach also makes it easier
> to work with the indexing behaviour, as the result of an indexing
> operation can more easily be fed back in to other methods.

I think herein lies madness. The intention seems to be to paper over as
much as possible the unfortunate behavior of b[i]. But how often does any
of these methods get called with such a construct? And how often will that
be in a context where this is the *only* thing that is affected by b[i]
returning an int in Python 3 but a string in Python 2? (In my experience
these are mostly called with literal arguments, except inside wrapper
functions that are themselves intended to be called with a literal
argument.) Weakening the type checking here seems a bad idea -- it would
accept integers in *any* context, and that would just cause more nasty
debugging issues.
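
A small sketch of the current behaviour, where the type check catches an integer passed by mistake (names are illustrative):

```python
data = b"abc"
first = data[0]   # an int (97), not b"a"

# Today the integer is rejected up front with a TypeError.
try:
    data.startswith(first)
except TypeError:
    rejected = True
else:
    rejected = False

# Slicing instead of indexing gives a value the method accepts.
accepted = data.startswith(data[0:1])

print(rejected, accepted)
```

With the check in place, the mistake surfaces at the call site rather than as a wrong answer further down.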

> For ``bytearray``, some additional changes are proposed to the current
> integer based operations to ensure they remain consistent with the proposed
> constructor changes:
> * ``append()``: updated to be consistent with ``bytes.byte()``
> * ``remove()``: updated to be consistent with ``bytes.byte()``
> * ``+=``: updated to be consistent with ``bytes()`` changes (if any)
> * ``extend()``: updated to be consistent with ``bytes()`` changes (if any)

Eew again. These are operations from the MutableSequence ABC and there is
no reason to make their signatures fuzzier.
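
For comparison, a sketch of the signatures as they stand: single-element operations take an integer, bulk operations take an iterable of integers.

```python
buf = bytearray(b"ab")

buf.append(99)       # one integer element
buf.extend(b"de")    # an iterable of integers (bytes qualifies)
buf += b"f"          # in-place concatenation with a bytes-like object
buf.remove(97)       # the integer value to remove

# A bytes object is not an element, so append() rejects it.
try:
    buf.append(b"g")
except TypeError:
    append_rejects_bytes = True
else:
    append_rejects_bytes = False

print(buf, append_rejects_bytes)
```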

> Acknowledgement of surprising behaviour of some ``bytearray`` methods
> ---------------------------------------------------------------------
> Several of the ``bytes`` and ``bytearray`` methods have their origins in
> the Python 2 ``str`` API.

You make it sound as if this is a bad thing or an accident.

> As ``str`` is an immutable type, all of these
> operations are defined as returning a *new* instance, rather than operating
> in place. This contrasts with methods on other mutable types like ``list``,
> where ``list.sort()`` and ``list.reverse()`` operate in-place and return
> ``None``, rather than creating a new object.

So does bytestring.reverse(). And if you really insist we can add
bytestring.sort(). :-)

> Backwards compatibility constraints make it impractical to change this
> behaviour at this point, but it may be appropriate to explicitly call out
> this quirk in the documentation for the ``bytearray`` type. It affects the
> following methods that could reasonably be expected to operate in-place on
> a mutable type:
> * ``center()``
> * ``ljust()``
> * ``rjust()``
> * ``strip()``
> * ``lstrip()``
> * ``rstrip()``
> * ``replace()``
> * ``lower()``
> * ``upper()``
> * ``swapcase()``
> * ``title()``
> * ``capitalize()``
> * ``translate()``
> * ``expandtabs()``
> * ``zfill()``

That all feels like hypercorrection. These are string methods and it would
be completely wrong if bytearray changed them to modify the object
in-place. I also don't see why anyone would think these would modify the
object, given that everybody encounters these first for the str() type,
then for bytes(), then finally (by extension) for bytearray().

The *only* place where there should be any confusion about whether the
value is mutated or the variable is updated with a new object would be the
+= operator (and *=) but that's due to that operator's ambiguity.
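
A brief sketch of that distinction, using aliases to show whether an object was mutated or a name was rebound (names are illustrative):

```python
buf = bytearray(b"abc")

# String-style method: returns a new object, leaves buf untouched.
upper = buf.upper()

# += on a bytearray mutates in place, so an alias sees the change.
alias = buf
buf += b"d"

# += on bytes builds a new object and rebinds the name only.
frozen = b"abc"
frozen_alias = frozen
frozen += b"d"

print(upper, buf, alias is buf)
print(frozen, frozen_alias)
```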

> Note that the following ``bytearray`` operations *do* operate in place, as
> they're part of the mutable sequence API in ``bytearray``, rather than
> being inspired by the immutable Python 2 ``str`` API:
> * ``+=``
> * ``append()``
> * ``extend()``
> * ``reverse()``
> * ``remove()``
> * ``pop()``

Right. And there's nothing wrong with this.

> References
> ==========
> .. [ideas-thread1] https://mail.python.org/pipermail/python-ideas/2014-March/027295.html
> .. [empty-buffer-issue] http://bugs.python.org/issue20895
> Copyright
> =========
> This document has been placed in the public domain.
> --
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

--Guido van Rossum (python.org/~guido)