[Python-ideas] Fixing the Python 3 bytes constructor

Nick Coghlan ncoghlan at gmail.com
Tue Apr 1 15:26:50 CEST 2014


On 31 March 2014 02:05, Guido van Rossum <guido at python.org> wrote:
> On Sat, Mar 29, 2014 at 7:17 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> This PEP proposes a number of small adjustments to the APIs of the
>> ``bytes``
>> and ``bytearray`` types to make their behaviour more internally consistent
>> and to make it easier to operate entirely in the binary domain
>
>
> I hope you don't mind I cut the last 60% of this sentence (everything after
> "binary domain").

No worries - the shorter version is better. I've been spending too
much time in recent months explaining the significance of the text
model changes to different people, so now I end up trying to explain
it any time I write about it :)

>> Background
>> ==========
>>
>> Over the course of Python 3's evolution, a number of adjustments have been
>> made to the core ``bytes`` and ``bytearray`` types as additional practical
>> experience was gained with using them in code beyond the Python 3 standard
>> library and test suite. However, to date, these changes have been made
>> on a relatively ad hoc tactical basis as specific issues were identified,
>> rather than as part of a systematic review of the APIs of these types.
>
>
> I'm not sure you can claim that. We probably have more information based on
> experience now than when we did the redesign. (At that time most experience
> was based on using str() for binary data.)

Yeah, I was mostly thinking of the change to make the search APIs
accept both integers and subsequences when I wrote that. I'll try to
think of a better way of wording it.

I also realised in reviewing the docs that a key part of the problem
may actually be a shortcut I took in the sequence docs rewrite that I
did quite a while ago now - the bytes/bytearray docs are currently
written in terms of *how they differ from str*. They don't really
cover the "container of integers" aspect particularly well. I now
believe a better way to tackle that would be to be upfront that these
types basically have two APIs on a single class: their core tuple/list
of integers "arbitrary binary data" API, and then the str-inspired
"binary data with ASCII segments" API on top of that.
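
As a rough illustration of the two views (a sketch of existing 3.x behaviour, with a made-up sample value, not proposed docs wording):

```python
data = b"GET /index"

# "Container of integers" view: indexing and iteration yield ints.
first = data[0]            # 71, the integer value of ASCII "G"
as_ints = list(data[:3])   # [71, 69, 84]

# Slicing stays in the binary domain and yields a bytes object.
first_slice = data[0:1]    # b"G"

# str-inspired view: methods that assume ASCII-compatible segments.
parts = data.split(b" ")   # [b"GET", b"/index"]
lowered = data.lower()     # b"get /index"
```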

>> This
>> approach has allowed inconsistencies to creep into the API design as to
>> which
>> input types are accepted by different methods. Additional inconsistencies
>> linger from an earlier pre-release design where there was *no* separate
>> ``bytearray`` type, and instead the core ``bytes`` type was mutable (with
>> no immutable counterpart), as well as from the origins of these types in
>> the text-like behaviour of the Python 2 ``str`` type.
>
> You make it sound as if modeling bytes() after Python 2's str() was an
> accident. It wasn't.

Sorry, didn't mean to imply that. More that we hadn't previously sat
down and thought through how best to clearly articulate this in the
docs, and hence some of the current inconsistencies hadn't become
clear.

>> This PEP aims to provide the missing systematic review, with the goal of
>> ensuring that wherever feasible (given backwards compatibility
>> constraints)
>> these current inconsistencies are addressed for the Python 3.5 release.
>
> I would like to convince you to aim lower, drop the "systematic review", and
> just focus on some changes that are likely to improve users' experience
> (which includes porting Python 2 code).

After re-reading the current docs (which I also wrote, but this aspect
was a mere afterthought at the time), I'm thinking a useful course of
action in parallel with this PEP will be for me to work on improving
the Python 3.4 docs for these types. The bits I find too hard to
explain then become fodder for tweaks in 3.5.

>> Proposals
>> =========

>> * more consistently accepting length 1 ``bytes`` objects as input where an
>>   integer between ``0`` and ``255`` inclusive is expected, and vice-versa
>
>
> Not sure I like this as a goal. OK, stronger: I don't like this goal.

Roger. As noted above, I now think we can address this by splitting
the API documentation instead, so that there's a clear "tuple/list of
ints" section and a "binary data with ASCII segments" section. Some
hybrid APIs (like the search ones) may appear in both. In terms of
analogies to other types:

Behaviour is common to tuple + str: hybrid API for bytes + bytearray
Behaviour is list-only: int-only API for bytearray
Behaviour is str-only: str-like only API for bytes + bytearray

Now that I've framed the question that way, I think I can not only
make it make sense in the docs, but I believe the 3.4 behaviour is
already pretty close to consistent with it.
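
One example per category, as a sketch of how I read the current behaviour (the choice of methods is only illustrative):

```python
buf = bytearray(b"abc")

# Common to tuple and str (hybrid API): containment accepts
# either an integer or a binary subsequence.
hybrid_int = 97 in buf          # True, 97 is ord("a")
hybrid_seq = b"bc" in buf       # True

# list-only (int-only API, bytearray): mutation methods take integers.
buf.append(100)                 # buf is now bytearray(b"abcd")
try:
    buf.append(b"e")
    append_rejects_bytes = False
except TypeError:
    append_rejects_bytes = True

# str-only (str-like API): arguments are binary sequences, not ints.
padded = bytes(buf).center(6, b"*")   # b"*abcd*"
```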

The proposed bytes.byte() constructor would then more cleanly handle
the few cases where it may be desirable to pass an int to a str-like
API (such as replace()).

>> * allowing users to easily convert integer output to a length 1 ``bytes``
>>   object
>
> I think you meant integer values instead of output?

Sort of - I was thinking of reversing the effects of indexing here.
That is, replacing the current:

    x = data[0:1]

with:

    x = bytes.byte(data[0])

> In Python 2 we did this
> with the global function chr(), but in Python 3 that creates a str(). (The
> history of chr() and ord() as built-in functions is that they long predate
> the notion of methods (class- or otherwise), and their naming comes straight
> from Pascal.)
>
> Anyway, I don't know that the use case is so common that it needs more than
> bytes([i]) or bytearray([i]) -- if there is an argument to be made for
> bytes.byte(i) and bytearray.byte(i) it would be that the [i] in the
> constructor is somewhat hard to grasp.

Since I was mostly thinking about an alternative to slicing to convert
an index lookup back to a bytes object, this doesn't seem appealing to
me:

    x = bytes([data[0]])

The other problem is that "bytes([i])" doesn't play nice with higher order
functions like map.

I don't expect wanting this to be *hugely* common, but I do think
there's value in having the primitive conversion operation implied by
the constructor behaviour available as a Python level operation.
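
A small sketch of the map problem. The module-level byte() function here is only a hypothetical stand-in for the proposed bytes.byte(), which doesn't exist yet:

```python
def byte(i):
    """Hypothetical stand-in for the proposed bytes.byte(): int -> length 1 bytes."""
    return bytes([i])

data = b"abc"

# Three spellings for getting the first byte back as a bytes object.
via_slice = data[0:1]
via_list = bytes([data[0]])
via_helper = byte(data[0])

# Passing the constructor itself to map() hits the integer-as-length
# behaviour: each int n becomes n zero bytes.
zero_filled = list(map(bytes, [1, 2]))   # [b"\x00", b"\x00\x00"]

# The helper composes with map() as intended.
singles = list(map(byte, data))          # [b"a", b"b", b"c"]
```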

>> Alternate Constructors
>> ----------------------
>>
>> The ``bytes`` and ``bytearray`` constructors currently accept an integer
>> argument, but interpret it to mean a zero-filled object of the given
>> length.
>
>
> This is one of the two legacies of the original "mutable bytes" design, and
> I agree we should strive to replace it -- although I think one round of
> deprecation may be too quick.

Postponing removal to 3.7 or indefinitely is fine by me.

While I think it should go away, I'm in no hurry to get rid of it - it
started bothering me less once I realised you can already safely call
bytes on arbitrary objects by passing them through memoryview first
(as that doesn't have the mutable legacy that causes problems with
integer input).
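
For reference, a sketch of that memoryview trick (to_bytes is just an illustrative name, not an existing or proposed API):

```python
def to_bytes(obj):
    """Hypothetical helper: copy any buffer-exporting object to bytes."""
    # memoryview() rejects integers, so an int can never be
    # misread as a request for a zero-filled object.
    return bytes(memoryview(obj))

from_buffer = to_bytes(bytearray(b"abc"))   # b"abc"
direct = bytes(3)                            # b"\x00\x00\x00", the legacy behaviour

try:
    to_bytes(3)
    int_rejected = False
except TypeError:
    int_rejected = True
```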

>> For ``bytes``, a ``byte`` constructor is proposed that converts integers
>> (as indicated by ``operator.index``)
>
> I know why you reference this, but it feels confusing to me. At this point
> in the narrative it's better to just say "integer" and explain how it
> decides "integer-ness" later.

Sounds good (I actually meant to double check that we *do* currently
accept arbitrary integer-like objects in the bytes constructor).
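
The check I have in mind looks roughly like this. Small is a made-up integer-like class, and the comments reflect what I'd expect from an implementation based on operator.index rather than anything I've confirmed against every version:

```python
import operator

class Small:
    """Hypothetical integer-like type that is not an int subclass."""
    def __init__(self, value):
        self.value = value

    def __index__(self):
        return self.value

s = Small(3)
as_index = operator.index(s)   # the plain int that __index__ returns
as_item = bytes([s])           # s used as a byte value inside an iterable
as_length = bytes(s)           # s used as a length by the legacy constructor
```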

>> in the appropriate range to a ``bytes``
>> object, converts objects that support the buffer API to bytes, and also
>> passes through length 1 byte strings unchanged::
>
> I think the second half (accepting bytes instances of length 1) is wrong
> here and doesn't actually have a practical use case. I'll say more below.

Agreed, I now think "the binary equivalent of chr()" would be much
better behaviour here.

>> For ``bytearray``, a ``from_len`` constructor is proposed that
>> preallocates
>> the buffer filled with a particular value (default to ``0``) as a direct
>> replacement for the current constructor behaviour, rather than having to
>> use
>> sequence repetition to achieve the same effect in a less intuitive way::
>>
>>     >>> bytearray.from_len(3)
>>     bytearray(b'\x00\x00\x00')
>>     >>> bytearray.from_len(3, 6)
>>     bytearray(b'\x06\x06\x06')
>>
>> This part of the proposal was covered by an existing issue
>> [empty-buffer-issue]_ and a variety of names have been proposed
>> (``empty_buffer``, ``zeros``, ``zeroes``, ``allnull``, ``fill``). The
>> specific name currently proposed was chosen by analogy with
>> ``dict.fromkeys()`` and ``itertools.chain.from_iterable()`` to be completely
>> explicit that it is an alternate constructor rather than an in-place
>> mutation, as well as how it differs from the standard constructor.
>
> I think you need to brainstorm more on the name; from_len() looks pretty
> awkward. And I think it's better to add it to bytes() as well, since the two
> classes intentionally try to be as similar as possible.

I initially liked Barry's "fill" suggestion, but then realised it read
too much like an in-place operation (at least to my mind). Here are
some examples (using Brett's suggestion of a keyword only second
parameter):

    bytearray.zeros(3) # NumPy spelling, no configurable fill value
    bytearray.fill(3)
    bytearray.fill(3, fillvalue=6)
    bytearray.filled(3)
    bytearray.filled(3, fillvalue=6)

To be honest, I'm actually coming around to the "just copy the 'zeros'
name from NumPy and call it done" view on this one. I don't have a
concrete use case for a custom fill value, and I think I'll learn
quickly enough that it uses the shorter spelling.
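
For comparison, here is what those spellings can be emulated with today. zeros() and filled() below are hypothetical module-level stand-ins for the candidate classmethods, nothing more:

```python
def zeros(n):
    """Hypothetical stand-in for the proposed bytearray.zeros(n)."""
    return bytearray(n)    # current spelling: an integer argument means a length

def filled(n, *, fillvalue=0):
    """Hypothetical stand-in for a filled(n, fillvalue=...) spelling."""
    return bytearray([fillvalue]) * n   # sequence repetition

z = zeros(3)                   # bytearray(b"\x00\x00\x00")
f = filled(3, fillvalue=6)     # bytearray(b"\x06\x06\x06")
```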

>> Open questions
>> ^^^^^^^^^^^^^^
>> * Should ``bytes.byte()`` raise ``TypeError`` or ``ValueError`` for binary
>>   sequences with more than one element? The ``TypeError`` currently
>> proposed
>>   is copied (with slightly improved wording) from the behaviour of
>> ``ord()``
>>   with sequences containing more than one code point, while ``ValueError``
>>   would be more consistent with the existing handling of out-of-range
>>   integer values.
>
> It should not accept any bytes arguments. But if somehow you convince me
> otherwise, it should be ValueError (and honestly, ord() is wrong there).
>
>>
>> * ``bytes.byte()`` is defined above as accepting length 1 binary sequences
>>   as individual bytes, but this is currently inconsistent with the main
>>   ``bytes`` constructor::
>>
>>       >>> bytes([b"a", b"b", b"c"])
>>       Traceback (most recent call last):
>>         File "<stdin>", line 1, in <module>
>>       TypeError: 'bytes' object cannot be interpreted as an integer
>>
>>   Should the ``bytes`` constructor be changed to accept iterables of
>> length 1
>>   bytes objects in addition to iterables of integers? If so, should it
>>   allow a mixture of the two in a single iterable?
>
> Noooooooooooooooooooooooooo!!!!!

Yeah, it bothered me, too :)

As you suggest, I think it makes sense to extrapolate this the other
way and change the definition of bytes.byte() to be a true inverse of
ord() for binary data.
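
Sketching that revised definition with existing operations (byte() is again a hypothetical stand-in, and the error types shown are simply what bytes([i]) raises today, not a decision about the final API):

```python
def byte(i):
    """Hypothetical stand-in for bytes.byte() as the binary inverse of ord()."""
    return bytes([i])

# ord() already accepts length 1 bytes objects, so the round trip holds.
round_trip = all(ord(byte(i)) == i for i in range(256))

# Out-of-range integers are rejected with ValueError.
try:
    byte(256)
    range_error = None
except ValueError as exc:
    range_error = exc

# bytes arguments are not accepted at all, even at length 1.
try:
    byte(b"a")
    type_error = None
except TypeError as exc:
    type_error = exc
```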

>> Iteration
>> ---------
>>
>> Iteration over ``bytes`` objects and other binary sequences produces
>> integers. Rather than proposing a new method that would need to be added
>> not only to ``bytes``, ``bytearray`` and ``memoryview``, but potentially
>> to third party types as well, this PEP proposes that iteration to produce
>> length 1 ``bytes`` objects instead be handled by combining ``map`` with
>> the new ``bytes.byte()`` alternate constructor proposed above::
>>
>>     for x in map(bytes.byte, data):
>>         # x is a length 1 ``bytes`` object, rather than an integer
>>         # This works with *any* container of integers in the range
>>         # 0 to 255 inclusive
>
> I can see why you don't like a new method, but this idiom is way too verbose
> and unintuitive to ever gain traction. Let's just add a new method to all
> three types, 3rd party types will get the message.

Fair enough. Is "iterbytes()" OK as the name?

    for x in data.iterbytes():
        # x is a length 1 ``bytes`` object, rather than an integer
        # This works with *any* container of integers in the range
        # 0 to 255 inclusive
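
Until such a method exists, a free function gives the same effect. This is only a sketch: the function name mirrors the proposal but is hypothetical, and unlike the map() version it assumes the input supports slicing to a bytes-like object:

```python
def iterbytes(data):
    """Hypothetical free-function version of the proposed iterbytes() method."""
    for i in range(len(data)):
        # Slicing keeps the result a binary sequence; bytes() normalises
        # bytearray and memoryview slices to bytes.
        yield bytes(data[i:i + 1])

chunks = list(iterbytes(b"abc"))                 # [b"a", b"b", b"c"]
from_array = list(iterbytes(bytearray(b"hi")))   # [b"h", b"i"]
from_view = list(iterbytes(memoryview(b"xy")))   # [b"x", b"y"]
```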

>> Consistent support for different input types
>> --------------------------------------------
>>
>> In Python 3.3, the binary search operations (``in``, ``count()``,
>> ``find()``, ``index()``, ``rfind()`` and ``rindex()``) were updated to
>> accept integers in the range 0 to 255 (inclusive) as their first argument
>> (in addition to the existing support for binary sequences).
>
>
> I wonder if that wasn't a bit over-zealous. While 'in', count() and index()
> are sequence methods (looking for elements) that have an extended meaning
> (looking for substrings) for string types, the find() and r*() variants are
> only defined for strings.

I suspect they're using the same underlying search code, although I
haven't actually checked.
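
For anyone who hasn't run into it, the 3.3+ behaviour being discussed looks like this (sample data is arbitrary):

```python
data = b"spam and eggs"

has_int = 97 in data            # True: 97 is ord("a")
has_seq = b"and" in data        # True
idx_int = data.index(97)        # 2
idx_seq = data.find(b"and")     # 5
count_int = data.count(115)     # 2: two b"s" bytes
rfind_int = data.rfind(103)     # 11: the last b"g"
```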

>> This PEP proposes extending that behaviour of accepting integers as being
>> equivalent to the corresponding length 1 binary sequence to several other
>> ``bytes`` and ``bytearray`` methods that currently expect a ``bytes``
>> object for certain parameters. In essence, if a value is an acceptable
>> input to the new ``bytes.byte`` constructor defined above, then it would
>> be acceptable in the roles defined here (in addition to any other already
>> supported inputs):
>>
>> * ``startswith()`` prefix(es)
>> * ``endswith()`` suffix(es)
>>
>> * ``center()`` fill character
>> * ``ljust()`` fill character
>> * ``rjust()`` fill character
>>
>> * ``strip()`` character to strip
>> * ``lstrip()`` character to strip
>> * ``rstrip()`` character to strip
>>
>> * ``partition()`` separator argument
>> * ``rpartition()`` separator argument
>>
>> * ``split()`` separator argument
>> * ``rsplit()`` separator argument
>>
>> * ``replace()`` old value and new value
>>
>> In addition to the consistency motive, this approach also makes it easier
>> to work with the indexing behaviour, as the result of an indexing
>> operation
>> can more easily be fed back in to other methods.
>
>
> I think herein lies madness. The intention seems to be to paper over as much
> as possible the unfortunate behavior of b[i]. But how often does any of
> these methods get called with such a construct? And how often will that be
> in a context where this is the *only* thing that is affected by b[i]
> returning an int in Python 3 but a string in Python 2? (In my experience
> these are mostly called with literal arguments, except inside wrapper
> functions that are themselves intended to be called with a literal
> argument.) Weakening the type checking here seems a bad idea -- it would
> accept integers in *any* context, and that would just cause more nasty
> debugging issues.

Yeah, I agree with this now. I was already starting to get "What's the
actual use case here?" vibes while writing the PEP, and your reaction
makes it easy to change my mind :)

"replace()" seems like the only one where a reasonable case might be
made to allowing integer input (and that's actually the one Brandon
was asking about that got me thinking along these lines in the first
place).
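
The replace() case as it stands today, as a sketch with made-up data:

```python
data = b"a-b-c"
old, new = data[1], ord("_")    # both are ints: 45 and 95

# replace() currently insists on bytes-like arguments.
try:
    data.replace(old, new)
    accepts_ints = True
except TypeError:
    accepts_ints = False

# Current workaround: wrap each int in a single-element list.
replaced = data.replace(bytes([old]), bytes([new]))   # b"a_b_c"
```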

>> For ``bytearray``, some additional changes are proposed to the current
>> integer based operations to ensure they remain consistent with the
>> proposed
>> constructor changes::
>>
>> * ``append()``: updated to be consistent with ``bytes.byte()``
>> * ``remove()``: updated to be consistent with ``bytes.byte()``
>> * ``+=``: updated to be consistent with ``bytes()`` changes (if any)
>> * ``extend()``: updated to be consistent with ``bytes()`` changes (if any)
>
>
> Eew again. These are operations from the MutableSequence ABC and there is no
> reason to make their signatures fuzzier.


>> Acknowledgement of surprising behaviour of some ``bytearray`` methods
>> ---------------------------------------------------------------------
>>
>> Several of the ``bytes`` and ``bytearray`` methods have their origins in
>> the
>> Python 2 ``str`` API.
>
> You make it sound as if this is a bad thing or an accident.

Again, not my intention. I think that impression will be easier to
avoid once the PEP is recast as treating the issue as primarily a
documentation problem, with just a few minor API tweaks.

>> As ``str`` is an immutable type, all of these
>> operations are defined as returning a *new* instance, rather than
>> operating
>> in place. This contrasts with methods on other mutable types like
>> ``list``,
>> where ``list.sort()`` and ``list.reverse()`` operate in-place and return
>> ``None``, rather than creating a new object.
>
> So does bytestring.reverse(). And if you really insist we can add
> bytestring.sort(). :-)

Yeah, this all becomes substantially *less* surprising once these
types are documented as effectively exposing two mostly distinct APIs
(their underlying "container of ints" API for arbitrary binary data,
and then the additional str-like API for binary data with ASCII
compatible segments).
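
A short sketch of the contrast on bytearray, using only existing methods:

```python
buf = bytearray(b"abc")

# Container-of-ints API: mutates in place and returns None, like list.reverse().
result = buf.reverse()     # result is None; buf is bytearray(b"cba")

# str-like API: returns a new object and leaves the original untouched.
upper = buf.upper()        # bytearray(b"CBA")
```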

I'm not sure when I'll get the PEP updated (since this isn't an urgent
problem, I just wanted to get the initial draft of the PEP written
while the problem was fresh in my mind), but I think the end result
should be relatively non-controversial once I incorporate your
feedback.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

