[Python-ideas] Fixing the Python 3 bytes constructor

Tue Apr 1 18:30:12 CEST 2014

Nice come-back! Responses inline.

On Tue, Apr 1, 2014 at 6:26 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> On 31 March 2014 02:05, Guido van Rossum <guido at python.org> wrote:
> > On Sat, Mar 29, 2014 at 7:17 PM, Nick Coghlan <ncoghlan at gmail.com>
> wrote:
>
 [...]

> >> Background
> >> ==========
> >>
> >> Over the course of Python 3's evolution, a number of adjustments have
> been
> >> made to the core ``bytes`` and ``bytearray`` types as additional
> practical
> >> experience was gained with using them in code beyond the Python 3
> standard
> >> library and test suite. However, to date, these changes have been made
> >> on a relatively ad hoc tactical basis as specific issues were
> identified,
> >> rather than as part of a systematic review of the APIs of these types.
> >
> > I'm not sure you can claim that. We probably have more information based
> on
> > experience now than when we did the redesign. (At that time most
> experience
> > was based on using str() for binary data.)
>
> Yeah, I was mostly thinking of the change to make the search APIs
> accept both integers and subsequences when I wrote that. I'll try to
> think of a better way of wording it.
>

That might have felt ad-hoc, but it was in line with the idea that bytes
follow the patterns of both tuple and string. (And similar for bytearray.)

> I also realised in reviewing the docs that a key part of the problem
> may actually be a shortcut I took in the sequence docs rewrite that I
> did quite a while ago now - the bytes/bytearray docs are currently
> written in terms of *how they differ from str*. They don't really
> cover the "container of integers" aspect particularly well. I now
> believe a better way to tackle that would be to be upfront that these
> types basically have two APIs on a single class: their core tuple/list
> of integers "arbitrary binary data" API, and then the str-inspired
> "binary data with ASCII segments" API on top of that.
>

Right, and then you can follow up with details about how the APIs differ
from their equivalents in tuple and string. I imagine there are three
categories here (some may be empty): APIs that are the union of the
corresponding tuple and string APIs; APIs that are like one of the "base"
classes with some restrictions or extensions that can't be explained by
referring to the other "base"; and APIs that are unique to bytes.
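
To make the two "base" behaviours concrete, here is a minimal sketch of how
a bytes object already behaves in Python 3. Nothing in it is proposed or
new; the variable names are only for illustration.

```python
data = b"abc"

# Tuple-like side: the elements are integers in range(256).
first = data[0]          # indexing yields an int
elements = list(data)    # iteration yields ints

# str-like side: slicing and the text-style methods yield bytes.
head = data[0:1]         # slicing yields a length 1 bytes object
shouted = data.upper()   # ASCII-oriented method modelled on str
parts = b"a,b".split(b",")
```

The same object answers both kinds of question, which is why the docs
probably need to describe both APIs rather than only the differences from
str.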

> >> This
> >> approach has allowed inconsistencies to creep into the API design as to
> >> which
> >> input types are accepted by different methods. Additional
> inconsistencies
> >> linger from an earlier pre-release design where there was *no* separate
> >> ``bytearray`` type, and instead the core ``bytes`` type was mutable
> (with
> >> no immutable counterpart), as well as from the origins of these types in
> >> the text-like behaviour of the Python 2 ``str`` type.
> >
> > You make it sound as if modeling bytes() after Python 2's str() was an
> > accident. It wasn't.
>
> Sorry, didn't mean to imply that. More that we hadn't previously sat
> down and thought through how best to clearly articulate this in the
> docs, and hence some of the current inconsistencies hadn't become
> clear.
>

Heh, I get defensive when you say bad things about the language. Not so
much about the docs (too often I don't know what's in the docs myself
because my knowledge predates the docs :-).

> >> This PEP aims to provide the missing systematic review, with the goal of
> >> ensuring that wherever feasible (given backwards compatibility
> >> constraints)
> >> these current inconsistencies are addressed for the Python 3.5 release.
> >
> > I would like to convince you to aim lower, drop the "systematic review",
> and
> > just focus on some changes that are likely to improve users' experience
> > (which includes porting Python 2 code).
>
> After re-reading the current docs (which I also wrote, but this aspect
> was a mere afterthought at the time), I'm thinking a useful course of
> action in parallel with this PEP will be for me to work on improving
> the Python 3.4 docs for these types. The bits I find too hard to
> explain then become fodder for tweaks in 3.5.
>

I am very much in favor of this approach. Many API improvements I've made
myself have come from attempts to write documentation, and I imagine I'm
not unique.

> >> Proposals
> >> =========
>
> >> * more consistently accepting length 1 ``bytes`` objects as input where
> an
> >>   integer between ``0`` and ``255`` inclusive is expected, and
> vice-versa
> >
> >
> > Not sure I like this as a goal. OK, stronger: I don't like this goal.
>
> Roger. As noted above, I now think we can address this by splitting
> the API documentation instead, so that there's a clear "tuple/list of
> ints" section and a "binary data with ASCII segments" section. Some
> hybrid APIs (like the search ones) may appear in both. In terms of
> analogies to other types:
>
> Behaviour is common to tuple + str: hybrid API for bytes + bytearray
> Behaviour is list-only: int-only API for bytearray
> Behaviour is str-only: str-like only API for bytes + bytearray
>

Heh, I think I just accidentally reinvented that same categorization above.
:-)

> Now that I've framed the question that way, I think I can not only
> make it make sense in the docs, but I believe the 3.4 behaviour is
> already pretty close to consistent with it.
>
> The proposed bytes.byte() constructor would then more cleanly handle
> the few cases where it may be desirable to pass an int to a str-like
> API (such as replace())
>
> >> * allowing users to easily convert integer output to a length 1
> ``bytes``
> >>   object
> >
> > I think you meant integer values instead of output?
>
> Sort of - I was thinking of reversing the effects of indexing here.
> That is, replacing the current:
>
>     x = data[0:1]
>
> with:
>
>     x = bytes.byte(data[0])
>

Hm. I don't find that very attractive. You can't write Python 2/3 code
using that idiom, and it's a lot longer than the original. The only
redeeming feature is that it clearly fails when data is empty, and possibly
that you don't have to compute the second index (which could be awkward if
the first index is an expression).

I'm not denying that we need bytes.byte(), but this doesn't sound like much
of a motivation. Just pointing to the need for bytes/bytestring equivalents
for chr() makes more sense to me.
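
A rough sketch of the trade-off, assuming a helper named byte_from_int()
as a stand-in for the proposed bytes.byte() (which does not exist today):

```python
def byte_from_int(i):
    """Hypothetical stand-in for the proposed bytes.byte(i)."""
    return bytes([i])

data = b"xyz"
via_slice = data[0:1]
via_helper = byte_from_int(data[0])

# The two spellings differ on empty input.
empty = b""
sliced = empty[0:1]      # slicing silently yields an empty bytes object
try:
    byte_from_int(empty[0])
except IndexError:
    failed_on_empty = True   # indexing fails loudly instead
else:
    failed_on_empty = False
```

The slice works unchanged in Python 2 and 3; the indexing form is the one
that fails clearly when data is empty.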

> > In Python 2 we did this
> > with the global function chr(), but in Python 3 that creates a str().
> (The
> > history of chr() and ord() as built-in functions is that they long
> predate
> > the notion of methods (class- or otherwise), and their naming comes
> straight
> > from Pascal.)
> >
> > Anyway, I don't know that the use case is so common that it needs more
> than
> > bytes([i]) or bytearray([i]) -- if there is an argument to be made for
> > bytes.byte(i) and bytearray.byte(i) it would be that the [i] in the
> > constructor is somewhat hard to grasp.
>
> Since I was mostly thinking about an alternative to slicing to convert
> an index lookup back to a bytes object, this doesn't seem appealing to
> me:
>
>     x = bytes([data[0]])
>

Fair enough.

> The other one is that "bytes([i])" doesn't play nice with higher order
> functions like map.
>

Also fair enough; having to define a helper function feels bad. All in all,
I do think we need bytes.byte() and bytearray.byte(). We may just have to
fine-tune the motivation a bit. :-)
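
A small sketch of the map() problem in current Python 3. The helper name
byte_from_int() is an assumption used only for illustration; it stands in
for the proposed bytes.byte().

```python
data = b"\x01\x02\x03"

# Pitfall: passing bytes itself to map() hits the "int means length"
# constructor behaviour, so each element becomes a zero-filled object.
zero_filled = list(map(bytes, data))

# Today the fix is a helper (or a lambda) wrapping bytes([i]).
def byte_from_int(i):
    """Hypothetical stand-in for the proposed bytes.byte(i)."""
    return bytes([i])

singles = list(map(byte_from_int, data))
```

With an alternate constructor, the helper definition would presumably
disappear and the constructor could be handed to map() directly.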

> I don't expect wanting this to be *hugely* common, but I do think
> there's value in having the primitive conversion operation implied by
> the constructor behaviour available as a Python level operation.
>

Yes.

> >> Alternate Constructors
> >> ----------------------
> >>
> >> The ``bytes`` and ``bytearray`` constructors currently accept an integer
> >> argument, but interpret it to mean a zero-filled object of the given
> >> length.
> >
> >
> > This is one of the two legacies of the original "mutable bytes" design,
> and
> > I agree we should strive to replace it -- although I think one round of
> > deprecation may be too quick.
>
> Postponing removal to 3.7 or indefinitely is fine by me.
>
> While I think it should go away, I'm in no hurry to get rid of it - it
> started bothering me less once I realised you can already safely call
> bytes on arbitrary objects by passing them through memoryview first
> (as that doesn't have the mutable legacy that causes problems with
> integer input).
>

I'm not sure I quite see the use case. memoryview() doesn't take "arbitrary
objects" -- it takes objects that implement the buffer protocol (if that's
still the name :-). Are you saying that the advantage of going through
memoryview() is that it fails fast when you accidentally pass it an
integer(-like object)?
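
For reference, a sketch of the behaviour being asked about, as it works in
current Python 3. The helper name to_bytes_strict() is hypothetical.

```python
# The plain constructor interprets an integer as a length.
zero_filled = bytes(3)

def to_bytes_strict(obj):
    """Hypothetical helper: convert only objects exporting a buffer."""
    return bytes(memoryview(obj))

copied = to_bytes_strict(bytearray(b"abc"))

try:
    to_bytes_strict(3)
except TypeError:
    rejected_int = True    # memoryview() refuses the integer up front
else:
    rejected_int = False
```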

[...]

> >> For ``bytearray``, a ``from_len`` constructor is proposed that
> >> preallocates
> >> the buffer filled with a particular value (default to ``0``) as a direct
> >> replacement for the current constructor behaviour, rather than having to
> >> use
> >> sequence repetition to achieve the same effect in a less intuitive way::
> >>
> >>     >>> bytearray.from_len(3)
> >>     bytearray(b'\x00\x00\x00')
> >>     >>> bytearray.from_len(3, 6)
> >>     bytearray(b'\x06\x06\x06')
> >>
> >> This part of the proposal was covered by an existing issue
> >> [empty-buffer-issue]_ and a variety of names have been proposed
> >> (``empty_buffer``, ``zeros``, ``zeroes``, ``allnull``, ``fill``). The
> >> specific name currently proposed was chosen by analogy with
> >> ``dict.fromkeys()`` and ``itertools.chain.from_iterable()`` to be completely
> >> explicit that it is an alternate constructor rather than an in-place
> >> mutation, as well as how it differs from the standard constructor.
> >
> > I think you need to brainstorm more on the name; from_len() looks pretty
> > awkward. And I think it's better to add it to bytes() as well, since the
> two
> > classes intentionally try to be as similar as possible.
>
> I initially liked Barry's "fill" suggestion, but then realised it read
> too much like an in-place operation (at least to my mind). Here are
> some examples (using Brett's suggestion of a keyword only second
> parameter):
>
>     bytearray.zeros(3) # NumPy spelling, no configurable fill value
>     bytearray.fill(3)
>     bytearray.fill(3, fillvalue=6)
>     bytearray.filled(3)
>     bytearray.filled(3, fillvalue=6)
>
> To be honest, I'm actually coming around to the "just copy the 'zeros'
> name from NumPy and call it done" view on this one. I don't have a
> concrete use case for a custom fill value, and I think I'll learn
> quickly enough that it uses the shorter spelling.
>

+1
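
A minimal sketch of what the "zeros" spelling would replace. The free
function zeros() below is a hypothetical stand-in for the proposed
bytearray.zeros(); the other lines use only what exists today.

```python
def zeros(n):
    """Hypothetical stand-in for the proposed bytearray.zeros(n)."""
    # Relies on the current "int means length" constructor behaviour.
    return bytearray(n)

buf = zeros(3)

# The same buffer via sequence repetition, which also covers the rare
# custom fill value without needing a second parameter.
via_repetition = bytearray(b"\x00") * 3
filled = bytearray(b"\x06") * 3
```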

[...]

> >> * ``bytes.byte()`` is defined above as accepting length 1 binary
> sequences
> >>   as individual bytes, but this is currently inconsistent with the main
> >>   ``bytes`` constructor::
> >>
> >>       >>> bytes([b"a", b"b", b"c"])
> >>       Traceback (most recent call last):
> >>         File "<stdin>", line 1, in <module>
> >>       TypeError: 'bytes' object cannot be interpreted as an integer
> >>
> >>   Should the ``bytes`` constructor be changed to accept iterables of
> >> length 1
> >>   bytes objects in addition to iterables of integers? If so, should it
> >>   allow a mixture of the two in a single iterable?
> >
> > Noooooooooooooooooooooooooo!!!!!
>
> Yeah, it bothered me, too :)
>
> As you suggest, I think it makes sense to extrapolate this the other
> way and change the definition of bytes.byte() to be a true inverse of
> ord() for binary data.
>

+1
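
A sketch of the "true inverse of ord()" reading, assuming a helper called
byte_from_ord() in place of the proposed bytes.byte(). It accepts integers
only, exactly like the list form of the existing constructor.

```python
def byte_from_ord(i):
    """Hypothetical inverse of ord() for binary data (ints only)."""
    return bytes([i])

code = ord(b"a")                  # ord() already accepts length 1 bytes
round_trip = byte_from_ord(code)

# Length 1 bytes objects are rejected, matching the main constructor.
try:
    byte_from_ord(b"a")
except TypeError:
    rejects_bytes = True
else:
    rejects_bytes = False

# Values outside range(256) are rejected as well.
try:
    byte_from_ord(256)
except ValueError:
    rejects_out_of_range = True
else:
    rejects_out_of_range = False
```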

> >> Iteration
> >> ---------
> >>
> >> Iteration over ``bytes`` objects and other binary sequences produces
> >> integers. Rather than proposing a new method that would need to be added
> >> not only to ``bytes``, ``bytearray`` and ``memoryview``, but potentially
> >> to third party types as well, this PEP proposes that iteration to
> produce
> >> length 1 ``bytes`` objects instead be handled by combining ``map`` with
> >> the new ``bytes.byte()`` alternate constructor proposed above::
> >>
> >>     for x in map(bytes.byte, data):
> >>         # x is a length 1 ``bytes`` object, rather than an integer
> >>         # This works with *any* container of integers in the range
> >>         # 0 to 255 inclusive
> >
> > I can see why you don't like a new method, but this idiom is way too
> verbose
> > and unintuitive to ever gain traction. Let's just add a new method to all
> > three types, 3rd party types will get the message.
>
> Fair enough. Is "iterbytes()" OK as the name?:
>
>     for x in data.iterbytes():
>         # x is a length 1 ``bytes`` object, rather than an integer
>         # This works with *any* container of integers in the range
>         # 0 to 255 inclusive
>

+1
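
Until such a method exists, the behaviour can be approximated with a free
function. This is only a sketch; the function below borrows the proposed
name but is not part of any of the three types.

```python
def iterbytes(data):
    """Hypothetical free-function version of the proposed method."""
    for i in data:
        # Each element is an int in range(256); wrap it back up.
        yield bytes((i,))

ints = list(b"abc")                  # plain iteration yields integers
chunks = list(iterbytes(b"abc"))     # length 1 bytes objects instead
from_view = list(iterbytes(memoryview(b"hi")))
from_list = list(iterbytes([0, 255]))
```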

> >> Consistent support for different input types
> >> --------------------------------------------
> >>
> >> In Python 3.3, the binary search operations (``in``, ``count()``,
> >> ``find()``, ``index()``, ``rfind()`` and ``rindex()``) were updated to
> >> accept integers in the range 0 to 255 (inclusive) as their first
> argument
> >> (in addition to the existing support for binary sequences).
> >
> >
> > I wonder if that wasn't a bit over-zealous. While 'in', count() and
> index()
> > are sequence methods (looking for elements) that have an extended meaning
> > (looking for substrings) for string types, the find() and r*() variants
> are
> > only defined for strings.
>
> I suspect they're using the same underlying search code, although I
> haven't actually checked.
>

OK, it's water under the bridge anyway.
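
For the record, a short sketch of what the search operations accept in
Python 3.3 and later; the variable names are illustrative only.

```python
data = b"abcabc"

# Element-style searches with an integer argument.
has_int = 97 in data
count_int = data.count(98)
index_int = data.index(99)

# The str-derived variants take integers too.
find_int = data.find(98)
rfind_int = data.rfind(98)
missing = data.find(122)

# Subsequence searches keep working as before.
has_seq = b"bc" in data
rfind_seq = data.rfind(b"bc")
```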

[...]

> "replace()" seems like the only one where a reasonable case might be
> made for allowing integer input (and that's actually the one Brandon
> was asking about that got me thinking along these lines in the first
> place).
>

I think not. It really works on substrings, length-one strings are just a
common case.
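
A quick sketch of the current behaviour, which already matches this view:
replace() rejects integers and operates on subsequences of any length.

```python
data = b"a-b-c"

# Integer arguments are not accepted.
try:
    data.replace(45, 95)
except TypeError:
    int_rejected = True
else:
    int_rejected = False

# The length 1 case is spelled with bytes objects (45 is "-", 95 is "_").
replaced = data.replace(bytes([45]), bytes([95]))

# Longer substrings work the same way.
widened = data.replace(b"-", b"--")
```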

[...]

> I'm not sure when I'll get the PEP updated (since this isn't an urgent
> problem, I just wanted to get the initial draft of the PEP written
> while the problem was fresh in my mind), but I think the end result
> should be relatively non-controversial once I incorporate your
> feedback.
>

No hurries. And you're welcome!

-- 
--Guido van Rossum (python.org/~guido)