<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Sat, Mar 29, 2014 at 7:17 PM, Nick Coghlan <span dir="ltr"><<a href="mailto:ncoghlan@gmail.com" target="_blank">ncoghlan@gmail.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="">On 30 March 2014 07:07, Nick Coghlan <<a href="mailto:ncoghlan@gmail.com">ncoghlan@gmail.com</a>> wrote:<br>


> I already have a draft PEP written that covers the constructor issue,<br>

> iteration and adding acceptance of integer inputs to the remaining<br>

> methods that don't currently handle them. There was some background<br>

> explanation of the text/binary domain split in the Python 2->3<br>

> transition that I wanted Guido's feedback on before posting, but I<br>

> just realised I can cut that out for now, and then add it back after<br>

> Guido has had a chance to review it.<br>

><br>

> So I'll tidy that up and get the draft posted later today.<br>

<br>

</div>Guido pointed out most of the stuff I had asked him to look at wasn't<br>

actually relevant to the PEP, so I just cut most of it entirely.<br>

Suffice to say, after stepping back and reviewing them systematically<br>

for the first time in years, I believe the APIs for the core binary<br>

data types in Python 3 could do with a little sprucing up :)<br></blockquote><div><br></div><div>Thanks for cutting it down, it's easier to concentrate on the essentials now.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


Web version: <a href="http://www.python.org/dev/peps/pep-0467/" target="_blank">http://www.python.org/dev/peps/pep-0467/</a><br>

<br>

======================================<br>

PEP: 467<br>

Title: Improved API consistency for bytes and bytearray<br>

Version: $Revision$<br>

Last-Modified: $Date$<br>

Author: Nick Coghlan <<a href="mailto:ncoghlan@gmail.com">ncoghlan@gmail.com</a>><br>

Status: Draft<br>

Type: Standards Track<br>

Content-Type: text/x-rst<br>

Created: 2014-03-30<br>

Python-Version: 3.5<br>

Post-History: 2014-03-30<br>

<br>

<br>

Abstract<br>

========<br>

<br>

During the initial development of the Python 3 language specification, the<br>

core ``bytes`` type for arbitrary binary data started as the mutable type<br>

that is now referred to as ``bytearray``. Other aspects of operating in<br>

the binary domain in Python have also evolved over the course of the Python<br>

3 series.<br>

<br>

This PEP proposes a number of small adjustments to the APIs of the ``bytes``<br>

and ``bytearray`` types to make their behaviour more internally consistent<br>

and to make it easier to operate entirely in the binary domain for use cases<br>

that actually involve manipulating binary data directly, rather than<br>

converting it to a more structured form with additional modelling<br>

semantics (such as ``str``) and then converting back to binary format after<br>

processing.<br></blockquote><div><br></div><div>I hope you don't mind I cut the last 60% of this sentence (everything after "binary domain").<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<br>

<br>

Background<br>

==========<br>

<br>

Over the course of Python 3's evolution, a number of adjustments have been<br>

made to the core ``bytes`` and ``bytearray`` types as additional practical<br>

experience was gained with using them in code beyond the Python 3 standard<br>

library and test suite. However, to date, these changes have been made<br>

on a relatively ad hoc tactical basis as specific issues were identified,<br>

rather than as part of a systematic review of the APIs of these types.</blockquote><div><br></div><div>I'm not sure you can claim that. We probably have more information based on experience now than when we did the redesign. (At that time most experience was based on using str() for binary data.)<br>


</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">This<br>

approach has allowed inconsistencies to creep into the API design as to which<br>

input types are accepted by different methods. Additional inconsistencies<br>

linger from an earlier pre-release design where there was *no* separate<br>

``bytearray`` type, and instead the core ``bytes`` type was mutable (with<br>

no immutable counterpart), as well as from the origins of these types in<br>

the text-like behaviour of the Python 2 ``str`` type.<br></blockquote><div><br></div><div>You make it sound as if modeling bytes() after Python 2's str() was an accident. It wasn't.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


This PEP aims to provide the missing systematic review, with the goal of<br>

ensuring that wherever feasible (given backwards compatibility constraints)<br>

these current inconsistencies are addressed for the Python 3.5 release.<br></blockquote><div><br></div><div>I would like to convince you to aim lower, drop the "systematic review", and just focus on some changes that are likely to improve users' experience (which includes porting Python 2 code).<br>


</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

<br>

Proposals<br>

=========<br>

<br>

As a "consistency improvement" proposal, this PEP is actually about a number<br>

of smaller micro-proposals, each aimed at improving the self-consistency of<br>

the binary data model in Python 3. Proposals are motivated by one of three<br>

factors:<br>

<br>

* removing remnants of the original design of ``bytes`` as a mutable type<br></blockquote><div><br></div><div>Yes.<br><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


* more consistently accepting length 1 ``bytes`` objects as input where an<br>

  integer between ``0`` and ``255`` inclusive is expected, and vice-versa<br></blockquote><div><br></div><div>Not sure I like this as a goal. OK, stronger: I don't like this goal.<br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


* allowing users to easily convert integer output to a length 1 ``bytes``<br>

  object<br></blockquote><div><br></div><div>I think you meant integer values instead of output? In Python 2 we did this with the global function chr(), but in Python 3 that creates a str(). (The history of chr() and ord() as built-in functions is that they long predate the notion of methods (class- or otherwise), and their naming comes straight from Pascal.)<br>


<br></div><div>Anyway, I don't know that the use case is so common that it needs more than bytes([i]) or bytearray([i]) -- if there is an argument to be made for bytes.byte(i) and bytearray.byte(i) it would be that the [i] in the constructor is somewhat hard to grasp.<br>
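<br>For reference, a minimal sketch of the spellings available today. It uses only the existing constructors; the variable names are purely illustrative:<br>

```python
# Behaviour of the existing constructors in Python 3 (no new API assumed).
i = 65

one_byte = bytes([i])           # a length 1 bytes object holding the value 65
one_bytearray = bytearray([i])  # the mutable equivalent
zero_filled = bytes(3)          # an integer argument means "this many zero bytes"
text = chr(i)                   # in Python 3, chr() returns str, not bytes

print(one_byte, one_bytearray, zero_filled, repr(text))
```

The contrast between bytes([i]) and bytes(i) is the part that is arguably hard to grasp.<br>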


</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

<br>

Alternate Constructors<br>

----------------------<br>

<br>

The ``bytes`` and ``bytearray`` constructors currently accept an integer<br>

argument, but interpret it to mean a zero-filled object of the given length.<br></blockquote><div><br></div><div>This is one of the two legacies of the original "mutable bytes" design, and I agree we should strive to replace it -- although I think one round of deprecation may be too quick. (The other legacy is of course that b[i] is an int, not a bytes -- it's the worse problem, but I don't think we can fix it without breaking more than the fix would be worth.)<br>
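<br>A small sketch of the b[i] legacy as it stands in Python 3; slicing is the usual way to get a length 1 bytes object instead of an int:<br>

```python
b = b"abc"

first_item = b[0]     # indexing yields an int
first_slice = b[0:1]  # slicing yields a length 1 bytes object

print(first_item, first_slice)
```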


</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

This is a legacy of the original design of ``bytes`` as a mutable type,<br>

rather than a particularly intuitive behaviour for users. It has become<br>

especially confusing now that other ``bytes`` interfaces treat integers<br>

and the corresponding length 1 bytes instances as equivalent input.<br>

Compare::<br>

<br>

    >>> b"\x03" in bytes([1, 2, 3])<br>

    True<br>

    >>> 3 in bytes([1, 2, 3])<br>

    True<br>

<br>

    >>> bytes(b"\x03")<br>

    b'\x03'<br>

<div class="">    >>> bytes(3)<br>

    b'\x00\x00\x00'<br>

<br>

</div>This PEP proposes that the current handling of integers in the bytes and<br>

bytearray constructors be deprecated in Python 3.5 and removed in Python<br>

3.6, being replaced by two more type appropriate alternate constructors<br>

provided as class methods. The initial python-ideas thread [ideas-thread1]_<br>

that spawned this PEP was specifically aimed at deprecating this constructor<br>

behaviour.<br>

<br>

For ``bytes``, a ``byte`` constructor is proposed that converts integers<br>

(as indicated by ``operator.index``)</blockquote><div><br></div><div>I know why you reference this, but it feels confusing to me. At this point in the narrative it's better to just say "integer" and explain how it decides "integer-ness" later.<br>


</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">in the appropriate range to a ``bytes``<br>

object, converts objects that support the buffer API to bytes, and also<br>

passes through length 1 byte strings unchanged::<br></blockquote><div><br></div><div>I think the second half (accepting bytes instances of length 1) is wrong here and doesn't actually have a practical use case. I'll say more below.<br>


  <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

    >>> bytes.byte(3)<br>

    b'\x03'<br>

    >>> bytes.byte(bytearray(bytes([3])))<br>

    b'\x03'<br>

    >>> bytes.byte(memoryview(bytes([3])))<br>

    b'\x03'<br>

    >>> bytes.byte(bytes([3]))<br>

    b'\x03'<br>

    >>> bytes.byte(512)<br>

<div class="">    Traceback (most recent call last):<br>

      File "<stdin>", line 1, in <module><br>

</div>    ValueError: bytes must be in range(0, 256)<br>

    >>> bytes.byte(b"ab")<br>

<div class="">    Traceback (most recent call last):<br>

      File "<stdin>", line 1, in <module><br>

</div>    TypeError: bytes.byte() expected a byte, but buffer of length 2 found<br>

<br>

One specific use case for this alternate constructor is to easily convert<br>

the result of indexing operations on ``bytes`` and other binary sequences<br>

from an integer to a ``bytes`` object. The documentation for this API<br>

should note that its counterpart for the reverse conversion is ``ord()``.<br></blockquote><div><br></div><div>However, in a pinch, b[0] will do as well, assuming you don't need the length check implied by ord(). <br>
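<br>To illustrate the difference, a short sketch using only existing built-ins (the variable names are just for illustration):<br>

```python
single = b"a"
pair = b"ab"

# For a length 1 object the two spellings agree.
assert ord(single) == single[0] == 97

# b[0] silently takes the first item of a longer sequence...
first_of_pair = pair[0]

# ...whereas ord() insists on exactly one item.
try:
    ord(pair)
except TypeError as exc:
    ord_error = exc
else:
    ord_error = None

print(first_of_pair, type(ord_error).__name__)
```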


</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

For ``bytearray``, a ``from_len`` constructor is proposed that preallocates<br>

the buffer filled with a particular value (default to ``0``) as a direct<br>

replacement for the current constructor behaviour, rather than having to use<br>

sequence repetition to achieve the same effect in a less intuitive way::<br>

<br>

    >>> bytearray.from_len(3)<br>

    bytearray(b'\x00\x00\x00')<br>

    >>> bytearray.from_len(3, 6)<br>

    bytearray(b'\x06\x06\x06')<br>

<br>

This part of the proposal was covered by an existing issue<br>

[empty-buffer-issue]_ and a variety of names have been proposed<br>

(``empty_buffer``, ``zeros``, ``zeroes``, ``allnull``, ``fill``). The<br>

specific name currently proposed was chosen by analogy with<br>

``dict.fromkeys()`` and ``itertools.chain.from_iterable()`` to be completely<br>

explicit that it is an alternate constructor rather than an in-place<br>

mutation, as well as how it differs from the standard constructor.<br></blockquote><div><br></div><div>I think you need to brainstorm more on the name; from_len() looks pretty awkward. And I think it's better to add it to bytes() as well, since the two classes intentionally try to be as similar as possible.<br>
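<br>Whatever name is chosen, these are the spellings it would sit alongside today. This is a sketch using only current behaviour, and it works the same way for both types:<br>

```python
# Zero-filled objects via the current integer constructor behaviour.
zeros_bytes = bytes(3)
zeros_array = bytearray(3)

# Objects filled with another value via sequence repetition.
sixes_bytes = b"\x06" * 3
sixes_array = bytearray(b"\x06") * 3

print(zeros_bytes, zeros_array, sixes_bytes, sixes_array)
```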


</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

<br>

Open questions<br>

^^^^^^^^^^^^^^<br>

<br>

* Should ``bytearray.byte()`` also be added? Or is<br>

  ``bytearray(bytes.byte(x))`` sufficient for that case?<br></blockquote><div><br></div><div>It should be added.<br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


* Should ``bytes.from_len()`` also be added? Or is sequence repetition<br>

  sufficient for that case?<br></blockquote><div><br></div><div>It should be added.<br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

* Should ``bytearray.from_len()`` use a different name?<br></blockquote><div><br></div><div>Yes.<br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

* Should ``bytes.byte()`` raise ``TypeError`` or ``ValueError`` for binary<br>

  sequences with more than one element? The ``TypeError`` currently proposed<br>

  is copied (with slightly improved wording) from the behaviour of ``ord()``<br>

  with sequences containing more than one code point, while ``ValueError``<br>

  would be more consistent with the existing handling of out-of-range<br>

  integer values.<br></blockquote><div><br></div><div>It should not accept any bytes arguments. But if somehow you convince me otherwise, it should be ValueError (and honestly, ord() is wrong there).<br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


* ``bytes.byte()`` is defined above as accepting length 1 binary sequences<br>

  as individual bytes, but this is currently inconsistent with the main<br>

  ``bytes`` constructor::<br>

<br>

      >>> bytes([b"a", b"b", b"c"])<br>

<div class="">      Traceback (most recent call last):<br>

        File "<stdin>", line 1, in <module><br>

</div>      TypeError: 'bytes' object cannot be interpreted as an integer<br>

<br>

  Should the ``bytes`` constructor be changed to accept iterables of length 1<br>

  bytes objects in addition to iterables of integers? If so, should it<br>

  allow a mixture of the two in a single iterable?<br></blockquote><div><br></div><div>Noooooooooooooooooooooooooo!!!!!<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<br>

<br>

Iteration<br>

---------<br>

<br>

Iteration over ``bytes`` objects and other binary sequences produces<br>

integers. Rather than proposing a new method that would need to be added<br>

not only to ``bytes``, ``bytearray`` and ``memoryview``, but potentially<br>

to third party types as well, this PEP proposes that iteration to produce<br>

length 1 ``bytes`` objects instead be handled by combining ``map`` with<br>

the new ``bytes.byte()`` alternate constructor proposed above::<br>

<div class=""><br>

    for x in map(bytes.byte, data):<br>

</div>        # x is a length 1 ``bytes`` object, rather than an integer<br>

        # This works with *any* container of integers in the range<br>

        # 0 to 255 inclusive<br></blockquote><div><br></div><div>I can see why you don't like a new method, but this idiom is way too verbose and unintuitive to ever gain traction. Let's just add a new method to all three types, 3rd party types will get the message.<br>
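<br>As a rough sketch of what such a method would do, here is a free function; the name iterbytes is hypothetical and used only for this illustration, not a proposed spelling:<br>

```python
def iterbytes(data):
    """Yield each item of a binary sequence as a length 1 bytes object.

    ``iterbytes`` is a hypothetical name used only for this sketch.
    """
    for value in data:
        # Each item is an int in range(256); wrap it in a one-item list
        # so the bytes constructor does not treat it as a length.
        yield bytes([value])


for source in (b"ab", bytearray(b"ab"), memoryview(b"ab")):
    print(list(iterbytes(source)))
```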


</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

<br>

Consistent support for different input types<br>

--------------------------------------------<br>

<br>

In Python 3.3, the binary search operations (``in``, ``count()``,<br>

``find()``, ``index()``, ``rfind()`` and ``rindex()``) were updated to<br>

accept integers in the range 0 to 255 (inclusive) as their first argument<br>

(in addition to the existing support for binary sequences).<br></blockquote><div><br></div><div>I wonder if that wasn't a bit over-zealous. While 'in', count() and index() are sequence methods (looking for elements) that have an extended meaning (looking for substrings) for string types, the find() and r*() variants are only defined for strings.<br>
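<br>For concreteness, a sketch of how these operations behave in Python 3.3 and later; the sample data is arbitrary:<br>

```python
data = b"abcabc"

has_int = 98 in data         # element lookup, as for any sequence
has_sub = b"bc" in data      # substring lookup, the string-style extension
count_int = data.count(97)   # sequence method
index_int = data.index(99)   # sequence method
find_int = data.find(98)     # string-only method, also accepts ints
rfind_int = data.rfind(98)
rindex_int = data.rindex(99)

print(has_int, has_sub, count_int, index_int, find_int, rfind_int, rindex_int)
```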


</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

This PEP proposes extending that behaviour of accepting integers as being<br>

equivalent to the corresponding length 1 binary sequence to several other<br>

``bytes`` and ``bytearray`` methods that currently expect a ``bytes``<br>

object for certain parameters. In essence, if a value is an acceptable<br>

input to the new ``bytes.byte`` constructor defined above, then it would<br>

be acceptable in the roles defined here (in addition to any other already<br>

supported inputs):<br>

<br>

* ``startswith()`` prefix(es)<br>

* ``endswith()`` suffix(es)<br>

<br>

* ``center()`` fill character<br>

* ``ljust()`` fill character<br>

* ``rjust()`` fill character<br>

<br>

* ``strip()`` character to strip<br>

* ``lstrip()`` character to strip<br>

* ``rstrip()`` character to strip<br>

<br>

* ``partition()`` separator argument<br>

* ``rpartition()`` separator argument<br>

<br>

* ``split()`` separator argument<br>

* ``rsplit()`` separator argument<br>

<br>

* ``replace()`` old value and new value<br>

<br>

In addition to the consistency motive, this approach also makes it easier<br>

to work with the indexing behaviour, as the result of an indexing operation<br>

can more easily be fed back in to other methods.<br></blockquote><div><br></div><div>I think herein lies madness. The intention seems to be to paper over as much as possible the unfortunate behavior of b[i]. But how often does any of these methods get called with such a construct? And how often will that be in a context where this is the *only* thing that is affected by b[i] returning an int in Python 3 but a string in Python 2? (In my experience these are mostly called with literal arguments, except inside wrapper functions that are themselves intended to be called with a literal argument.) Weakening the type checking here seems a bad idea -- it would accept integers in *any* context, and that would just cause more nasty debugging issues.<br>
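<br>A sketch of the current behaviour, where the type check catches an int passed as a separator and the explicit conversions keep the intent visible at the call site (sample data is arbitrary):<br>

```python
data = b"a,b,c"
sep = data[1]  # 44, an int, because indexing yields integers

# Today an int separator is rejected outright.
try:
    data.split(sep)
except TypeError:
    rejected = True
else:
    rejected = False

# Converting explicitly (or slicing instead of indexing) works as-is.
parts_from_int = data.split(bytes([sep]))
parts_from_slice = data.split(data[1:2])

print(rejected, parts_from_int, parts_from_slice)
```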


<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

For ``bytearray``, some additional changes are proposed to the current<br>

integer based operations to ensure they remain consistent with the proposed<br>

constructor changes:<br>

<br>

* ``append()``: updated to be consistent with ``bytes.byte()``<br>

* ``remove()``: updated to be consistent with ``bytes.byte()``<br>

* ``+=``: updated to be consistent with ``bytes()`` changes (if any)<br>

* ``extend()``: updated to be consistent with ``bytes()`` changes (if any)<br></blockquote><div><br></div><div>Eew again. These are operations from the MutableSequence ABC and there is no reason to make their signatures fuzzier.<br>
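<br>A sketch of the current signatures, in which items are integers and iterables are iterables of integers:<br>

```python
buf = bytearray(b"ab")

buf.append(99)     # items are integers
buf.extend(b"de")  # iterables of integers, which includes bytes objects
buf += b"f"        # same as extend(), in place
buf.remove(97)     # removes the first item equal to 97

# A length 1 bytes object is not an item, so append() rejects it.
try:
    buf.append(b"g")
except TypeError:
    append_rejected = True
else:
    append_rejected = False

print(buf, append_rejected)
```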


 <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

<br>

Acknowledgement of surprising behaviour of some ``bytearray`` methods<br>

---------------------------------------------------------------------<br>

<br>

Several of the ``bytes`` and ``bytearray`` methods have their origins in the<br>

Python 2 ``str`` API.</blockquote><div><br></div><div>You make it sound as if this is a bad thing or an accident.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


As ``str`` is an immutable type, all of these<br>

operations are defined as returning a *new* instance, rather than operating<br>

in place. This contrasts with methods on other mutable types like ``list``,<br>

where ``list.sort()`` and ``list.reverse()`` operate in-place and return<br>

``None``, rather than creating a new object.<br></blockquote><div><br></div><div>So does bytestring.reverse(). And if you really insist we can add bytestring.sort(). :-)<br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<br>

Backwards compatibility constraints make it impractical to change this<br>

behaviour at this point, but it may be appropriate to explicitly call out<br>

this quirk in the documentation for the ``bytearray`` type. It affects the<br>

following methods that could reasonably be expected to operate in-place on<br>

a mutable type:<br>

<br>

* ``center()``<br>

* ``ljust()``<br>

* ``rjust()``<br>

* ``strip()``<br>

* ``lstrip()``<br>

* ``rstrip()``<br>

* ``replace()``<br>

* ``lower()``<br>

* ``upper()``<br>

* ``swapcase()``<br>

* ``title()``<br>

* ``capitalize()``<br>

* ``translate()``<br>

* ``expandtabs()``<br>

* ``zfill()``<br></blockquote><div><br></div><div>That all feels like hypercorrection. These are string methods and it would be completely wrong if bytearray changed them to modify the object in-place. I also don't see why anyone would think these would modify the object, given that everybody encounters these first for the str() type, then for bytes(), then finally (by extension) for bytearray().<br>


<br>The *only* place where there should be any confusion about whether the value is mutated or the variable is updated with a new object would be the += operator (and *=) but that's due to that operator's ambiguity.<br>
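<br>A short sketch of the distinction, using aliases to show which operations mutate the object and which rebind the name (the variable names are just for illustration):<br>

```python
buf = bytearray(b"abc")
alias = buf

upper = buf.upper()  # string-style method: returns a new bytearray
buf.reverse()        # sequence method: mutates in place, returns None
buf += b"!"          # mutable type: extends the same object

frozen = b"abc"
frozen_alias = frozen
frozen += b"!"       # immutable type: rebinds the name to a new object

print(upper, alias, alias is buf)
print(frozen, frozen_alias)
```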


</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Note that the following ``bytearray`` operations *do* operate in place, as<br>

they're part of the mutable sequence API in ``bytearray``, rather than being<br>

inspired by the immutable Python 2 ``str`` API:<br>

<br>

* ``+=``<br>

* ``append()``<br>

* ``extend()``<br>

* ``reverse()``<br>

* ``remove()``<br>

* ``pop()``<br></blockquote><div><br></div><div>Right. And there's nothing wrong with this. <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

<br>

References<br>

==========<br>

<br>

.. [ideas-thread1]<br>

<a href="https://mail.python.org/pipermail/python-ideas/2014-March/027295.html" target="_blank">https://mail.python.org/pipermail/python-ideas/2014-March/027295.html</a><br>

.. [empty-buffer-issue] <a href="http://bugs.python.org/issue20895" target="_blank">http://bugs.python.org/issue20895</a><br>

<br>

<br>

Copyright<br>

=========<br>

<br>

This document has been placed in the public domain.<br>

<div class="HOEnZb"><div class="h5"><br>

--<br>

Nick Coghlan   |   <a href="mailto:ncoghlan@gmail.com">ncoghlan@gmail.com</a>   |   Brisbane, Australia<br>


</div></div></blockquote></div><br><br clear="all"><br>-- <br>--Guido van Rossum (<a href="http://python.org/~guido">python.org/~guido</a>)

</div></div>