PEP 467: Minor bytes and bytearray improvements
The final PEP with SC feedback incorporated and one last addition: `bytes.ascii` as a replacement for the Python 2 idiom of `str(some_var)` to get the bytes version, and the Python 3 workaround of either the correct `str(some_var).encode('ascii')` or the potentially wrong `ascii(some_var).encode('ascii')`.

The rendered version is at https://www.python.org/dev/peps/pep-0467/

Happy reading!

PEP: 467
Title: Minor API improvements for binary sequences
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan <ncoghlan@gmail.com>, Ethan Furman <ethan@stoneleaf.us>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 30-Mar-2014
Python-Version: 3.11
Post-History: 2014-03-30 2014-08-15 2014-08-16 2016-06-07 2016-09-01 2021-04-13 2021-11-03


Abstract
========

This PEP proposes five small adjustments to the APIs of the ``bytes`` and
``bytearray`` types to make it easier to operate entirely in the binary domain:

* Add ``fromsize`` alternative constructor
* Add ``fromint`` alternative constructor
* Add ``ascii`` alternative constructor
* Add ``getbyte`` byte retrieval method
* Add ``iterbytes`` alternative iterator


Rationale
=========

During the initial development of the Python 3 language specification, the
core ``bytes`` type for arbitrary binary data started as the mutable type
that is now referred to as ``bytearray``. Other aspects of operating in
the binary domain in Python have also evolved over the course of the Python
3 series, for example with PEP 461.


Motivation
==========

With Python 3 and the split between ``str`` and ``bytes``, one small but
important area of programming became slightly more difficult, and much more
painful -- wire format protocols.

This area of programming is characterized by a mixture of binary data and
ASCII compatible segments of text (aka ASCII-encoded text). The addition of
the new constructors, methods, and iterators will aid both in writing new
wire format code, and in porting any remaining Python 2 wire format code.

Common use-cases include ``dbf`` and ``pdf`` file formats, ``email``
formats, and ``FTP`` and ``HTTP`` communications, among many others.


Proposals
=========

Addition of explicit "count and byte initialised sequence" constructors
-----------------------------------------------------------------------

To replace the now discouraged behavior, this PEP proposes the addition of an
explicit ``fromsize`` alternative constructor as a class method on both
``bytes`` and ``bytearray`` whose first argument is the count, and whose
second argument is the fill byte to use (defaults to ``\x00``)::

    >>> bytes.fromsize(3)
    b'\x00\x00\x00'
    >>> bytearray.fromsize(3)
    bytearray(b'\x00\x00\x00')
    >>> bytes.fromsize(5, b'\x0a')
    b'\x0a\x0a\x0a\x0a\x0a'
    >>> bytearray.fromsize(5, fill=b'\x0a')
    bytearray(b'\x0a\x0a\x0a\x0a\x0a')

``fromsize`` will behave just as the current constructors behave when passed
a single integer, while allowing for non-zero fill values when needed.


Addition of explicit "single byte" constructors
-----------------------------------------------

As binary counterparts to the text ``chr`` function, this PEP proposes
the addition of an explicit ``fromint`` alternative constructor as a class
method on both ``bytes`` and ``bytearray``::

    >>> bytes.fromint(65)
    b'A'
    >>> bytearray.fromint(65)
    bytearray(b'A')

These methods will only accept integers in the range 0 to 255 (inclusive)::

    >>> bytes.fromint(512)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: integer must be in range(0, 256)

    >>> bytes.fromint(1.0)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'float' object cannot be interpreted as an integer

The documentation of the ``ord`` builtin will be updated to explicitly note
that ``bytes.fromint`` is the primary inverse operation for binary data, while
``chr`` is the inverse operation for text data, and that ``bytearray.fromint``
also exists.

Behaviorally, ``bytes.fromint(x)`` will be equivalent to the current
``bytes([x])`` (and similarly for ``bytearray``). The new spelling is
expected to be easier to discover and easier to read (especially when used
in conjunction with indexing operations on binary sequence types).

As a separate method, the new spelling will also work better with higher
order functions like ``map``.

These new methods intentionally do NOT offer the same level of general integer
support as the existing ``int.to_bytes`` conversion method, which allows
arbitrarily large integers to be converted to arbitrarily long bytes objects. The
restriction to only accept positive integers that fit in a single byte means
that no byte order information is needed, and there is no need to handle
negative numbers. The documentation of the new methods will refer readers to
``int.to_bytes`` for use cases where handling of arbitrary integers is needed.


Addition of "ascii" constructors
--------------------------------

In Python 2 converting an object, such as the integer ``123``, to bytes (aka the
Python 2 ``str``) was as simple as::

    >>> str(123)
    '123'

With Python 3 that became the more verbose::

    >>> b'%d' % 123

or even::

    >>> str(123).encode('ascii')

This PEP proposes that an ``ascii`` method be added to ``bytes`` and ``bytearray``
to handle this use-case::

    >>> bytes.ascii(123)
    b'123'

Note that ``bytes.ascii()`` would handle simple ascii-encodable text correctly,
unlike the ``ascii()`` built-in::

    >>> ascii("hello").encode('ascii')
    b"'hello'"


Addition of "getbyte" method to retrieve a single byte
------------------------------------------------------

This PEP proposes that ``bytes`` and ``bytearray`` gain the method ``getbyte``
which will always return ``bytes``::

    >>> b'abc'.getbyte(0)
    b'a'

If an index is asked for that doesn't exist, ``IndexError`` is raised::

    >>> b'abc'.getbyte(9)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IndexError: index out of range


Addition of optimised iterator methods that produce ``bytes`` objects
---------------------------------------------------------------------

This PEP proposes that ``bytes`` and ``bytearray`` gain an optimised
``iterbytes`` method that produces length 1 ``bytes`` objects rather than
integers::

    for x in data.iterbytes():
        # x is a length 1 ``bytes`` object, rather than an integer

For example::

    >>> tuple(b"ABC".iterbytes())
    (b'A', b'B', b'C')


Design discussion
=================

Why not rely on sequence repetition to create zero-initialised sequences?
-------------------------------------------------------------------------

Zero-initialised sequences can be created via sequence repetition::

    >>> b'\x00' * 3
    b'\x00\x00\x00'
    >>> bytearray(b'\x00') * 3
    bytearray(b'\x00\x00\x00')

However, this was also the case when the ``bytearray`` type was originally
designed, and the decision was made to add explicit support for it in the
type constructor. The immutable ``bytes`` type then inherited that feature
when it was introduced in PEP 3137.

This PEP isn't revisiting that original design decision, just changing the
spelling as users sometimes find the current behavior of the binary sequence
constructors surprising. In particular, there's a reasonable case to be made
that ``bytes(x)`` (where ``x`` is an integer) should behave like the
``bytes.fromint(x)`` proposal in this PEP. Providing both behaviors as separate
class methods avoids that ambiguity.


Omitting the originally proposed builtin function
-------------------------------------------------

When submitted to the Steering Council, this PEP proposed the introduction of
a ``bchr`` builtin (with the same behaviour as ``bytes.fromint``), recreating
the ``ord``/``chr``/``unichr`` trio from Python 2 under a different naming
scheme (``ord``/``bchr``/``chr``).

The SC indicated they didn't think this functionality was needed often enough
to justify offering two ways of doing the same thing, especially when one of
those ways was a new builtin function.
That part of the proposal was therefore dropped as being redundant with the
``bytes.fromint`` alternate constructor.

Developers that use this method frequently will instead have the option to
define their own ``bchr = bytes.fromint`` aliases.


Scope limitation: memoryview
----------------------------

Updating ``memoryview`` with the new item retrieval methods is outside the scope
of this PEP.


References
==========

.. [1] Initial March 2014 discussion thread on python-ideas
   (https://mail.python.org/pipermail/python-ideas/2014-March/027295.html)
.. [2] Guido's initial feedback in that thread
   (https://mail.python.org/pipermail/python-ideas/2014-March/027376.html)
.. [3] Issue proposing moving zero-initialised sequences to a dedicated API
   (http://bugs.python.org/issue20895)
.. [4] Issue proposing to use calloc() for zero-initialised binary sequences
   (http://bugs.python.org/issue21644)
.. [5] August 2014 discussion thread on python-dev
   (https://mail.python.org/pipermail/python-ideas/2014-March/027295.html)
.. [6] June 2016 discussion thread on python-dev
   (https://mail.python.org/pipermail/python-dev/2016-June/144875.html)


Copyright
=========

This document has been placed in the public domain.
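
For readers who want to try the single-byte parts of the proposal on a current Python, the following is a rough sketch of stand-ins built only from operations that exist today. The helper names `from_int`, `get_byte` and `iter_bytes` are hypothetical; they are not part of the PEP or of any Python release, and the handling of negative indexes in `get_byte` simply follows normal indexing, which the PEP text does not spell out.

```python
from typing import Iterator, Union

BytesLike = Union[bytes, bytearray]


def from_int(value: int) -> bytes:
    """Hypothetical stand-in for the proposed bytes.fromint."""
    # bytes([x]) already rejects values outside range(0, 256) with
    # ValueError and non-integers such as 1.0 with TypeError.
    return bytes([value])


def get_byte(data: BytesLike, index: int) -> bytes:
    """Hypothetical stand-in for the proposed getbyte method."""
    # Plain indexing raises IndexError for a missing index, whereas the
    # slice data[index:index + 1] would silently return b''.
    return bytes([data[index]])


def iter_bytes(data: BytesLike) -> Iterator[bytes]:
    """Hypothetical stand-in for the proposed iterbytes method."""
    for value in data:
        yield bytes([value])


print(from_int(65))                  # b'A'
print(get_byte(b"abc", 0))           # b'a'
print(tuple(iter_bytes(b"ABC")))     # (b'A', b'B', b'C')
print(list(map(from_int, b"Hi")))    # [b'H', b'i']
```

These are pure-Python approximations, so they say nothing about the "optimised" aspect of the proposed iterator.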
On Thu, Nov 4, 2021 at 12:01 AM Ethan Furman <ethan@stoneleaf.us> wrote:
>>> bytearray.fromsize(5, fill=b'\x0a') bytearray(b'\x0a\x0a\x0a\x0a\x0a')
What happens if you supply more than one byte for the fill argument? Silent truncation, raise ValueError('too long') or ???
On Thu, Nov 4, 2021 at 10:37 AM Eric Fahlgren <ericfahlgren@gmail.com> wrote:
On Thu, Nov 4, 2021 at 12:01 AM Ethan Furman <ethan@stoneleaf.us> wrote:
>>> bytearray.fromsize(5, fill=b'\x0a') bytearray(b'\x0a\x0a\x0a\x0a\x0a')
What happens if you supply more than one byte for the fill argument? Silent truncation, raise ValueError('too long') or ???
It would seem reasonable to me for a multi-byte sequence to be filled as-is in a repeating pattern, perhaps truncating the last repetition if the size is not an even multiple of len(fill). At least that's the intuitive behavior for me. That said, I don't know if such behavior would be useful in practice (i.e. whether there's a use case for it).
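
As an illustration only, the repeating-pattern behaviour described here could look like the sketch below. The function name `fill_repeating` is hypothetical, and rejecting an empty fill or a negative size is an assumption made for the sketch; the PEP does not define any multi-byte fill semantics.

```python
def fill_repeating(size: int, fill: bytes = b"\x00") -> bytes:
    """Repeat ``fill`` as a pattern and cut the result to ``size`` bytes."""
    if size < 0:
        raise ValueError("negative count")
    if not fill:
        raise ValueError("fill must not be empty")
    # Ceiling division: enough repetitions to cover ``size`` bytes.
    repeats = -(-size // len(fill))
    # The slice truncates the last repetition when size is not a
    # multiple of len(fill).
    return (fill * repeats)[:size]


print(fill_repeating(5, b"ab"))   # b'ababa'
print(fill_repeating(4, b"ab"))   # b'abab'
print(fill_repeating(3))          # b'\x00\x00\x00'
```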
On Fri, Nov 5, 2021 at 2:59 AM Jonathan Goble <jcgoble3@gmail.com> wrote:
On Thu, Nov 4, 2021 at 10:37 AM Eric Fahlgren <ericfahlgren@gmail.com> wrote:
On Thu, Nov 4, 2021 at 12:01 AM Ethan Furman <ethan@stoneleaf.us> wrote:
>>> bytearray.fromsize(5, fill=b'\x0a') bytearray(b'\x0a\x0a\x0a\x0a\x0a')
What happens if you supply more than one byte for the fill argument? Silent truncation, raise ValueError('too long') or ???
It would seem reasonable to me for a multi-byte sequence to be filled as-is in a repeating pattern, perhaps truncating the last repetition if the size is not an even multiple of len(fill). At least that's the intuitive behavior for me.
That said, I don't know if such behavior would be useful in practice (i.e. whether there's a use case for it).
It's definitely useful behaviour, but aligns better with sequence multiplication than a fill= constructor parameter. My expectation (or if you prefer: my preferred shed colour) would be ValueError.

ChrisA
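
A minimal sketch of the strict reading, assuming a fill of any length other than one raises ValueError. The name `from_size` is a hypothetical stand-in for the proposed `bytes.fromsize`, and both the error message and the negative-count check (mirroring what `bytes(-1)` does today) are assumptions rather than anything the PEP specifies.

```python
def from_size(size: int, fill: bytes = b"\x00") -> bytes:
    """Hypothetical strict stand-in for the proposed bytes.fromsize."""
    if size < 0:
        raise ValueError("negative count")
    if len(fill) != 1:
        raise ValueError("fill must be exactly one byte")
    return fill * size


print(from_size(3))             # b'\x00\x00\x00'
print(from_size(2, b"A"))       # b'AA'

# Multi-byte patterns stay with sequence multiplication.
print(b"ab" * 3)                # b'ababab'
```

Under this reading the constructor stays unambiguous, and callers who want a pattern write the multiplication themselves.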
The ascii() constructor is not well specified by the PEP. There are only a few examples. I don't understand how it's supposed to be implemented. Would you mind elaborating on its specification? Is it implemented "like" ascii(obj).encode("ascii") but with minor changes? What changes?

Victor
On 11/8/21 4:45 AM, Victor Stinner wrote:
Is it implemented "like" ascii(obj).encode("ascii") but with minor changes? What changes?
It works like `str()`, but you get ascii-encoded bytes (or an exception if that's not possible). The difference with the built-in ascii is the absence of extra quotes and the `b` indicator when a string is used:

```
>>> u_var = u'abc'
>>> b_var = b'abc'

>>> str(u_var)
'abc'

>>> str(b_var)
"b'abc'"

>>> ascii(b_var)
"b'abc'"

>>> b'%a' % (u_var)  # the docs will be updated to refer to %a as "ascii-repr"
b"'abc'"             # as it mirrors %r but only returns ascii-encoded bytes

>>> bytes.ascii(u_var)
b'abc'
```

-- ~Ethan~
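
Read literally, "works like `str()`, but you get ascii-encoded bytes" suggests something like the sketch below. The helper name `ascii_bytes` is hypothetical, and treating the method as exactly `str(obj).encode('ascii')` is an assumption drawn from this description, not a confirmed specification.

```python
def ascii_bytes(obj: object) -> bytes:
    """Hypothetical stand-in for the proposed bytes.ascii."""
    # str() gives the plain text form; encoding raises
    # UnicodeEncodeError when that text is not pure ASCII.
    return str(obj).encode("ascii")


print(ascii_bytes(123))                  # b'123'
print(ascii_bytes("abc"))                # b'abc'

# The ascii() built-in keeps the repr quotes instead.
print(ascii("abc").encode("ascii"))      # b"'abc'"

try:
    ascii_bytes("caf\u00e9")
except UnicodeEncodeError:
    print("not ASCII-encodable")
```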
On Mon, Nov 8, 2021 at 8:21 PM Ethan Furman <ethan@stoneleaf.us> wrote:
The difference with the built-in ascii is the absence of extra quotes and the `b` indicator when a string is used:
```
>>> u_var = u'abc'
>>> bytes.ascii(u_var)
b'abc'
```

What about bytes, bytearray and memoryview? What is the expected behavior? I expect that memoryview is not supported (returning something like b'<memory at 0x7fca8602c700>'), and that bytes and bytearray are copied without adding a "b" prefix or quotes:

bytes.ascii(b'abc') == b'abc'
bytes.ascii(bytearray(b'abc')) == b'abc'

I just suggest elaborating the specification in the PEP.

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
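
To make the question concrete, here is a sketch of the behaviour Victor says he expects. The name `ascii_bytes_passthrough` is hypothetical, and both the pass-through for `bytes`/`bytearray` and the fallback to `str()` for everything else (including `memoryview`) are assumptions; the thread does not confirm that the PEP will specify it this way.

```python
def ascii_bytes_passthrough(obj: object) -> bytes:
    """Hypothetical bytes.ascii variant that copies binary input as-is."""
    if isinstance(obj, (bytes, bytearray)):
        # Copy the content without adding a b prefix or quotes.
        return bytes(obj)
    # Everything else goes through str(), so a memoryview ends up as
    # its repr-style text rather than its contents.
    return str(obj).encode("ascii")


print(ascii_bytes_passthrough(b"abc"))               # b'abc'
print(ascii_bytes_passthrough(bytearray(b"abc")))    # b'abc'
print(ascii_bytes_passthrough("abc"))                # b'abc'
print(ascii_bytes_passthrough(memoryview(b"abc")))   # starts with b'<memory at 0x'
```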