PEP 467: Minor API improvements for bytes & bytearray
I just posted an updated version of PEP 467 after recently finishing the updates to the Python 3.4+ binary sequence docs to decouple them from the str docs.

Key points in the proposal:

* deprecate passing integers to bytes() and bytearray()
* add bytes.zeros() and bytearray.zeros() as a replacement
* add bytes.byte() and bytearray.byte() as counterparts to ord() for binary data
* add bytes.iterbytes(), bytearray.iterbytes() and memoryview.iterbytes()

As far as I am aware, that last item poses the only open question, with the alternative being to add an "iterbytes" builtin with a definition along the lines of the following:

    def iterbytes(data):
        try:
            getiter = type(data).__iterbytes__
        except AttributeError:
            iter = map(bytes.byte, data)
        else:
            iter = getiter(data)
        return iter

Regards,
Nick.

PEP URL: http://www.python.org/dev/peps/pep-0467/

Full PEP text:
=============================
PEP: 467
Title: Minor API improvements for bytes and bytearray
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan <ncoghlan@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2014-03-30
Python-Version: 3.5
Post-History: 2014-03-30 2014-08-15

Abstract
========

During the initial development of the Python 3 language specification, the core ``bytes`` type for arbitrary binary data started as the mutable type that is now referred to as ``bytearray``. Other aspects of operating in the binary domain in Python have also evolved over the course of the Python 3 series.

This PEP proposes a number of small adjustments to the APIs of the ``bytes`` and ``bytearray`` types to make it easier to operate entirely in the binary domain.

Background
==========

To simplify the task of writing the Python 3 documentation, the ``bytes`` and ``bytearray`` types were documented primarily in terms of the way they differed from the Unicode based Python 3 ``str`` type. Even when I `heavily revised the sequence documentation <http://hg.python.org/cpython/rev/463f52d20314>`__ in 2012, I retained that simplifying shortcut.

However, it turns out that this approach to the documentation of these types had a problem: it doesn't adequately introduce users to their hybrid nature, where they can be manipulated *either* as a "sequence of integers" type, *or* as ``str``-like types that assume ASCII compatible data.

That oversight has now been corrected, with the binary sequence types now being documented entirely independently of the ``str`` documentation in `Python 3.4+ <https://docs.python.org/3/library/stdtypes.html#binary-sequence-types-bytes-bytearray-memoryview>`__

The confusion isn't just a documentation issue, however, as there are also some lingering design quirks from an earlier pre-release design where there was *no* separate ``bytearray`` type, and instead the core ``bytes`` type was mutable (with no immutable counterpart).

Finally, additional experience with using the existing Python 3 binary sequence types in real world applications has suggested it would be beneficial to make it easier to convert integers to length 1 bytes objects.

Proposals
=========

As a "consistency improvement" proposal, this PEP is actually about a few smaller micro-proposals, each aimed at improving the usability of the binary data model in Python 3. Proposals are motivated by one of two main factors:

* removing remnants of the original design of ``bytes`` as a mutable type
* allowing users to easily convert integer values to a length 1 ``bytes`` object

Alternate Constructors
----------------------

The ``bytes`` and ``bytearray`` constructors currently accept an integer argument, but interpret it to mean a zero-filled object of the given length. This is a legacy of the original design of ``bytes`` as a mutable type, rather than a particularly intuitive behaviour for users. It has become especially confusing now that some other ``bytes`` interfaces treat integers and the corresponding length 1 bytes instances as equivalent input. Compare::

    >>> b"\x03" in bytes([1, 2, 3])
    True
    >>> 3 in bytes([1, 2, 3])
    True

    >>> bytes(b"\x03")
    b'\x03'
    >>> bytes(3)
    b'\x00\x00\x00'

This PEP proposes that the current handling of integers in the bytes and bytearray constructors be deprecated in Python 3.5 and targeted for removal in Python 3.7, being replaced by two more explicit alternate constructors provided as class methods. The initial python-ideas thread [ideas-thread1]_ that spawned this PEP was specifically aimed at deprecating this constructor behaviour.

Firstly, a ``byte`` constructor is proposed that converts integers in the range 0 to 255 (inclusive) to a ``bytes`` object::

    >>> bytes.byte(3)
    b'\x03'
    >>> bytearray.byte(3)
    bytearray(b'\x03')
    >>> bytes.byte(512)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: bytes must be in range(0, 256)

One specific use case for this alternate constructor is to easily convert the result of indexing operations on ``bytes`` and other binary sequences from an integer to a ``bytes`` object. The documentation for this API should note that its counterpart for the reverse conversion is ``ord()``. The ``ord()`` documentation will also be updated to note that while ``chr()`` is the counterpart for ``str`` input, ``bytes.byte`` and ``bytearray.byte`` are the counterparts for binary input.

Secondly, a ``zeros`` constructor is proposed that serves as a direct replacement for the current constructor behaviour, rather than having to use sequence repetition to achieve the same effect in a less intuitive way::

    >>> bytes.zeros(3)
    b'\x00\x00\x00'
    >>> bytearray.zeros(3)
    bytearray(b'\x00\x00\x00')

The chosen name here is taken from the corresponding initialisation function in NumPy (although, as these are sequence types rather than N-dimensional matrices, the constructors take a length as input rather than a shape tuple).

While ``bytes.byte`` and ``bytearray.zeros`` are expected to be the more useful duo amongst the new constructors, ``bytes.zeros`` and ``bytearray.byte`` are provided in order to maintain API consistency between the two types.

Iteration
---------

While iteration over ``bytes`` objects and other binary sequences produces integers, it is sometimes desirable to iterate over length 1 bytes objects instead.

To handle this situation more obviously (and more efficiently) than would be the case with the ``map(bytes.byte, data)`` construct enabled by the above constructor changes, this PEP proposes the addition of a new ``iterbytes`` method to ``bytes``, ``bytearray`` and ``memoryview``::

    for x in data.iterbytes():
        # x is a length 1 ``bytes`` object, rather than an integer

Third party types and arbitrary containers of integers that lack the new method can still be handled by combining ``map`` with the new ``bytes.byte()`` alternate constructor proposed above::

    for x in map(bytes.byte, data):
        # x is a length 1 ``bytes`` object, rather than an integer
        # This works with *any* container of integers in the range
        # 0 to 255 inclusive

Open questions
^^^^^^^^^^^^^^

* The fallback case above suggests that this could perhaps be better handled as an ``iterbytes(data)`` *builtin*, that used ``data.__iterbytes__()`` if defined, but otherwise fell back to ``map(bytes.byte, data)``::

    for x in iterbytes(data):
        # x is a length 1 ``bytes`` object, rather than an integer
        # This works with *any* container of integers in the range
        # 0 to 255 inclusive

References
==========

.. [ideas-thread1] https://mail.python.org/pipermail/python-ideas/2014-March/027295.html
.. [empty-buffer-issue] http://bugs.python.org/issue20895
.. [GvR-initial-feedback] https://mail.python.org/pipermail/python-ideas/2014-March/027376.html

Copyright
=========

This document has been placed in the public domain.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
This feels chatty. I'd like the PEP to call out the specific proposals and put the more verbose motivation later.

It took me a long time to realize that you don't want to deprecate bytes([1, 2, 3]), but only bytes(3).

Also your mention of bytes.byte() as the counterpart to ord() confused me -- I think it's more similar to chr().

I don't like iterbytes as a builtin, let's keep it as a method on affected types.
-- --Guido van Rossum (python.org/~guido)
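The distinction drawn above between bytes(3) and bytes([1, 2, 3]), and the chr()/ord() analogy, can be checked against the constructors that exist today. The following is only an illustrative sketch using current Python 3 behaviour; the variable names are made up for the example, and the proposed bytes.byte() is not used because it does not exist.

```python
# Current Python 3 constructor behaviour (no proposed APIs are used here).
zero_filled = bytes(3)             # an integer means "this many zero bytes"
from_iterable = bytes([1, 2, 3])   # an iterable of integers is copied item by item
from_bytes = bytes(b"\x03")        # a bytes object is copied as-is

print(zero_filled)     # b'\x00\x00\x00'
print(from_iterable)   # b'\x01\x02\x03'
print(from_bytes)      # b'\x03'

# ord() maps a length 1 bytes object to an integer. Today the inverse for
# binary data is spelled bytes([x]), in the same way chr() inverts ord()
# for text.
assert ord(b"\x03") == 3
assert bytes([ord(b"\x03")]) == b"\x03"
assert chr(ord("a")) == "a"
```

Only the single-integer form is affected by the proposed deprecation; the iterable and bytes-like forms are unchanged.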
On 16 August 2014 03:48, Guido van Rossum <guido@python.org> wrote:
This feels chatty. I'd like the PEP to call out the specific proposals and put the more verbose motivation later.
I realised that some of that history was actually completely irrelevant now, so I culled a fair bit of it entirely.
It took me a long time to realize that you don't want to deprecate bytes([1, 2, 3]), but only bytes(3).
I've split out the four subproposals into their own sections, so hopefully this is clearer now.
Also your mention of bytes.byte() as the counterpart to ord() confused me -- I think it's more similar to chr().
This was just a case of me using the wrong word - I meant "inverse" rather than "counterpart".
I don't like iterbytes as a builtin, let's keep it as a method on affected types.
Done. I also added an explanation of the benefits it offers over the more generic "map(bytes.byte, data)", as well as more precise semantics for how it will work with memoryview objects.

New draft is live at http://www.python.org/dev/peps/pep-0467/, as well as being included inline below.

Regards,
Nick.

===================================
PEP: 467
Title: Minor API improvements for bytes and bytearray
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan <ncoghlan@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2014-03-30
Python-Version: 3.5
Post-History: 2014-03-30 2014-08-15 2014-08-16

Abstract
========

During the initial development of the Python 3 language specification, the core ``bytes`` type for arbitrary binary data started as the mutable type that is now referred to as ``bytearray``. Other aspects of operating in the binary domain in Python have also evolved over the course of the Python 3 series.

This PEP proposes four small adjustments to the APIs of the ``bytes``, ``bytearray`` and ``memoryview`` types to make it easier to operate entirely in the binary domain:

* Deprecate passing single integer values to ``bytes`` and ``bytearray``
* Add ``bytes.zeros`` and ``bytearray.zeros`` alternative constructors
* Add ``bytes.byte`` and ``bytearray.byte`` alternative constructors
* Add ``bytes.iterbytes``, ``bytearray.iterbytes`` and ``memoryview.iterbytes`` alternative iterators

Proposals
=========

Deprecation of current "zero-initialised sequence" behaviour
------------------------------------------------------------

Currently, the ``bytes`` and ``bytearray`` constructors accept an integer argument and interpret it as meaning to create a zero-initialised sequence of the given size::

    >>> bytes(3)
    b'\x00\x00\x00'
    >>> bytearray(3)
    bytearray(b'\x00\x00\x00')

This PEP proposes to deprecate that behaviour in Python 3.5, and remove it entirely in Python 3.6.

No other changes are proposed to the existing constructors.

Addition of explicit "zero-initialised sequence" constructors
-------------------------------------------------------------

To replace the deprecated behaviour, this PEP proposes the addition of an explicit ``zeros`` alternative constructor as a class method on both ``bytes`` and ``bytearray``::

    >>> bytes.zeros(3)
    b'\x00\x00\x00'
    >>> bytearray.zeros(3)
    bytearray(b'\x00\x00\x00')

It will behave just as the current constructors behave when passed a single integer.

The specific choice of ``zeros`` as the alternative constructor name is taken from the corresponding initialisation function in NumPy (although, as these are 1-dimensional sequence types rather than N-dimensional matrices, the constructors take a length as input rather than a shape tuple).

Addition of explicit "single byte" constructors
-----------------------------------------------

As binary counterparts to the text ``chr`` function, this PEP proposes the addition of an explicit ``byte`` alternative constructor as a class method on both ``bytes`` and ``bytearray``::

    >>> bytes.byte(3)
    b'\x03'
    >>> bytearray.byte(3)
    bytearray(b'\x03')

These methods will only accept integers in the range 0 to 255 (inclusive)::

    >>> bytes.byte(512)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: bytes must be in range(0, 256)

    >>> bytes.byte(1.0)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'float' object cannot be interpreted as an integer

The documentation of the ``ord`` builtin will be updated to explicitly note that ``bytes.byte`` is the inverse operation for binary data, while ``chr`` is the inverse operation for text data.

Behaviourally, ``bytes.byte(x)`` will be equivalent to the current ``bytes([x])`` (and similarly for ``bytearray``). The new spelling is expected to be easier to discover and easier to read (especially when used in conjunction with indexing operations on binary sequence types).

As a separate method, the new spelling will also work better with higher order functions like ``map``.

Addition of optimised iterator methods that produce ``bytes`` objects
---------------------------------------------------------------------

This PEP proposes that ``bytes``, ``bytearray`` and ``memoryview`` gain an optimised ``iterbytes`` method that produces length 1 ``bytes`` objects rather than integers::

    for x in data.iterbytes():
        # x is a length 1 ``bytes`` object, rather than an integer

The method can be used with arbitrary buffer exporting objects by wrapping them in a ``memoryview`` instance first::

    for x in memoryview(data).iterbytes():
        # x is a length 1 ``bytes`` object, rather than an integer

For ``memoryview``, the semantics of ``iterbytes()`` are defined such that::

    memview.tobytes() == b''.join(memview.iterbytes())

This allows the raw bytes of the memory view to be iterated over without needing to make a copy, regardless of the defined shape and format.

The main advantage this method offers over the ``map(bytes.byte, data)`` approach is that it is guaranteed *not* to fail midstream with a ``ValueError`` or ``TypeError``. By contrast, when using the ``map`` based approach, the type and value of the individual items in the iterable are only checked as they are retrieved and passed through the ``bytes.byte`` constructor.

Design discussion
=================

Why not rely on sequence repetition to create zero-initialised sequences?
-------------------------------------------------------------------------

Zero-initialised sequences can be created via sequence repetition::

    >>> b'\x00' * 3
    b'\x00\x00\x00'
    >>> bytearray(b'\x00') * 3
    bytearray(b'\x00\x00\x00')

However, this was also the case when the ``bytearray`` type was originally designed, and the decision was made to add explicit support for it in the type constructor. The immutable ``bytes`` type then inherited that feature when it was introduced in PEP 3137.

This PEP isn't revisiting that original design decision, just changing the spelling as users sometimes find the current behaviour of the binary sequence constructors surprising. In particular, there's a reasonable case to be made that ``bytes(x)`` (where ``x`` is an integer) should behave like the ``bytes.byte(x)`` proposal in this PEP. Providing both behaviours as separate class methods avoids that ambiguity.

References
==========

.. [1] Initial March 2014 discussion thread on python-ideas (https://mail.python.org/pipermail/python-ideas/2014-March/027295.html)
.. [2] Guido's initial feedback in that thread (https://mail.python.org/pipermail/python-ideas/2014-March/027376.html)
.. [3] Issue proposing moving zero-initialised sequences to a dedicated API (http://bugs.python.org/issue20895)
.. [4] Issue proposing to use calloc() for zero-initialised binary sequences (http://bugs.python.org/issue21644)
.. [5] August 2014 discussion thread on python-dev (https://mail.python.org/pipermail/python-ideas/2014-March/027295.html)

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
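The memoryview semantics described in the draft above can be approximated with existing APIs. What follows is a rough sketch only: the module-level function iterbytes() is a hypothetical stand-in for the proposed method, not something the PEP or the standard library defines, and it assumes the input exports a C-contiguous buffer so that memoryview.cast("B") is permitted.

```python
import array


def iterbytes(data):
    """Yield each raw byte of a buffer exporter as a length 1 bytes object.

    Hypothetical stand-in for the proposed ``iterbytes`` method; assumes
    ``data`` exports a C-contiguous buffer.
    """
    # Casting to format "B" views the raw memory as unsigned bytes,
    # whatever the original item format was.
    view = memoryview(data).cast("B")
    for i in range(len(view)):
        # Slicing a memoryview gives another view; bytes() copies one byte out.
        yield bytes(view[i:i + 1])


assert list(iterbytes(b"abc")) == [b"a", b"b", b"c"]

# The invariant stated in the draft, checked on a non-byte item format.
words = memoryview(array.array("H", [1, 2, 3]))
assert words.tobytes() == b"".join(iterbytes(words))
```

Because every item comes from a byte-sized slice of the underlying buffer, no per-item range or type check is needed while iterating, which is the property the draft contrasts with the map-based spelling.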
On 16/08/2014 01:17, Nick Coghlan wrote:
* Deprecate passing single integer values to ``bytes`` and ``bytearray``
I'm neutral. Ideally we wouldn't have made that mistake at the beginning.
* Add ``bytes.zeros`` and ``bytearray.zeros`` alternative constructors * Add ``bytes.byte`` and ``bytearray.byte`` alternative constructors * Add ``bytes.iterbytes``, ``bytearray.iterbytes`` and ``memoryview.iterbytes`` alternative iterators
+0.5. "iterbytes" isn't really great as a name. Regards Antoine.
On 15.08.14 08:50, Nick Coghlan wrote:
* add bytes.zeros() and bytearray.zeros() as a replacement
b'\0' * n and bytearray(b'\0') * n look like good replacements to me. No need to learn a new method. And it works right now.
* add bytes.iterbytes(), bytearray.iterbytes() and memoryview.iterbytes()
What are the use cases for this? I suppose the main use case may be writing code compatible with both 2.7 and 3.x. But in that case you need a wrapper (because these types in 2.7 do not have the iterbytes() method). And how large would the advantage of this method be over ``map(bytes.byte, data)``?
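One difference from the map-based spelling, mentioned in the revised draft, is when bad items are detected. The sketch below illustrates that with today's Python; byte() is a hypothetical stand-in for the proposed bytes.byte(), implemented with the existing bytes([x]) spelling, and the sample data is invented.

```python
def byte(x):
    """Hypothetical stand-in for the proposed bytes.byte(x)."""
    return bytes([x])


data = [65, 66, 300, 67]   # 300 is outside range(256)
seen = []
try:
    for item in map(byte, data):
        seen.append(item)
except ValueError:
    # map() is lazy, so the bad value is only noticed once it is reached,
    # after the earlier items have already been processed.
    pass

print(seen)   # [b'A', b'B']
```

A consumer that writes each item to a socket or file as it goes would have emitted partial output before the error surfaced.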
2014-08-15 21:54 GMT+02:00 Serhiy Storchaka <storchaka@gmail.com>:
On 15.08.14 08:50, Nick Coghlan wrote:
* add bytes.zeros() and bytearray.zeros() as a replacement
b'\0' * n and bytearray(b'\0') * n look like good replacements to me. No need to learn a new method. And it works right now.
FYI there is a pending patch for bytearray(int) to use calloc() instead of malloc(). It's faster for buffers larger than 1 MB: http://bugs.python.org/issue21644

I'm not sure that the optimization is really useful.

Victor
2014-08-15 7:50 GMT+02:00 Nick Coghlan <ncoghlan@gmail.com>:
As far as I am aware, that last item poses the only open question, with the alternative being to add an "iterbytes" builtin (...)
Do you have examples of use cases for a builtin function? I only found 5 usages of the bytes((byte,)) constructor in the standard library:

    $ grep -E 'bytes\(\([^)]+, *\)\)' $(find -name "*.py")
    ./Lib/quopri.py: c = bytes((c,))
    ./Lib/quopri.py: c = bytes((c,))
    ./Lib/base64.py: b32tab = [bytes((i,)) for i in _b32alphabet]
    ./Lib/base64.py: _a85chars = [bytes((i,)) for i in range(33, 118)]
    ./Lib/base64.py: _b85chars = [bytes((i,)) for i in _b85alphabet]

bytes.iterbytes() can be used in 4 cases out of 5. Adding a new builtin for a single line in the whole standard library doesn't look right.

Victor
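For reference, the list-building idiom in the grep output above can already be written without constructing a one-item tuple, by slicing instead of indexing. This is only an illustration of the existing alternatives; the alphabet literal is sample data chosen for the example, not the actual table from Lib/base64.py.

```python
# Sample data: the RFC 4648 base32 alphabet, used purely for illustration.
alphabet = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"

# The idiom found by the grep: index to an int, then wrap it in a tuple.
via_tuple = [bytes((i,)) for i in alphabet]

# Slicing a bytes object yields bytes, so no int round trip is needed.
via_slice = [alphabet[i:i + 1] for i in range(len(alphabet))]

assert via_tuple == via_slice
assert via_tuple[0] == b"A"
```

The slicing form only works for sequences that support slicing, so it does not cover the range(33, 118) case in the grep output.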
On Aug 14, 2014, at 10:50 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Key points in the proposal:
* deprecate passing integers to bytes() and bytearray()
I'm opposed to removing this part of the API. It has proven useful and the alternative isn't very nice. Declaring the size of fixed length arrays is not a new concept and is widely adopted in other languages. One principal use case for the bytearray is creating and manipulating binary data. Initializing to zero is a common operation and should remain part of the core API (consider why we now have list.copy() even though copying with a slice remains possible and efficient).

I and my clients have taken advantage of this feature and it reads nicely. The proposed deprecation would break our code and not actually make anything better.

Another thought is that the core devs should be very reluctant to deprecate anything we don't have to while the 2 to 3 transition is still in progress. Every new deprecation of APIs that existed in Python 2.7 just adds another obstacle to converting code. Individually, the differences are trivial. Collectively, they present a good reason to never migrate code to Python 3.

Raymond
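The fixed-length buffer use case described above looks roughly like the following in current Python 3. This is a minimal sketch with an invented record layout (a 16-bit and a 32-bit little-endian field); it uses only the existing bytearray(n) constructor and struct.pack_into().

```python
import struct

# Preallocate an 8-byte zero-filled buffer using the existing constructor.
buf = bytearray(8)

# The repetition spelling discussed in the thread gives the same result.
assert buf == bytearray(b"\x00") * 8

# Fill in a made-up record in place: "<HI" is a little-endian unsigned
# 16-bit value followed by an unsigned 32-bit value, with no padding.
struct.pack_into("<HI", buf, 0, 0x1234, 0xDEADBEEF)

print(buf.hex())   # 3412efbeadde0000
```

The last two bytes stay zero because the packed fields occupy only six of the eight preallocated bytes.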
On 17 August 2014 18:13, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
On Aug 14, 2014, at 10:50 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Key points in the proposal:
* deprecate passing integers to bytes() and bytearray()
I'm opposed to removing this part of the API. It has proven useful and the alternative isn't very nice. Declaring the size of fixed length arrays is not a new concept and is widely adopted in other languages. One principal use case for the bytearray is creating and manipulating binary data. Initializing to zero is a common operation and should remain part of the core API (consider why we now have list.copy() even though copying with a slice remains possible and efficient).
That's why the PEP proposes adding a "zeros" method, based on the name of the corresponding NumPy construct. The status quo has some very ugly failure modes when an integer is passed unexpectedly, and tries to create a large buffer, rather than throwing a type error.
I and my clients have taken advantage of this feature and it reads nicely.
If I see "bytearray(10)" there is nothing there that suggests "this creates an array of length 10 and initialises it to zero" to me. I'd be more inclined to guess it would be equivalent to "bytearray([10])". "bytearray.zeros(10)", on the other hand, is relatively clear, independently of user expectations.
The proposed deprecation would break our code and not actually make anything better.
Another thought is that the core devs should be very reluctant to deprecate anything we don't have to while the 2 to 3 transition is still in progress. Every new deprecation of APIs that existed in Python 2.7 just adds another obstacle to converting code. Individually, the differences are trivial. Collectively, they present a good reason to never migrate code to Python 3.
This is actually one of the inconsistencies between the Python 2 and 3 binary APIs:

    Python 2.7.5 (default, Jun 25 2014, 10:19:55)
    [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> bytes(10)
    '10'
    >>> bytearray(10)
    bytearray(b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
Users wanting well-behaved binary sequences in Python 2.7 would be well advised to use the "future" module to get a full backport of the actual Python 3 bytes type, rather than the approximation that is the 8-bit str in Python 2. And once they do that, they'll be able to track the evolution of the Python 3 binary sequence behaviour without any further trouble.

That said, I don't really mind how long the deprecation cycle is. I'd be fine with fully supporting both in 3.5 (2015), deprecating the main constructor in favour of the explicit zeros() method in 3.6 (2017) and dropping the legacy behaviour in 3.7 (2018).

Regards,
Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
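The failure mode mentioned in this message, where an integer reaches the constructor unexpectedly and silently produces a zero-filled buffer rather than an error, can be reproduced with current Python 3. In the sketch below, as_bytes() is a hypothetical helper written for the example, not part of the PEP or the standard library; it shows one way calling code could guard against the surprise today.

```python
def as_bytes(value):
    """Hypothetical strict converter that rejects bare integers.

    bytes(value) would otherwise treat an int as a length and return
    that many zero bytes instead of raising an error.
    """
    if isinstance(value, int):
        raise TypeError("expected a bytes-like object or iterable of ints, got int")
    return bytes(value)


# Status quo: an unexpected integer is silently accepted as a length.
assert bytes(5) == b"\x00\x00\x00\x00\x00"

# Wrapping the value in a list gives the "single byte" reading instead.
assert as_bytes([5]) == b"\x05"

# The strict helper turns the accidental case into an immediate error.
try:
    as_bytes(5)
except TypeError:
    rejected = True
else:
    rejected = False
assert rejected
```

With a large integer, the unguarded call would attempt to allocate a buffer of that size, which is the "large buffer" case the message refers to.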
On Aug 17, 2014, at 1:41 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
If I see "bytearray(10)" there is nothing there that suggests "this creates an array of length 10 and initialises it to zero" to me. I'd be more inclined to guess it would be equivalent to "bytearray([10])".
"bytearray.zeros(10)", on the other hand, is relatively clear, independently of user expectations.
Zeros would have been great but that should have been done originally. The time to get API design right is at inception. Now, you're just breaking code and invalidating any published examples.
Another thought is that the core devs should be very reluctant to deprecate anything we don't have to while the 2 to 3 transition is still in progress. Every new deprecation of APIs that existed in Python 2.7 just adds another obstacle to converting code. Individually, the differences are trivial. Collectively, they present a good reason to never migrate code to Python 3.
This is actually one of the inconsistencies between the Python 2 and 3 binary APIs:
However, bytearray(n) is the same in both Python 2 and Python 3. Changing it in Python 3 increases the gulf between the two.

The further we let Python 3 diverge from Python 2, the less likely that people will convert their code and the harder you make it to write code that runs under both.

FWIW, I've been teaching Python full time for three years. I cover the use of bytearray(n) in my classes and not a single person out of 3000+ engineers has had a problem with it. I seriously question the PEP's assertion that there is a real problem to be solved (i.e. that people are baffled by bytearray(bufsiz)) and that the problem is sufficiently painful to warrant the headaches that go along with API changes.

The other proposal to add bytearray.byte(3) should probably be named bytearray.from_byte(3) for clarity. That said, I question whether there is actually a use case for this. I have never seen code that has a need to create a byte array of length one from a single integer. For the most part, the API will be easiest to learn if it matches what we do for lists and for array.array.

Sorry Nick, but I think you're making the API worse instead of better. This API isn't perfect but it isn't flat-out broken either. There is some unfortunate asymmetry between bytes() and bytearray() in Python 2, but that ship has sailed. The current API for Python 3 is pretty good (though there is still a tension between wanting to be like lists and like strings both at the same time).

Raymond

P.S. The most important problem in the Python world now is getting Python 2 users to adopt Python 3. The core devs need to develop a strong distaste for anything that makes that problem harder.
On Aug 17, 2014, at 1:07 PM, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
On Aug 17, 2014, at 1:41 AM, Nick Coghlan <ncoghlan@gmail.com <mailto:ncoghlan@gmail.com>> wrote:
If I see "bytearray(10)" there is nothing there that suggests "this creates an array of length 10 and initialises it to zero" to me. I'd be more inclined to guess it would be equivalent to "bytearray([10])".
"bytearray.zeros(10)", on the other hand, is relatively clear, independently of user expectations.
Zeros would have been great but that should have been done originally. The time to get API design right is at inception. Now, you're just breaking code and invalidating any published examples.
Another thought is that the core devs should be very reluctant to deprecate anything we don't have to while the 2 to 3 transition is still in progress. Every new deprecation of APIs that existed in Python 2.7 just adds another obstacle to converting code. Individually, the differences are trivial. Collectively, they present a good reason to never migrate code to Python 3.
This is actually one of the inconsistencies between the Python 2 and 3 binary APIs:
However, bytearray(n) is the same in both Python 2 and Python 3. Changing it in Python 3 increases the gulf between the two.
The further we let Python 3 diverge from Python 2, the less likely that people will convert their code and the harder you make it to write code that runs under both.
FWIW, I've been teaching Python full time for three years. I cover the use of bytearray(n) in my classes and not a single person out of 3000+ engineers has had a problem with it. I seriously question the PEP's assertion that there is a real problem to be solved (i.e. that people are baffled by bytearray(bufsiz)) and that the problem is sufficiently painful to warrant the headaches that go along with API changes.
The other proposal to add bytearray.byte(3) should probably be named bytearray.from_byte(3) for clarity. That said, I question whether there is actually a use case for this. I have never seen code that has a need to create a byte array of length one from a single integer. For the most part, the API will be easiest to learn if it matches what we do for lists and for array.array.
Sorry Nick, but I think you're making the API worse instead of better. This API isn't perfect but it isn't flat-out broken either. There is some unfortunate asymmetry between bytes() and bytearray() in Python 2, but that ship has sailed. The current API for Python 3 is pretty good (though there is still a tension between wanting to be like lists and like strings both at the same time).
Raymond
P.S. The most important problem in the Python world now is getting Python 2 users to adopt Python 3. The core devs need to develop a strong distaste for anything that makes that problem harder.
For the record I’ve had all of the problems that Nick states and I’m +1 on this change.

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
On Aug 17, 2014, at 11:33 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
I've had many of the problems Nick states and I'm also +1.
There are two code snippets below which were taken from the standard library. Are you saying that:

1) you don't understand the code (as the pep suggests)
2) you are willing to break that code and everything like it
3) and it would be more elegantly expressed as:

    charmap = bytearray.zeros(256)

and

    mapping = bytearray.zeros(256)

At work, I have network engineers creating IPv4 headers and other structures with bytearrays initialized to zeros. Do you really want to break all their code? Nowhere else in Python do we create buffers that way. Code like "msg, who = s.recvfrom(256)" is the norm.

Also, it is unclear if you're saying that you have an actual use case for this part of the proposal?

    ba = bytearray.byte(65)

And that the code would be better, clearer, and faster than the currently working form?

    ba = bytearray([65])

Does there really need to be a special case for constructing a single byte? To me, that is akin to proposing "list.from_int(65)" as an important special case to replace "[65]".

If you must muck with the ever changing bytes() API, then please leave the bytearray() API alone. I think we should show some respect for code that is currently working and is cleanly expressible in both Python 2 and Python 3. We aren't winning users with API churn.

FWIW, I'm guessing that the differing viewpoints in the thread stem mainly from the proponents' experiences with bytes() rather than from experience with bytearray(), which doesn't seem to have any usage problems in the wild. I've never seen a developer say they didn't understand what "buf = bytearray(1024)" means. That is not an actual problem that needs solving (or breaking).

What may be an actual problem is code like "char = bytes(1024)", though I'm unclear what a user might have actually been trying to do with code like that.
Raymond

----------- excerpts from Lib/sre_compile.py ---------------

    charmap = bytearray(256)
    for op, av in charset:
        while True:
            try:
                if op is LITERAL:
                    charmap[fixup(av)] = 1
                elif op is RANGE:
                    for i in range(fixup(av[0]), fixup(av[1])+1):
                        charmap[i] = 1
                elif op is NEGATE:
                    out.append((op, av))
                else:
                    tail.append((op, av))

    ...

    charmap = bytes(charmap) # should be hashable
    comps = {}
    mapping = bytearray(256)
    block = 0
    data = bytearray()
    for i in range(0, 65536, 256):
        chunk = charmap[i: i + 256]
        if chunk in comps:
            mapping[i // 256] = comps[chunk]
        else:
            mapping[i // 256] = comps[chunk] = block
            block += 1
            data += chunk
    data = _mk_bitmap(data)
    data[0:0] = [block] + _bytes_to_codes(mapping)
    out.append((BIGCHARSET, data))
    out += tail
    return out
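As an illustration of the networking use case Raymond mentions, the following is a rough sketch of filling a zero-initialised bytearray in place with struct.pack_into. The field values and the choice of which header fields to set are assumptions made for the example, not code from the thread.

```python
import struct

# A zero-filled 20-byte buffer: the size of an IPv4 header without options.
header = bytearray(20)

# Fill selected fields in place, in network byte order.
# Offset 0: version/IHL byte, DSCP/ECN byte, 16-bit total length.
struct.pack_into("!BBH", header, 0, (4 << 4) | 5, 0, 20)
# Offset 8: TTL and protocol number (6 is TCP).
struct.pack_into("!BB", header, 8, 64, 6)

print(len(header), header[0], header[8], header[9])
```

Every field that is not written explicitly stays zero, which is why a zero-filled mutable buffer is a convenient starting point here.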
On Aug 17, 2014, at 5:19 PM, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
On Aug 17, 2014, at 11:33 AM, Ethan Furman <ethan@stoneleaf.us <mailto:ethan@stoneleaf.us>> wrote:
I've had many of the problems Nick states and I'm also +1.
There are two code snippets below which were taken from the standard library. Are you saying that:
1) you don't understand the code (as the pep suggests)
2) you are willing to break that code and everything like it
3) and it would be more elegantly expressed as: charmap = bytearray.zeros(256) and mapping = bytearray.zeros(256)
At work, I have network engineers creating IPv4 headers and other structures with bytearrays initialized to zeros. Do you really want to break all their code? Nowhere else in Python do we create buffers that way. Code like "msg, who = s.recvfrom(256)" is the norm.
Also, it is unclear if you're saying that you have an actual use case for this part of the proposal?
ba = bytearray.byte(65)
And that the code would be better, clearer, and faster than the currently working form?
ba = bytearray([65])
Does there really need to be a special case for constructing a single byte? To me, that is akin to proposing "list.from_int(65)" as an important special case to replace "[65]".
If you must muck with the ever changing bytes() API, then please leave the bytearray() API alone. I think we should show some respect for code that is currently working and is cleanly expressible in both Python 2 and Python 3. We aren't winning users with API churn.
FWIW, I'm guessing that the differing viewpoints in the thread stem mainly from the proponents' experiences with bytes() rather than from experience with bytearray(), which doesn't seem to have any usage problems in the wild. I've never seen a developer say they didn't understand what "buf = bytearray(1024)" means. That is not an actual problem that needs solving (or breaking).
What may be an actual problem is code like "char = bytes(1024)" though I'm unclear what a user might have actually been trying to do with code like that.
I think this is probably correct. I generally don’t think that bytes(1024) makes much sense at all, especially not as a default constructor. Most likely it exists to be similar to bytearray().

I don't have a specific problem with bytearray(1024), though I do think it's more elegantly and clearly described as bytearray.zeros(1024), but not by much.

I find bytes.byte()/bytearray.byte() to be needed as long as there isn't a simple way to iterate over a bytes or bytearray in a way that yields bytes or bytearrays instead of integers. To be honest I can't think of a time when I'd actually *want* to iterate over a bytes/bytearray as integers, although I realize there is unlikely to be a reasonable method to change that now. If iterbytes is added I'm not sure where I'd personally use either bytes.byte() or bytearray.byte().

In general though I think that overloading a single constructor method to do something conceptually different based on the type of the parameter leads to these kinds of confusing scenarios and that having differently named constructors for the different concepts is far clearer.

So given all that, I am:

* +10000 for some method of iterating over both types as bytes instead of integers.
* +1 on adding .zeros to both types as an alternative and preferred method of creating a zero filled instance and deprecating the original method[1].
* -0 on adding .byte to both types as an alternative method of creating a single byte instance.
* -1 on changing the meaning of bytearray(1024).
* +/-0 on changing the meaning of bytes(1024). I think that bytes(1024) is likely to *not* be what someone wants and that what they really want is bytes([N]). I also think that the number one reason for someone to be doing bytes(N) is because they were attempting to iterate over a bytes or bytearray object and they got an integer. I also think that it's bad that this changes from 2.x to 3.x and I wish it hadn't. However I can't decide if it's worth reverting this at this time or not.

[1] By deprecating I mean raising a deprecation warning or something similar; my thoughts on actually removing the other methods are listed explicitly above.

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
On 08/17/2014 02:19 PM, Raymond Hettinger wrote:
On Aug 17, 2014, at 11:33 AM, Ethan Furman wrote:
I've had many of the problems Nick states and I'm also +1.
There are two code snippets below which were taken from the standard library.
[...]

My issues are with 'bytes', not 'bytearray'. 'bytearray(10)' actually makes sense. I certainly have no problem with bytearray and bytes not being exactly the same.

My primary issues with bytes are not being able to do b'abc'[2] == b'c', and not being able to do x = b'abc'[2]; y = bytes(x); assert y == b'c'.

And because of the backwards compatibility issues I would deprecate, because we have a new 'better' way, but not remove, the current functionality.

I pretty much agree exactly with what Donald Stufft said about it.

--
~Ethan~
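The two failure modes Ethan describes can be reproduced directly. This is a small sketch assuming Python 3 semantics, followed by the spellings that do produce a length-1 bytes object today.

```python
data = b"abc"

x = data[2]                 # indexing yields an int in Python 3
print(x)                    # 99
print(x == b"c")            # False: an int never equals a bytes object

# bytes(int) builds a zero-filled object of that length, not a single byte.
y = bytes(x)
print(len(y), y == b"c")    # 99 False

# Spellings that give a length-1 bytes object:
print(data[2:3] == b"c")    # True (slicing keeps the type)
print(bytes([x]) == b"c")   # True (one-element iterable)
```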
Donald Stufft <donald <at> stufft.io> writes:
For the record I’ve had all of the problems that Nick states and I’m +1 on this change.
--- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
I've hit basically every problem everyone here has stated, and in no uncertain terms am I completely opposed to deprecating anything. The Python 2 to 3 migration is already hard enough, and already proceeding far too slowly for many of our tastes. Making that migration even more complex would drive me to the point of giving up. Alex
On Sun, Aug 17, 2014 at 7:14 PM, Alex Gaynor <alex.gaynor@gmail.com> wrote:
I've hit basically every problem everyone here has stated, and in no uncertain terms am I completely opposed to deprecating anything. The Python 2 to 3 migration is already hard enough, and already proceeding far too slowly for many of our tastes. Making that migration even more complex would drive me to the point of giving up.
Could you elaborate what problems you are thinking this will cause for you?

It seems to me that avoiding a bug-prone API is not particularly complex, and moving it back to its 2.x semantics or making it not work entirely, rather than making it work differently, would make porting applications easier.

If, during porting to 3.x, you find a deprecation warning for bytes(n), then rather than being annoying code churny extra changes, this is actually a bug that's been identified. So it's helpful even during the deprecation period.

-- Devin
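To make Devin's point concrete, here is a hypothetical sketch of how a deprecation warning on the integer form could surface a porting bug. The checked_bytes function and its message are invented for illustration and are not part of any Python release or of the PEP text.

```python
import warnings

def checked_bytes(arg=b""):
    """Hypothetical stand-in for bytes() that warns on an integer argument."""
    if isinstance(arg, int):
        warnings.warn(
            "bytes(int) is deprecated; request a zero-filled object explicitly",
            DeprecationWarning,
            stacklevel=2,
        )
    return bytes(arg)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    for b in b"hi":
        # Ported 2.x code that expected a length-1 string from iteration.
        checked_bytes(b)

print([str(w.message) for w in caught])
```

Each warning here points at a call that silently produced a long run of zero bytes instead of the single character the 2.x code was written for.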
Le 17/08/2014 13:07, Raymond Hettinger a écrit :
FWIW, I've been teaching Python full time for three years. I cover the use of bytearray(n) in my classes and not a single person out of 3000+ engineers has had a problem with it.
This is less about bytearray() than bytes(), IMO. bytearray() is sufficiently specialized that only experienced people will encounter it. And while preallocating a bytearray of a certain size makes sense, it's completely pointless for a bytes object.

Regards

Antoine.
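Antoine's distinction can be seen in a short sketch (Python 3 assumed): a preallocated bytearray can be filled in place afterwards, while a zero-filled bytes object can never be changed.

```python
# Preallocate a mutable buffer, then fill part of it in place.
buf = bytearray(8)
buf[:4] = b"DATA"
print(buf)   # bytearray(b'DATA\x00\x00\x00\x00')

# The same constructor form on bytes gives zeros that cannot be modified.
frozen = bytes(8)
try:
    frozen[0] = 1
    mutated = True
except TypeError:
    mutated = False
print(mutated)   # False
```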
On 18 Aug 2014 03:07, "Raymond Hettinger" <raymond.hettinger@gmail.com> wrote:
On Aug 17, 2014, at 1:41 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
If I see "bytearray(10)" there is nothing there that suggests "this creates an array of length 10 and initialises it to zero" to me. I'd be more inclined to guess it would be equivalent to "bytearray([10])".
"bytearray.zeros(10)", on the other hand, is relatively clear, independently of user expectations.
Zeros would have been great but that should have been done originally. The time to get API design right is at inception. Now, you're just breaking code and invalidating any published examples.
I'm fine with postponing the deprecation elements indefinitely (or just deprecating bytes(int) and leaving bytearray(int) alone).
Another thought is that the core devs should be very reluctant to deprecate anything we don't have to while the 2 to 3 transition is still in progress.
Every new deprecation of APIs that existed in Python 2.7 just adds another obstacle to converting code. Individually, the differences are trivial. Collectively, they present a good reason to never migrate code to Python 3.
This is actually one of the inconsistencies between the Python 2 and 3 binary APIs:
However, bytearray(n) is the same in both Python 2 and Python 3. Changing it in Python 3 increases the gulf between the two.
The further we let Python 3 diverge from Python 2, the less likely that people will convert their code and the harder you make it to write code that runs under both.
FWIW, I've been teaching Python full time for three years. I cover the use of bytearray(n) in my classes and not a single person out of 3000+ engineers has had a problem with it. I seriously question the PEP's assertion that there is a real problem to be solved (i.e. that people are baffled by bytearray(bufsiz)) and that the problem is sufficiently painful to warrant the headaches that go along with API changes.
Yes, I'd expect engineers and networking folks to be fine with it. It isn't how this mode of the constructor *works* that worries me, it's how it *fails* (i.e. silently producing unexpected data rather than a type error). Purely deprecating the bytes case and leaving bytearray alone would likely address my concerns.
The other proposal to add bytearray.byte(3) should probably be named bytearray.from_byte(3) for clarity. That said, I question whether there is actually a use case for this. I have never seen code that has a need to create a byte array of length one from a single integer. For the most part, the API will be easiest to learn if it matches what we do for lists and for array.array.
This part of the proposal came from a few things:

* many of the bytes and bytearray methods only accept bytes-like objects, but iteration and indexing produce integers
* to mitigate the impact of the above, some (but not all) bytes and bytearray methods now accept integers in addition to bytes-like objects
* ord() in Python 3 is only documented as accepting length 1 strings, but also accepts length 1 bytes-like objects

Adding bytes.byte() makes it practical to document the binary half of ord's behaviour, and eliminates any temptation to expand the "also accepts integers" behaviour out to more types.

bytes.byte() thus becomes the binary equivalent of chr(), just as Python 2 had both chr() and unichr().

I don't recall ever needing chr() in a real program either, but I still consider it an important part of clearly articulating the data model.
Sorry Nick, but I think you're making the API worse instead of better. This API isn't perfect but it isn't flat-out broken either. There is some unfortunate asymmetry between bytes() and bytearray() in Python 2, but that ship has sailed. The current API for Python 3 is pretty good (though there is still a tension between wanting to be like lists and like strings both at the same time).
Yes. It didn't help that the docs previously expected readers to infer the behaviour of the binary sequence methods from the string documentation - while the new docs could still use some refinement, I've at least addressed that part of the problem.

Cheers,
Nick.
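The asymmetries Nick lists in this message can be checked interactively. The following sketch assumes Python 3.3 or later and uses only existing builtins; the variable names are illustrative.

```python
data = b"abc"

code_from_str = ord("a")        # 97
code_from_bytes = ord(b"a")     # 97: ord() also accepts length-1 bytes
as_str = chr(97)                # 'a'
as_bytes = bytes([97])          # b'a', today's spelling of the inverse for binary data

# Some operations accept an integer as well as a bytes-like object...
contains_int = 98 in data       # True
found_int = data.find(98)       # 1
found_bytes = data.find(b"b")   # 1

# ...but others still insist on bytes-like arguments.
try:
    data.startswith(98)
    accepted = True
except TypeError:
    accepted = False

print(code_from_str, code_from_bytes, as_str, as_bytes)
print(contains_int, found_int, found_bytes, accepted)
```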
On Aug 17, 2014, at 4:08 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Purely deprecating the bytes case and leaving bytearray alone would likely address my concerns.
That is good progress. Thanks :-) Would a warning for the bytes case suffice, or do you need an actual deprecation?
bytes.byte() thus becomes the binary equivalent of chr(), just as Python 2 had both chr() and unichr().
I don't recall ever needing chr() in a real program either, but I still consider it an important part of clearly articulating the data model.
"I don't recall having ever needed this" greatly weakens the premise that this is needed :-)

The APIs have been around since 2.6 and AFAICT there has been zero demonstrated need for a special case for a single byte. We already have a perfectly good spelling:

    NUL = bytes([0])

The Zen tells us we really don't need a second way to do it (actually a third since you can also write b'\x00') and it suggests that this special case isn't special enough.

I encourage restraint against adding an unneeded class method that has no parallel elsewhere. Right now, the learning curve is mitigated because bytes is very str-like and because bytearray is list-like (i.e. the method names have been used elsewhere and likely already learned before encountering bytes() or bytearray()). Putting in a new, rarely used, funky method adds to the learning burden.

If you do press forward with adding it (and I don't see why), then as an alternate constructor, the name should be from_int() or some such to avoid ambiguity and to make clear that it is a class method.
iterbytes() isn't especially attractive as a method name, but it's far more explicit about its purpose.
I concur. In this case, explicitness matters. Raymond
On 18 Aug 2014 09:41, "Raymond Hettinger" <raymond.hettinger@gmail.com> wrote:
I encourage restraint against adding an unneeded class method that has no parallel elsewhere. Right now, the learning curve is mitigated because bytes is very str-like and because bytearray is list-like (i.e. the method names have been used elsewhere and likely already learned before encountering bytes() or bytearray()). Putting in a new, rarely used, funky method adds to the learning burden.
If you do press forward with adding it (and I don't see why), then as an alternate constructor, the name should be from_int() or some such to avoid ambiguity and to make clear that it is a class method.
If I remember the sequence of events correctly, I thought of map(bytes.byte, data) first, and then Guido suggested a dedicated iterbytes() method later.

The step I hadn't taken (until now) was realising that the new memoryview(data).iterbytes() capability actually combines with the existing (bytes([b]) for b in data) to make the original bytes.byte idea unnecessary.

Cheers,
Nick.
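The existing generator-expression spelling Nick refers to already works across the three binary sequence types. Below is a minimal sketch; the helper name iter_as_bytes is hypothetical, and no iterbytes() method is assumed to exist.

```python
def iter_as_bytes(data):
    """Hypothetical helper: yield each element of a binary sequence as length-1 bytes."""
    return (bytes([b]) for b in data)

from_bytes = list(iter_as_bytes(b"abc"))
from_bytearray = list(iter_as_bytes(bytearray(b"abc")))
from_view = list(iter_as_bytes(memoryview(b"abc")))

print(from_bytes)   # [b'a', b'b', b'c']
print(from_bytes == from_bytearray == from_view)   # True
```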
Le 17/08/2014 19:41, Raymond Hettinger a écrit :
The APIs have been around since 2.6 and AFAICT there have been zero demonstrated need for a special case for a single byte. We already have a perfectly good spelling: NUL = bytes([0])
That is actually a very cumbersome spelling. Why should I first create a one-element list in order to create a one-byte bytes object?
The Zen tells us we really don't need a second way to do it (actually a third since you can also write b'\x00') and it suggests that this special case isn't special enough.
b'\x00' is obviously the right way to do it in this case, but we're concerned about the non-constant case. The reason to instantiate bytes from a non-constant integer comes from the unfortunate indexing and iteration behaviour of bytes objects.

Regards

Antoine.
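For the non-constant case Antoine raises, these are the spellings available in current Python 3 for turning a run-time integer into a length-1 bytes object. This is a sketch for comparison only, not a recommendation from the thread.

```python
import struct

n = 0x41  # stands in for a value only known at run time

via_list = bytes([n])                # one-element list
via_tuple = bytes((n,))              # one-element tuple
via_to_bytes = n.to_bytes(1, "big")  # int method, explicit length and byte order
via_struct = struct.pack("B", n)     # unsigned char format

print(via_list, via_tuple, via_to_bytes, via_struct)
```

All four require wrapping or extra arguments, which is the cumbersomeness being discussed.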
On Aug 17, 2014, at 09:39 PM, Antoine Pitrou wrote:
need for a special case for a single byte. We already have a perfectly good spelling: NUL = bytes([0])
That is actually a very cumbersome spelling. Why should I first create a one-element list in order to create a one-byte bytes object?
I feel the same way every time I have to write `set(['foo'])`. -Barry
On Sun, Aug 17, 2014 at 8:52 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
On 08/17/2014 04:08 PM, Nick Coghlan wrote:
I'm fine with postponing the deprecation elements indefinitely (or just deprecating bytes(int) and leaving bytearray(int) alone).
+1 on both pieces.
Perhaps postpone the deprecation to Python 4000 ;)
participants (12)
- Alex Gaynor
- Antoine Pitrou
- Barry Warsaw
- Devin Jeanpierre
- Donald Stufft
- Ethan Furman
- Guido van Rossum
- Ian Cordasco
- Nick Coghlan
- Raymond Hettinger
- Serhiy Storchaka
- Victor Stinner