PEP 467: Minor API improvements to bytes, bytearray, and memoryview

Minor changes: updated version numbers, add punctuation.

The current text seems to take into account Guido's last comments.

Thoughts before asking for acceptance?

PEP: 467
Title: Minor API improvements for binary sequences
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan <ncoghlan@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2014-03-30
Python-Version: 3.5
Post-History: 2014-03-30 2014-08-15 2014-08-16


Abstract
========

During the initial development of the Python 3 language specification, the core ``bytes`` type for arbitrary binary data started as the mutable type that is now referred to as ``bytearray``. Other aspects of operating in the binary domain in Python have also evolved over the course of the Python 3 series.

This PEP proposes four small adjustments to the APIs of the ``bytes``, ``bytearray`` and ``memoryview`` types to make it easier to operate entirely in the binary domain:

* Deprecate passing single integer values to ``bytes`` and ``bytearray``
* Add ``bytes.zeros`` and ``bytearray.zeros`` alternative constructors
* Add ``bytes.byte`` and ``bytearray.byte`` alternative constructors
* Add ``bytes.iterbytes``, ``bytearray.iterbytes`` and ``memoryview.iterbytes`` alternative iterators


Proposals
=========

Deprecation of current "zero-initialised sequence" behaviour
------------------------------------------------------------

Currently, the ``bytes`` and ``bytearray`` constructors accept an integer argument and interpret it as meaning to create a zero-initialised sequence of the given size::

    >>> bytes(3)
    b'\x00\x00\x00'
    >>> bytearray(3)
    bytearray(b'\x00\x00\x00')

This PEP proposes to deprecate that behaviour in Python 3.6, and remove it entirely in Python 3.7.

No other changes are proposed to the existing constructors.

Addition of explicit "zero-initialised sequence" constructors
-------------------------------------------------------------

To replace the deprecated behaviour, this PEP proposes the addition of an explicit ``zeros`` alternative constructor as a class method on both ``bytes`` and ``bytearray``::

    >>> bytes.zeros(3)
    b'\x00\x00\x00'
    >>> bytearray.zeros(3)
    bytearray(b'\x00\x00\x00')

It will behave just as the current constructors behave when passed a single integer.

The specific choice of ``zeros`` as the alternative constructor name is taken from the corresponding initialisation function in NumPy (although, as these are 1-dimensional sequence types rather than N-dimensional matrices, the constructors take a length as input rather than a shape tuple).


Addition of explicit "single byte" constructors
-----------------------------------------------

As binary counterparts to the text ``chr`` function, this PEP proposes the addition of an explicit ``byte`` alternative constructor as a class method on both ``bytes`` and ``bytearray``::

    >>> bytes.byte(3)
    b'\x03'
    >>> bytearray.byte(3)
    bytearray(b'\x03')

These methods will only accept integers in the range 0 to 255 (inclusive)::

    >>> bytes.byte(512)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: bytes must be in range(0, 256)

    >>> bytes.byte(1.0)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'float' object cannot be interpreted as an integer

The documentation of the ``ord`` builtin will be updated to explicitly note that ``bytes.byte`` is the inverse operation for binary data, while ``chr`` is the inverse operation for text data.

Behaviourally, ``bytes.byte(x)`` will be equivalent to the current ``bytes([x])`` (and similarly for ``bytearray``). The new spelling is expected to be easier to discover and easier to read (especially when used in conjunction with indexing operations on binary sequence types).

As a separate method, the new spelling will also work better with higher order functions like ``map``.


Addition of optimised iterator methods that produce ``bytes`` objects
---------------------------------------------------------------------

This PEP proposes that ``bytes``, ``bytearray`` and ``memoryview`` gain an optimised ``iterbytes`` method that produces length 1 ``bytes`` objects rather than integers::

    for x in data.iterbytes():
        # x is a length 1 ``bytes`` object, rather than an integer

The method can be used with arbitrary buffer exporting objects by wrapping them in a ``memoryview`` instance first::

    for x in memoryview(data).iterbytes():
        # x is a length 1 ``bytes`` object, rather than an integer

For ``memoryview``, the semantics of ``iterbytes()`` are defined such that::

    memview.tobytes() == b''.join(memview.iterbytes())

This allows the raw bytes of the memory view to be iterated over without needing to make a copy, regardless of the defined shape and format.

The main advantage this method offers over the ``map(bytes.byte, data)`` approach is that it is guaranteed *not* to fail midstream with a ``ValueError`` or ``TypeError``. By contrast, when using the ``map`` based approach, the type and value of the individual items in the iterable are only checked as they are retrieved and passed through the ``bytes.byte`` constructor.


Design discussion
=================

Why not rely on sequence repetition to create zero-initialised sequences?
-------------------------------------------------------------------------

Zero-initialised sequences can be created via sequence repetition::

    >>> b'\x00' * 3
    b'\x00\x00\x00'
    >>> bytearray(b'\x00') * 3
    bytearray(b'\x00\x00\x00')

However, this was also the case when the ``bytearray`` type was originally designed, and the decision was made to add explicit support for it in the type constructor. The immutable ``bytes`` type then inherited that feature when it was introduced in PEP 3137.

This PEP isn't revisiting that original design decision, just changing the spelling as users sometimes find the current behaviour of the binary sequence constructors surprising. In particular, there's a reasonable case to be made that ``bytes(x)`` (where ``x`` is an integer) should behave like the ``bytes.byte(x)`` proposal in this PEP. Providing both behaviours as separate class methods avoids that ambiguity.


References
==========

.. [1] Initial March 2014 discussion thread on python-ideas
   (https://mail.python.org/pipermail/python-ideas/2014-March/027295.html)
.. [2] Guido's initial feedback in that thread
   (https://mail.python.org/pipermail/python-ideas/2014-March/027376.html)
.. [3] Issue proposing moving zero-initialised sequences to a dedicated API
   (http://bugs.python.org/issue20895)
.. [4] Issue proposing to use calloc() for zero-initialised binary sequences
   (http://bugs.python.org/issue21644)
.. [5] August 2014 discussion thread on python-dev
   (https://mail.python.org/pipermail/python-ideas/2014-March/027295.html)


Copyright
=========

This document has been placed in the public domain.

On Jun 07, 2016, at 01:28 PM, Ethan Furman wrote:
* Add ``bytes.iterbytes``, ``bytearray.iterbytes`` and ``memoryview.iterbytes`` alternative iterators
+1 but I want to go just a little farther. We can't change bytes.__getitem__ but we can add another method that returns single byte objects? I think it's still a bit of a pain to extract single bytes even with .iterbytes(). Maybe .iterbytes can take a single index argument (blech) or add a method like .byte_at(i). I'll let you bikeshed on the name. Cheers, -Barry

On Wed, Jun 8, 2016 at 12:57 AM, Barry Warsaw <barry@python.org> wrote:
And if this is called __getitem__ (with slices delegated to bytes.__getitem__) and implemented in a class, one has a view. Maybe I'm missing something, but I fail to understand what makes this significantly more problematic than an iterator. Ok, I guess we might also need __len__. -- Koos

On 7 June 2016 at 15:22, Koos Zevenhoven <k7hoven@gmail.com> wrote:
Right, it's the fact that a view is a much broader API than we need, since most of the operations on the base type are already fine. The two alternate operations that people are interested in are:

- like indexing, but producing bytes instead of ints
- like iteration, but producing bytes instead of ints

That said, it occurs to me that there's a reasonably strong composability argument in favour of a view-based approach: a view will work with operator.itemgetter() and other sequence consuming APIs, while special methods won't. The "like-memoryview-but-not" view type could also take any bytes-like object as input, similar to memoryview itself.

Cheers, Nick.

P.S. I'm starting to remember why I stopped working on this - I'm genuinely unsure of the right way forward, so I wasn't prepared to advocate strongly for the particular approach in the PEP :)

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 08.06.16 02:03, Nick Coghlan wrote:
Something like:

    class chunks:
        def __init__(self, seq, size):
            self._seq = seq
            self._size = size

        def __len__(self):
            return len(self._seq) // self._size

        def __getitem__(self, i):
            chunk = self._seq[i: i + self._size]
            if len(chunk) != self._size:
                raise IndexError
            return chunk

(but needs more checks and slices support). It would be useful for general sequences too.
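
A slightly fuller sketch of the same idea, assuming Python 3. The name `ChunkView`, the default `size=1`, and the choice to return slices as a list are illustrative assumptions, not part of the PEP or of the message above. Unlike the snippet, it scales the index by the chunk size so that sizes other than 1 behave consistently with `__len__`.

```python
from collections.abc import Sequence


class ChunkView(Sequence):
    """Read-only view yielding fixed-size chunks of an underlying sequence.

    Hypothetical helper for illustration only; not a proposed stdlib API.
    """

    def __init__(self, seq, size=1):
        if size < 1:
            raise ValueError("size must be positive")
        self._seq = seq
        self._size = size

    def __len__(self):
        # Trailing items that do not fill a whole chunk are ignored.
        return len(self._seq) // self._size

    def __getitem__(self, i):
        if isinstance(i, slice):
            # Slices are resolved chunk by chunk and returned as a list.
            return [self[j] for j in range(*i.indices(len(self)))]
        if i < 0:
            i += len(self)
        if not 0 <= i < len(self):
            raise IndexError("chunk index out of range")
        start = i * self._size
        return self._seq[start:start + self._size]


view = ChunkView(b"abcdef")
print(view[0], view[-1])       # b'a' b'f'
print(ChunkView(b"abcdef", 2)[1])  # b'cd'
```

Because the class derives from `collections.abc.Sequence`, iteration and `in` come for free from `__getitem__` and `__len__`, which is the composability point raised earlier in the thread.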

On 7 June 2016 at 14:31, Barry Warsaw <barry@python.org> wrote:
Perhaps:

    data.getbyte(i)
    data.iterbytes()

The rationale for "Why not a live view?" is that an iterator is simple to define and implement, while we know from experience with memoryview and the various dict views that live views are a minefield for folks defining new container types. Since this PEP would in some sense change what it means to implement a full "bytes-like object", it's worth keeping implementation complexity in mind.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

data.getbyte(index_or_slice_object) ? while it might not be... ideal... to create a sliceable live view object, we can have a method that accepts a slice, even if we have to create it manually (or at least make it convenient for those who wish to wrap a bytes object in their own type and blindly pass the first-non-self arg of a custom __getitem__ to the method).

Hello, On Tue, 07 Jun 2016 13:28:13 -0700 Ethan Furman <ethan@stoneleaf.us> wrote:
[]
Why the desire to break applications of thousands and thousands of people? Besides, bytes(3) behavior is very logical. Everyone who knows what malloc(3) does also knows what bytes(3) does. Those who don't can learn, and eventually be grateful that learning Python actually helped them to learn another language as well. []
The documentation should probably also mention that bytes.byte(x) is equivalent to x.to_bytes(1, "little"). [] -- Best regards, Paul mailto:pmiscml@gmail.com
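
A short check of that equivalence using spellings that already exist in Python 3 (`bytes.byte` itself is only proposed). One caveat the sketch makes visible: for out-of-range values the two spellings raise different exception types.

```python
# Existing ways to build a one-byte bytes object from an integer.
x = 65
assert bytes([x]) == b"A"
assert x.to_bytes(1, "little") == b"A"
# For a single byte, the byte order makes no difference.
assert x.to_bytes(1, "big") == x.to_bytes(1, "little")

# Out-of-range values are rejected by both, with different exception types.
for bad in (256, -1):
    try:
        bad.to_bytes(1, "little")
    except OverflowError as exc:
        print("to_bytes:", exc)
    try:
        bytes([bad])
    except ValueError as exc:
        print("bytes([x]):", exc)
```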

On 7 June 2016 at 14:33, Paul Sokolovsky <pmiscml@gmail.com> wrote:
Same argument as any deprecation: to make existing and future defects easier to find or easier to debug. That said, this is the main part I was referring to in the other thread when I mentioned some of the constructor changes were potentially controversial and probably not worth the hassle - it's the only one with the potential to break currently working code, while the others are just a matter of choosing suitable names. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 7 June 2016 at 21:56, Nick Coghlan <ncoghlan@gmail.com> wrote:
An argument against deprecating bytearray(n) in particular is that this is supported in Python 2. I think I have (ab)used this fact to work around the problem with bytes(n) in Python 2 & 3 compatible code.

On 06/07/2016 02:33 PM, Paul Sokolovsky wrote:
Two reasons: 1) bytes are immutable, so creating a 3-byte 0x00 string seems ridiculous; 2) Python is not C, and the vagaries of malloc are not relevant to Python. However, there is little point in breaking working code, so a deprecation without removal is fine by me. -- ~Ethan~

Hello, On Tue, 07 Jun 2016 15:46:00 -0700 Ethan Furman <ethan@stoneleaf.us> wrote:
There's nothing ridiculous in sending N zero bytes over the network, writing them to a file, or transferring them to a hardware device. That however raises questions, e.g. how to (efficiently) fill a subsection of a bytearray with something other than 0, and how to apply all that consistently to array.array, but I don't even want to bring it up, because the answer will be "we need first to deal with subjects of this PEP".
2) Python is not C, and the vagaries of malloc are not relevant to Python.
Yes, but Python has always had some traits nicely similar to C (% formatting, os.read/write at the fingertips, this bytes/bytearray constructor, etc.), and that certainly catered for a sizable share of its audience. It's nice that nowadays Python is truly multi-paradigm and taught to pre-schools and used by folks who know how to analyze data much better than how to allocate memory to hold that data in the first place. But hopefully people who have used Python since 1.x as a nice system-level integration language, concise without much ambiguity (definitely less than other languages, maybe COBOL excluded), shouldn't suffer and have their stuff broken.
However, there is little point in breaking working code, so a deprecation without removal is fine by me.
Thanks.
-- ~Ethan~
-- Best regards, Paul mailto:pmiscml@gmail.com

On Wed, Jun 08, 2016 at 02:17:12AM +0300, Paul Sokolovsky wrote:
I'm not so sure that *thousands* of people are relying on this behaviour, but your point is taken that it is a backwards-incompatible change.
Besides, bytes(3) behavior is very logical. Everyone who knows what malloc(3) does also knows what bytes(3) does.
Most Python coders are not C coders. Knowing C is not and should not be a pre-requisite for using Python.
I really don't think that learning Python will help with C.
True, but there is a good way of writing N identical bytes, not limited to nulls, using the replication operator:

    py> b'\xff'*10
    b'\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff'

which is more useful than `bytes(10)` since that can only produce zeroes.
Slicing.

    py> b = bytearray(10)
    py> b[4:4] = b'\xff'*4
    py> b
    bytearray(b'\x00\x00\x00\x00\xff\xff\xff\xff\x00\x00\x00\x00\x00\x00')

-- Steve
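
Note that assigning to the empty slice `b[4:4]` inserts bytes, which is why the result above is 14 bytes long. A small sketch of the in-place variant, where the slice and the replacement have the same length so the buffer keeps its size (variable names are illustrative):

```python
buf = bytearray(10)

# Assigning to an equal-length slice overwrites in place; the length is unchanged.
buf[4:8] = b"\xff" * 4
print(buf)       # bytearray(b'\x00\x00\x00\x00\xff\xff\xff\xff\x00\x00')
print(len(buf))  # 10

# Assigning to an empty slice inserts instead, so the buffer grows.
grown = bytearray(10)
grown[4:4] = b"\xff" * 4
print(len(grown))  # 14
```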

On Tue, Jun 7, 2016 at 11:28 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
Why not bytes.viewbytes (or whatever name) so that one could also subscript it? And if it were a property, one could perhaps conveniently get the n'th byte:

    b'abcde'.viewbytes[n]  # compared to b'abcde'[n:n+1]

Also, would it not be more clear to call the int -> bytes method something like bytes.fromint or bytes.fromord and introduce the same thing on str? And perhaps allow multiple arguments to create a str/bytes of length > 1. I guess this may violate TOOWTDI, but anyway, just a thought.

-- Koos

On 7 June 2016 at 20:28, Ethan Furman <ethan@stoneleaf.us> wrote:
bytes.byte() is a great idea. But what’s the point or use case of bytearray.byte(), a mutable array of one pre-defined byte?
Might be good to have an example with concrete output, so you see the one-byte strings coming out of it.
    >>> tuple(b"ABC".iterbytes())
    (b'A', b'B', b'C')
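
Since `iterbytes()` is only proposed, the output above cannot be reproduced directly. A rough emulation with existing APIs, assuming a contiguous bytes-like input; the free function `iterbytes` here is a hypothetical stand-in, not the proposed method:

```python
def iterbytes(data):
    """Yield length-1 bytes objects from a contiguous bytes-like object.

    Hypothetical stand-in for the proposed method; not a real bytes API.
    """
    # cast("B") presents the buffer as unsigned bytes regardless of its
    # original item format (it requires a C-contiguous buffer).
    view = memoryview(data).cast("B")
    for i in range(len(view)):
        yield bytes(view[i:i + 1])


print(tuple(iterbytes(b"ABC")))  # (b'A', b'B', b'C')
```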

On Jun 08, 2016, at 02:01 AM, Martin Panter wrote:
bytes.byte() is a great idea. But what’s the point or use case of bytearray.byte(), a mutable array of one pre-defined byte?
I like bytes.byte() too. I would guess you'd want the same method on bytearray for duck typing APIs. -Barry

On 07.06.16 23:28, Ethan Furman wrote:
"Byte" is an alias to "octet" (8-bit integer) in modern terminology. Iterating bytes and bytearray already produce bytes. Wouldn't this be confusing? Maybe name these methods "iterbytestrings", since they add str-like behavior?

On 06/07/2016 10:42 PM, Serhiy Storchaka wrote:
On 07.06.16 23:28, Ethan Furman wrote:
Maybe so, but not, to my knowledge, in Python terminology.
Iterating bytes and bytearray already produce bytes.
No, it produces integers:
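
The example that followed is missing from this archive copy. A minimal illustration of the behaviour being described, using only existing Python 3 semantics:

```python
data = b"abc"

# Iterating and indexing both yield integers ...
print(list(data))              # [97, 98, 99]
print(data[0])                 # 97
print(list(bytearray(data)))   # [97, 98, 99]

# ... while slicing yields a bytes object, even for a single position.
print(data[0:1])               # b'a'
```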
-- ~Ethan~

Ethan Furman writes:
* Deprecate passing single integer values to ``bytes`` and ``bytearray``
Why? This is a slightly awkward idiom compared to .zeros (EITBI etc), but your 32-bit clock will roll over before we can actually remove it. There are a lot of languages that do this kind of initialization of arrays based on ``count``.

If you want to do something useful here, add an optional argument (here in ridiculous :-) generality:

    bytes(count, tile=[0]) -> bytes(tile * count)

where ``tile`` is a Sequence of a type that is acceptable to bytes anyway, or Sequence[int], which is treated as

    b"".join([bytes([i]) for i in tile] * count)

Interpretation of ``count`` is of course bikesheddable, with at least one alternative interpretation (length of result bytes, with last tile truncated if necessary).
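
Both readings of ``count`` can be sketched as plain functions. The names `tiled_bytes` and `tiled_to_length` are hypothetical, and nothing here changes the real `bytes` constructor; `tile` is assumed to be a bytes-like object or a sequence of ints in range(256).

```python
def tiled_bytes(count, tile=(0,)):
    """Return ``tile`` repeated ``count`` times (first interpretation)."""
    return bytes(tile) * count


def tiled_to_length(length, tile=(0,)):
    """Return exactly ``length`` bytes, truncating the last tile if needed."""
    tile = bytes(tile)
    if not tile:
        raise ValueError("tile must not be empty")
    repeats = -(-length // len(tile))  # ceiling division
    return (tile * repeats)[:length]


print(tiled_bytes(3))                # b'\x00\x00\x00'
print(tiled_bytes(2, [1, 2]))        # b'\x01\x02\x01\x02'
print(tiled_to_length(5, b"ab"))     # b'ababa'
```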
* Add ``bytes.zeros`` and ``bytearray.zeros`` alternative constructors
this is an API break if you take the deprecation as a mandate (which eventual removal does indicate). And backward compatibility for clients of the bytes API means that we violate TOOWTDI indefinitely, on a constructor of quite specialized utility. Yuck. -1 on both. Barry Warsaw writes later in thread:
+1 ISTM that more than the other changes, this is the most important one. Steve

Hi,
I'm opposed to this change (presented like that). Please stop breaking backward compatibility in minor versions.

I have been porting Python 2 code to Python 3 for more than 2 years. At first, the only approach proposed for Python 3 was to drop Python 2 support immediately using the 2to3 tool. It simply doesn't work, because you must port all dependencies incrementally, so you must write code working with Python 2 and Python 3 using the same code base. A few people tried to duplicate repositories, projects, project names, etc. to have one version for Python 2 and one version for Python 3, but IMHO it's even worse. It's very difficult to handle dependencies that way.

It took a few years until six was widely used and pip was popular enough to be able to add six as a *dependency* (and not put an old copy in the project).

Basically, you propose to introduce a backward incompatible change for free (I fail to see the benefit of replacing bytes(n) with bytes.zeros(n)) and without an obvious way to write code compatible with Python <= 3.6 and Python >= 3.7. Moreover, a single cycle is way too short to port all code in the wild.

It's common that users complain that Python core developers like breaking the compatibility at each release. Recently, I saw a list of applications which need to be ported to Python 3.5, while they work perfectly on Python 3.4.

*If* you still want to deprecate bytes(n), you must introduce a helper working on *all* Python versions. Obviously, the helper must be available and work for Python 2.7. Maybe it can be the six module. Maybe something else.

In Perl 5, there is a nice "use 5.12;" pragma to explicitly ask to keep compatibility with Perl 5.12. This pragma makes it easier to change the language, since you can port code file by file. I don't know if it's technically possible in Python, maybe not for all kinds of backward incompatible changes.

Victor

On 08.06.16 11:04, Victor Stinner wrote:
The argument for deprecating bytes(n) is that it has a different meaning in Python 2, so when backporting code to Python 2 or writing 2+3 compatible code there is a risk of making a mistake. This argument does not apply to bytearray(n).
The obvious way to create a bytes object of length n is b'\0' * n. It works in all Python versions starting from 2.6. I don't see the need for bytes(n) and bytes.zeros(n). There are no special methods for creating a list or a string of size n.

Hello, On Wed, 8 Jun 2016 11:53:06 +0300 Serhiy Storchaka <storchaka@gmail.com> wrote:
That's an artifact (as in: defect) of "bytes" (apparently) being a flat alias of "str" in Python2, without trying to validate its arguments. It would be sad if thinkos in the Python2 implementation dictated how Python3 should work. It's not too late to fix it in Python2 by issuing a CVE along the lines of "Lack of argument validation in Python2 bytes() constructor may lead to insecure code."
That's very inefficient: it requires allocating a useless b'\0', then a generic function to repeat an arbitrary memory block N times. If there is talk of Python not being laughed at for being SLOW, there should rather be efficient ways to deal with blocks of binary data.
So, above, unless you specifically mean having bytearray.zeros() and not having bytes.zeros(). But then the whole purpose of the presented PEP is to make the API more, not less, consistent. Having random gaps in the bytes vs bytearray API isn't going to help anyone. -- Best regards, Paul mailto:pmiscml@gmail.com

On 08.06.16 13:37, Paul Sokolovsky wrote:
Do you have any evidence for this claim?

    $ ./python -m timeit -s 'n = 10000' -- 'bytes(n)'
    1000000 loops, best of 3: 1.32 usec per loop
    $ ./python -m timeit -s 'n = 10000' -- 'b"\0" * n'
    1000000 loops, best of 3: 0.858 usec per loop
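
The same comparison can be run from a script with the `timeit` module. This is only a sketch: the loop count is an arbitrary choice, and the timings depend on the interpreter build and machine, so no particular result is implied.

```python
import timeit

n = 10000
number = 20000  # loops per measurement; arbitrary choice for a quick run

for stmt in ("bytes(n)", 'b"\\0" * n'):
    # Take the best of three repeats, as the command-line tool does.
    best = min(timeit.repeat(stmt, globals={"n": n}, number=number, repeat=3))
    print(f"{stmt:12} {best / number * 1e6:.3f} usec per loop")
```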

Hello, On Wed, 8 Jun 2016 14:05:19 +0300 Serhiy Storchaka <storchaka@gmail.com> wrote:
Yes, it's written above, let me repeat it: bytes(n) is (can be) calloc(1, n) underlyingly, while b"\0" * n is a more complex algorithm.
I don't know how inefficient CPython's bytes(n) is or how efficient repetition is (maybe 1-byte repetitions are optimized into memset()?), but MicroPython (where bytes(n) is truly calloc(n)) gives expected results:

    $ ./run-bench-tests bench/bytealloc*
    bench/bytealloc:
        3.333s (+00.00%) bench/bytealloc-1-bytes_n.py
        11.244s (+237.35%) bench/bytealloc-2-repeat.py

-- Best regards, Paul mailto:pmiscml@gmail.com

On 08.06.16 14:26, Paul Sokolovsky wrote:
If the performance of creating an immutable array of n zero bytes is important in MicroPython, it is worth optimizing b"\0" * n. For now CPython is the main implementation of Python 3 and bytes(n) is slower than b"\0" * n in CPython.

Hello, On Wed, 8 Jun 2016 14:45:22 +0300 Serhiy Storchaka <storchaka@gmail.com> wrote: []
No matter how you optimize calloc + something, it's always slower than just calloc.
For now CPython is the main implementation of Python 3
Indeed, and it already has bytes(N). So, perhaps nothing should be done about it except leaving it alone. Perhaps, more discussion should go into whether there's need for .iterbytes() if there's [i:i+1] already. (I personally skip that, as I find [i:i+1] perfectly ok, and while I can't understand how people may be not ok with it up to wanting something more, I leave such possibility).
and bytes(n) is slower than b"\0" * n in CPython.
-- Best regards, Paul mailto:pmiscml@gmail.com

On Jun 8, 2016 8:13 AM, "Paul Sokolovsky" <pmiscml@gmail.com> wrote:
`bytes(n)` *is* calloc + something. It's a lookup of and call to a global function. (Unless MicroPython optimizes away lookups for builtins, in which case it can theoretically optimize b"\0".__mul__.) On the other hand, b"\0" is a constant, and * is an operator lookup that succeeds on the first argument (meaning, perhaps, a successful branch prediction). As a constant, it is only created once, so there's no intermediate object created. AFAICT, the first requires optimizing global function lookups + calls, and the second requires optimizing lookup and *successful* application of __mul__ (versus failure + fallback to some __rmul__), and repetitions of a particular `bytes` object (which can be interned and checked against). That means there is room for either to win, depending on the efforts of the implementers. (However, `bytearray` has no syntax for literals (and therefore easy constants), and is a more valid and, AFAIK, more practical concern.)

On Wed, Jun 08, 2016 at 10:04:08AM +0200, Victor Stinner wrote:
It's common that users complain that Python core developers like breaking the compatibility at each release.
No more common than users complaining that Python features are badly designed and crufty and should be fixed.

Whatever we do, we can't win. If we fix misfeatures, people complain. If we don't fix them, people complain. Sometimes the same people, depending on their specific needs. "Fix this, because it annoys me, but don't fix that, because I'm used to it and it doesn't annoy me any more."

*shrug*

Ultimately it comes down to a subjective feeling as to which is worse. My own subjective feeling is that, in the long run, we'll be better off fixing bytes than keeping it, and the longer we wait to fix it, the harder it will be.

-- Steve

On Jun 07, 2016, at 01:28 PM, Ethan Furman wrote:
Does it need to be *actually* removed? That does break existing code for not a lot of benefit. Yes, the default constructor is a little wonky, but with the addition of the new constructors, and the fact that you're not proposing to eventually change the default constructor, removal seems unnecessary. Besides, once it's removed, what would `bytes(3)` actually do? The PEP doesn't say. Also, since you're proposing to add `bytes.byte(3)` have you considered also adding an optional count argument? E.g. `bytes.byte(3, count=7)` would yield b'\x03\x03\x03\x03\x03\x03\x03'. That seems like it could be useful. Cheers, -Barry
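
The suggested `count` argument can be approximated with existing operations. The function name `byte` below is a hypothetical stand-in for the proposed class method, written as a free function:

```python
def byte(value, count=1):
    """Hypothetical stand-in for the suggested bytes.byte(value, count=...)."""
    # bytes([value]) validates that value is an int in range(0, 256).
    return bytes([value]) * count


print(byte(3))           # b'\x03'
print(byte(3, count=7))  # b'\x03\x03\x03\x03\x03\x03\x03'
```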

On 9 June 2016 at 19:21, Barry Warsaw <barry@python.org> wrote:
Raise TypeError, presumably. However, I agree this isn't worth the hassle of breaking working code, especially since truly ludicrous values will fail promptly with MemoryError - it's only a particular range of values that fit within the limits of the machine, but also push it into heavy swapping that are a potential problem.
The purpose of bytes.byte() in the PEP is to provide a way to roundtrip ord() calls with binary inputs, since the current spelling is pretty unintuitive:

    >>> ord("A")
    65
    >>> chr(ord("A"))
    'A'
    >>> ord(b"A")
    65
    >>> bytes([ord(b"A")])
    b'A'

That said, perhaps it would make more sense for the corresponding round-trip to be:

    >>> bchr(ord("A"))
    b'A'

With the "b" prefix on "chr" reflecting the "b" prefix on the output. This also inverts the chr/unichr pairing that existed in Python 2 (replacing it with bchr/chr), and is hence very friendly to compatibility modules like six and future (future.builtins already provides a chr that behaves like the Python 3 one, and bchr would be much easier to add to that than a new bytes object method).

In terms of an efficient memory-preallocation interface, the equivalent NumPy operation to request a pre-filled array is "numpy.full": http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.full.html (there's also an inplace mutation operation, "fill").

For bytes and bytearray though, that has an unfortunate name collision with "zfill", which refers to zero-padding numeric values for fixed width display.

If the PEP just added bchr() to complement chr(), and [bytes, bytearray].zeros() as a more discoverable alternative to passing integers to the default constructor, I think that would be a decent step forward, and the question of pre-initialising with arbitrary values can be deferred for now (and perhaps left to NumPy indefinitely).

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
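
A minimal sketch of what such a `bchr` could look like if written today as a plain function. The builtin does not exist; this definition is an assumption that simply wraps the current `bytes([x])` spelling.

```python
def bchr(i):
    """Hypothetical binary counterpart to chr(): int in range(256) -> bytes."""
    return bytes([i])


# Round trip with ord(), mirroring chr(ord("A")) for text.
assert bchr(ord(b"A")) == b"A"
assert ord(bchr(65)) == 65

# As a plain callable it composes with map() over a bytes object.
print(list(map(bchr, b"ABC")))  # [b'A', b'B', b'C']
```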

On Jun 07, 2016, at 01:28 PM, Ethan Furman wrote:
* Add ``bytes.iterbytes``, ``bytearray.iterbytes`` and ``memoryview.iterbytes`` alternative iterators
+1 but I want to go just a little farther. We can't change bytes.__getitem__ but we can add another method that returns single byte objects? I think it's still a bit of a pain to extract single bytes even with .iterbytes(). Maybe .iterbytes can take a single index argument (blech) or add a method like .byte_at(i). I'll let you bikeshed on the name. Cheers, -Barry

On Wed, Jun 8, 2016 at 12:57 AM, Barry Warsaw <barry@python.org> wrote:
And if this is called __getitem__ (with slices delegated to bytes.__getitem__) and implemented in a class, one has a view. Maybe I'm missing something, but I fail to understand what makes this significantly more problematic than an iterator. Ok, I guess we might also need __len__. -- Koos

On 7 June 2016 at 15:22, Koos Zevenhoven <k7hoven@gmail.com> wrote:
Right, it's the fact that a view is a much broader API than we need, since most of the operations on the base type are already fine. The two alternate operations that people are interested in are: - like indexing, but producing bytes instead of ints - like iteration, but producing bytes instead of ints That said, it occurs to me that there's a reasonably strong composability argument in favour of a view-based approach: a view will work with operator.itemgetter() and other sequence consuming APIs, while special methods won't. The "like-memoryview-but-not" view type could also take any bytes-like object as input, similar to memoryview itself. Cheers, Nick. P.S. I'm starting to remember why I stopped working on this - I'm genuinely unsure of the right way forward, so I wasn't prepared to advocate strongly for the particular approach in the PEP :) -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 08.06.16 02:03, Nick Coghlan wrote:
Something like: class chunks: def __init__(self, seq, size): self._seq = seq self._size = size def __len__(self): return len(self._seq) // self._size def __getitem__(self, i): chunk = self._seq[i: i + self._size] if len(chunk) != self._size: raise IndexError return chunk (but needs more checks and slices support). It would be useful for general sequences too.

On 7 June 2016 at 14:31, Barry Warsaw <barry@python.org> wrote:
Perhaps: data.getbyte(i) data.iterbytes() The rationale for "Why not a live view?" is that an iterator is simple to define and implement, while we know from experience with memoryview and the various dict views that live views are a minefield for folks defining new container types. Since this PEP would in some sense change what it means to implement a full "bytes-like object", it's worth keeping implementation complexity in mind. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

data.getbyte(index_or_slice_object) ? while it might not be... ideal... to create a sliceable live view object, we can have a method that accepts a slice, even if we have to create it manually (or at least make it convenient for those who wish to wrap a bytes object in their own type and blindly pass the first-non-self arg of a custom __getitem__ to the method).

Hello, On Tue, 07 Jun 2016 13:28:13 -0700 Ethan Furman <ethan@stoneleaf.us> wrote:
[]
Why the desire to break applications of thousands and thousands of people? Besides, bytes(3) behavior is very logical. Everyone who knows what malloc(3) does also knows what bytes(3) does. Who doesn't, can learn, and eventually be grateful that learning Python actually helped them to learn other language as well. []
The documentation should probably also mention that bytes.byte(x) is equivalent to x.to_bytes(1, "little"). [] -- Best regards, Paul mailto:pmiscml@gmail.com

On 7 June 2016 at 14:33, Paul Sokolovsky <pmiscml@gmail.com> wrote:
Same argument as any deprecation: to make existing and future defects easier to find or easier to debug. That said, this is the main part I was referring to in the other thread when I mentioned some of the constructor changes were potentially controversial and probably not worth the hassle - it's the only one with the potential to break currently working code, while the others are just a matter of choosing suitable names. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 7 June 2016 at 21:56, Nick Coghlan <ncoghlan@gmail.com> wrote:
An argument against deprecating bytearray(n) in particular is that this is supported in Python 2. I think I have (ab)used this fact to work around the problem with bytes(n) in Python 2 & 3 compatible code.

On 06/07/2016 02:33 PM, Paul Sokolovsky wrote:
Two reasons: 1) bytes are immutable, so creating a 3-byte 0x00 string seems ridiculous; 2) Python is not C, and the vagaries of malloc are not relevant to Python. However, there is little point in breaking working code, so a deprecation without removal is fine by me. -- ~Ethan~

Hello, On Tue, 07 Jun 2016 15:46:00 -0700 Ethan Furman <ethan@stoneleaf.us> wrote:
There's nothing ridiculous in sending N zero bytes over network, writing to a file, transferring to a hardware device. That however raises questions e.g. how to (efficiently) fill a (subsection) of bytearray with something but a 0, and how to apply all that consistently to array.array, but I don't even want to bring it, because the answer will be "we need first to deal with subjects of this PEP".
2) Python is not C, and the vagaries of malloc are not relevant to Python.
Yes, but Python has always had some traits nicely similar to C, (% formatting, os.read/write at the fingertips, this bytes/bytearray constructor, etc.), and that certainly catered for sizable share of its audience. It's nice that nowadays Python is truly multi-paradigm and taught to pre-schools and used by folks who know how to analyze data much better than how to allocate memory to hold that data in the first place. But hopefully people who used Python since 1.x as a nice system-level integration language, concise without much ambiguity (definitely less than other languages, maybe COBOL excluded) shouldn't suffer and have their stuff broken.
However, there is little point in breaking working code, so a deprecation without removal is fine by me.
Thanks.
-- ~Ethan~
-- Best regards, Paul mailto:pmiscml@gmail.com

On Wed, Jun 08, 2016 at 02:17:12AM +0300, Paul Sokolovsky wrote:
I'm not so sure that *thousands* of people are relying on this behaviour, but your point is taken that it is a backwards-incompatible change.
Besides, bytes(3) behavior is very logical. Everyone who knows what malloc(3) does also knows what bytes(3) does.
Most Python coders are not C coders. Knowing C is not and should not be a pre-requisite for using Python.
I really don't think that learning Python will help with C.
True, but there is a good way of writing N identical bytes, not limited to nulls, using the replication operator:
py> b'\xff'*10
b'\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff'
which is more useful than `bytes(10)` since that can only produce zeroes.
Slicing.
py> b = bytearray(10)
py> b[4:4] = b'\xff'*4
py> b
bytearray(b'\x00\x00\x00\x00\xff\xff\xff\xff\x00\x00\x00\x00\x00\x00')
-- Steve
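Note that assigning to the empty slice `b[4:4]` inserts bytes and grows the buffer, as the 14-byte result above shows. A minimal sketch of the difference between inserting and filling in place, using only existing bytearray behaviour (variable names are illustrative):

```python
buf = bytearray(10)

# Equal-length slice assignment overwrites in place; the length stays 10.
filled = bytearray(buf)
filled[4:8] = b"\xff" * 4

# Assigning to an empty slice (4:4) inserts instead, growing the buffer to 14 bytes.
grown = bytearray(buf)
grown[4:4] = b"\xff" * 4

print(len(filled), len(grown))
```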

On Tue, Jun 7, 2016 at 11:28 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
Why not bytes.viewbytes (or whatever name) so that one could also subscript it? And if it were a property, one could perhaps conveniently get the n'th byte: b'abcde'.viewbytes[n] # compared to b'abcde'[n:n+1] Also, would it not be more clear to call the int -> bytes method something like bytes.fromint or bytes.fromord and introduce the same thing on str? And perhaps allow multiple arguments to create a str/bytes of length > 1. I guess this may violate TOOWTDI, but anyway, just a thought. -- Koos
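One possible reading of the `viewbytes` suggestion above, sketched as a standalone wrapper class. `ByteView` is a hypothetical name, not part of the PEP or the standard library, and the real proposal would be a property on `bytes` itself.

```python
class ByteView:
    """Hypothetical subscriptable view whose items are length-1 bytes objects."""

    def __init__(self, data):
        self._data = data

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slices of the view are returned as plain bytes.
            return bytes(self._data[index])
        # range() normalises negative indexes and raises IndexError when out of range.
        i = range(len(self._data))[index]
        return bytes(self._data[i:i + 1])


view = ByteView(b"abcde")
print(view[1], view[-1], list(view))
```

Because the class defines `__len__` and `__getitem__`, iteration works through the sequence protocol, so the same object covers both the subscripting and the iteration use cases.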

On 7 June 2016 at 20:28, Ethan Furman <ethan@stoneleaf.us> wrote:
bytes.byte() is a great idea. But what’s the point or use case of bytearray.byte(), a mutable array of one pre-defined byte?
Might be good to have an example with concrete output, so you see the one-byte strings coming out of it.
>>> tuple(b"ABC".iterbytes())
(b'A', b'B', b'C')
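Since `iterbytes` does not exist yet, the output above can be approximated today with a small helper. This is a sketch under the assumption that the method yields length-1 `bytes` objects for every input type; the free function `iterbytes` below is hypothetical.

```python
def iterbytes(data):
    """Yield each element of a bytes-like object as a length-1 bytes object."""
    for i in range(len(data)):
        # Slicing keeps the binary type; bytes() normalises bytearray/memoryview slices.
        yield bytes(data[i:i + 1])


print(tuple(iterbytes(b"ABC")))
```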

On Jun 08, 2016, at 02:01 AM, Martin Panter wrote:
bytes.byte() is a great idea. But what’s the point or use case of bytearray.byte(), a mutable array of one pre-defined byte?
I like bytes.byte() too. I would guess you'd want the same method on bytearray for duck typing APIs. -Barry

On 07.06.16 23:28, Ethan Furman wrote:
"Byte" is an alias for "octet" (an 8-bit integer) in modern terminology. Iterating bytes and bytearray already produce bytes. Wouldn't this be confusing? Maybe name these methods "iterbytestrings", since they add str-like behavior?

On 06/07/2016 10:42 PM, Serhiy Storchaka wrote:
On 07.06.16 23:28, Ethan Furman wrote:
Maybe so, but not, to my knowledge, in Python terminology.
Iterating bytes and bytearray already produce bytes.
No, it produces integers:
-- ~Ethan~
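The example that followed "it produces integers" is not preserved in this archive. A short illustration of the behaviour being described, which is not the original example:

```python
data = b"ABC"

as_ints = list(data)      # iterating a bytes object yields int values
first = data[0]           # indexing also yields an int
first_slice = data[0:1]   # slicing yields a length-1 bytes object

print(as_ints, first, first_slice)
```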

Ethan Furman writes:
* Deprecate passing single integer values to ``bytes`` and ``bytearray``
Why? This is a slightly awkward idiom compared to .zeros (EITBI etc), but your 32-bit clock will roll over before we can actually remove it. There are a lot of languages that do this kind of initialization of arrays based on ``count``. If you want to do something useful here, add an optional argument (here in ridiculous :-) generality: bytes(count, tile=[0]) -> bytes(tile * count) where ``tile`` is a Sequence of a type that is acceptable to bytes anyway, or Sequence[int], which is treated as b"".join([bytes([i]) for i in tile] * count) Interpretation of ``count`` is of course bikesheddable, with at least one alternative interpretation (length of result bytes, with last tile truncated if necessary).
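Both interpretations of ``count`` described above can be sketched as ordinary functions. The names `tiled_bytes` and `tiled_to_length` are hypothetical; the actual suggestion is an extra argument on the `bytes` constructor, which this sketch does not implement.

```python
def tiled_bytes(count, tile=(0,)):
    """Repeat ``tile`` ``count`` times (first interpretation of ``count``)."""
    return bytes(tile) * count


def tiled_to_length(length, tile=(0,)):
    """Fill exactly ``length`` bytes, truncating the last tile if necessary."""
    unit = bytes(tile)
    if not unit:
        if length:
            raise ValueError("cannot fill a non-empty result from an empty tile")
        return b""
    repeats = -(-length // len(unit))  # ceiling division
    return (unit * repeats)[:length]


print(tiled_bytes(3), tiled_bytes(2, b"ab"), tiled_to_length(5, b"ab"))
```

With the default tile, both functions reduce to the current behaviour of `bytes(n)`.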
* Add ``bytes.zeros`` and ``bytearray.zeros`` alternative constructors
this is an API break if you take the deprecation as a mandate (which eventual removal does indicate). And backward compatibility for clients of the bytes API means that we violate TOOWTDI indefinitely, on a constructor of quite specialized utility. Yuck. -1 on both. Barry Warsaw writes later in thread:
+1 ISTM that more than the other changes, this is the most important one. Steve

Hi,
I'm opposed to this change (presented like that). Please stop breaking backward compatibility in minor versions. I have been porting Python 2 code to Python 3 for more than 2 years. At first, the only approach Python 3 proposed was to immediately drop Python 2 support using the 2to3 tool. It simply doesn't work because you must port all dependencies incrementally, so you must write code working with Python 2 and Python 3 using the same code base. A few people tried to duplicate repositories, projects, project names, etc. to have one version for Python 2 and one version for Python 3, but IMHO it's even worse. It's very difficult to handle dependencies that way. It took a few years until six was widely used and pip was popular enough to be able to add six as a *dependency* (and not put an old copy in the project). Basically, you propose to introduce a backward incompatible change for free (I fail to see the benefit of replacing bytes(n) with bytes.zeros(n)) and without an obvious way to write code compatible with Python <= 3.6 and Python >= 3.7. Moreover, a single cycle is way too short to port all code in the wild. It's common that users complain that Python core developers like breaking the compatibility at each release. Recently, I saw a list of applications which need to be ported to Python 3.5, while they work perfectly on Python 3.4. *If* you still want to deprecate bytes(n), you must introduce a helper working on *all* Python versions. Obviously, the helper must be available and work for Python 2.7. Maybe it can be the six module. Maybe something else. In Perl 5, there is a nice "use 5.12;" pragma to explicitly ask to keep compatibility with Perl 5.12. This pragma makes it easier to change the language, since you can port code file by file. I don't know if it's technically possible in Python, maybe not for all kinds of backward incompatible changes. Victor

On 08.06.16 11:04, Victor Stinner wrote:
The argument for deprecating bytes(n) is that it has a different meaning in Python 2, so when backporting code to Python 2 or writing 2+3 compatible code there is a risk of making a mistake. This argument is not applicable to bytearray(n).
The obvious way to create a bytes object of length n is b'\0' * n. It works in all Python versions starting from 2.6. I don't see the need for bytes(n) and bytes.zeros(n). There are no special methods for creating a list or a string of size n.
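A minimal sketch of the version-independent spelling described above, wrapped in helpers. The names `zero_bytes` and `zero_bytearray` are illustrative and do not come from six or any other library.

```python
def zero_bytes(n):
    """Return n zero bytes; the same expression behaves identically on Python 2.6+ and 3."""
    return b"\0" * n


def zero_bytearray(n):
    """Return a mutable buffer of n zero bytes; bytearray(n) is also valid in Python 2."""
    return bytearray(n)


print(zero_bytes(3), zero_bytearray(3))
```

On Python 2, `bytes(3)` returns the string `'3'`, which is the mistake the repetition spelling avoids.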

Hello, On Wed, 8 Jun 2016 11:53:06 +0300 Serhiy Storchaka <storchaka@gmail.com> wrote:
That's an artifact (as in: defect) of "bytes" (apparently) being a flat alias of "str" in Python2, without trying to validate its arguments. It would be sad if thinkos in the Python2 implementation dictated how Python3 should work. It's not too late to fix it in Python2 by issuing a CVE along the lines of "Lack of argument validation in Python2 bytes() constructor may lead to insecure code."
That's very inefficient: it requires allocating a useless b'\0', then a generic function to repeat an arbitrary memory block N times. If we don't want Python to be laughed at for being SLOW, there should rather be efficient ways to deal with blocks of binary data.
So, above, unless you specifically mean having bytearray.zeros() and not having bytes.zeros(). But then the whole purpose of the presented PEP is to make the API more, not less, consistent. Having random gaps in the bytes vs bytearray API isn't going to help anyone. -- Best regards, Paul mailto:pmiscml@gmail.com

On 08.06.16 13:37, Paul Sokolovsky wrote:
Do you have any evidence for this claim?
$ ./python -m timeit -s 'n = 10000' -- 'bytes(n)'
1000000 loops, best of 3: 1.32 usec per loop
$ ./python -m timeit -s 'n = 10000' -- 'b"\0" * n'
1000000 loops, best of 3: 0.858 usec per loop
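The same comparison can be reproduced from a script with the standard `timeit` module. This is only a sketch: the loop counts are arbitrary choices, and the timings depend on the interpreter and machine, so no expected numbers are shown.

```python
import timeit

n = 10000

# Best of three runs of 10,000 calls each, mirroring "best of 3" in the CLI output.
t_ctor = min(timeit.repeat("bytes(n)", globals={"n": n}, number=10000, repeat=3))
t_mul = min(timeit.repeat(r'b"\0" * n', globals={"n": n}, number=10000, repeat=3))

print("bytes(n):   %.3f usec per loop" % (t_ctor / 10000 * 1e6))
print('b"\\0" * n:  %.3f usec per loop' % (t_mul / 10000 * 1e6))
```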

Hello, On Wed, 8 Jun 2016 14:05:19 +0300 Serhiy Storchaka <storchaka@gmail.com> wrote:
Yes, it's written above, let me repeat it: bytes(n) is (can be) calloc(1, n) underlyingly, while b"\0" * n is a more complex algorithm.
I don't know how inefficient CPython's bytes(n) is or how efficient repetition is (maybe 1-byte repetitions are optimized into memset()?), but MicroPython (where bytes(n) is truly calloc(n)) gives the expected results:
$ ./run-bench-tests bench/bytealloc*
bench/bytealloc:
    3.333s (+00.00%) bench/bytealloc-1-bytes_n.py
    11.244s (+237.35%) bench/bytealloc-2-repeat.py
-- Best regards, Paul mailto:pmiscml@gmail.com

On 08.06.16 14:26, Paul Sokolovsky wrote:
If the performance of creating an immutable array of n zero bytes is important in MicroPython, it is worth optimizing b"\0" * n. For now CPython is the main implementation of Python 3 and bytes(n) is slower than b"\0" * n in CPython.

Hello, On Wed, 8 Jun 2016 14:45:22 +0300 Serhiy Storchaka <storchaka@gmail.com> wrote: []
No matter how you optimize calloc + something, it's always slower than just calloc.
For now CPython is the main implementation of Python 3
Indeed, and it already has bytes(N). So, perhaps nothing should be done about it except leaving it alone. Perhaps, more discussion should go into whether there's need for .iterbytes() if there's [i:i+1] already. (I personally skip that, as I find [i:i+1] perfectly ok, and while I can't understand how people may be not ok with it up to wanting something more, I leave such possibility).
and bytes(n) is slower than b"\0" * n in CPython.
-- Best regards, Paul mailto:pmiscml@gmail.com

On Jun 8, 2016 8:13 AM, "Paul Sokolovsky" <pmiscml@gmail.com> wrote:
`bytes(n)` *is* calloc + something. It's a lookup of and call to a global function. (Unless MicroPython optimizes away lookups for builtins, in which case it can theoretically optimize b"\0".__mul__.) On the other hand, b"\0" is a constant, and * is an operator lookup that succeeds on the first argument (meaning, perhaps, a successful branch prediction). As a constant, it is only created once, so there's no intermediate object created. AFAICT, the first requires optimizing global function lookups + calls, and the second requires optimizing lookup and *successful* application of __mul__ (versus failure + fallback to some __rmul__), and repetitions of a particular `bytes` object (which can be interned and checked against). That means there is room for either to win, depending on the efforts of the implementers. (However, `bytearray` has no syntax for literals (and therefore easy constants), and is a more valid and, AFAIK, more practical concern.)

On Wed, Jun 08, 2016 at 10:04:08AM +0200, Victor Stinner wrote:
It's common that users complain that Python core developers like breaking the compatibility at each release.
No more common than users complaining that Python features are badly designed and crufty and should be fixed. Whatever we do, we can't win. If we fix misfeatures, people complain. If we don't fix them, people complain. Sometimes the same people, depending on their specific needs. "Fix this, because it annoys me, but don't fix that, because I'm used to it and it doesn't annoy me any more." *shrug* Ultimately it comes down to a subjective feeling as to which is worse. My own subjective feeling is that, in the long run, we'll be better off fixing bytes than keeping it, and the longer we wait to fix it, the harder it will be. -- Steve

On Jun 07, 2016, at 01:28 PM, Ethan Furman wrote:
Does it need to be *actually* removed? That does break existing code for not a lot of benefit. Yes, the default constructor is a little wonky, but with the addition of the new constructors, and the fact that you're not proposing to eventually change the default constructor, removal seems unnecessary. Besides, once it's removed, what would `bytes(3)` actually do? The PEP doesn't say. Also, since you're proposing to add `bytes.byte(3)` have you considered also adding an optional count argument? E.g. `bytes.byte(3, count=7)` would yield b'\x03\x03\x03\x03\x03\x03\x03'. That seems like it could be useful. Cheers, -Barry
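The optional count argument suggested above could behave as in the following sketch. The free function `byte` is hypothetical; the PEP proposes a class method without a ``count`` parameter.

```python
def byte(value, count=1):
    """Return ``count`` copies of the single byte ``value`` (an int in range(256))."""
    # bytes([value]) raises ValueError outside range(256) and TypeError for non-integers.
    return bytes([value]) * count


print(byte(3), byte(3, count=7))
```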

On 9 June 2016 at 19:21, Barry Warsaw <barry@python.org> wrote:
Raise TypeError, presumably. However, I agree this isn't worth the hassle of breaking working code, especially since truly ludicrous values will fail promptly with MemoryError - it's only a particular range of values that fit within the limits of the machine, but also push it into heavy swapping that are a potential problem.
The purpose of bytes.byte() in the PEP is to provide a way to roundtrip ord() calls with binary inputs, since the current spelling is pretty unintuitive:
>>> ord("A")
65
>>> chr(ord("A"))
'A'
>>> ord(b"A")
65
>>> bytes([ord(b"A")])
b'A'
That said, perhaps it would make more sense for the corresponding round-trip to be:
>>> bchr(ord("A"))
b'A'
With the "b" prefix on "chr" reflecting the "b" prefix on the output. This also inverts the chr/unichr pairing that existed in Python 2 (replacing it with bchr/chr), and is hence very friendly to compatibility modules like six and future (future.builtins already provides a chr that behaves like the Python 3 one, and bchr would be much easier to add to that than a new bytes object method). In terms of an efficient memory-preallocation interface, the equivalent NumPy operation to request a pre-filled array is "numpy.full": http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.full.html (there's also an in-place mutation operation, "fill"). For bytes and bytearray though, that has an unfortunate name collision with "zfill", which refers to zero-padding numeric values for fixed width display. If the PEP just added bchr() to complement chr(), and [bytes, bytearray].zeros() as a more discoverable alternative to passing integers to the default constructor, I think that would be a decent step forward, and the question of pre-initialising with arbitrary values can be deferred for now (and perhaps left to NumPy indefinitely). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
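A minimal sketch of how the suggested `bchr` could be spelled in pure Python. It is not a builtin, and this definition is only an assumption about its behaviour based on the round-trip shown above.

```python
def bchr(i):
    """Hypothetical binary counterpart of chr(): int in range(256) -> length-1 bytes."""
    return bytes([i])


# Round-trip check: ord() inverts bchr() for every possible byte value.
round_trips = all(ord(bchr(i)) == i for i in range(256))
print(bchr(ord("A")), round_trips)
```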
participants (13)
- Barry Warsaw
- Brett Cannon
- Ethan Furman
- Franklin? Lee
- Koos Zevenhoven
- Martin Panter
- Nick Coghlan
- Paul Sokolovsky
- Serhiy Storchaka
- Stephen J. Turnbull
- Steven D'Aprano
- tritium-list@sdamon.com
- Victor Stinner