Fixing the Python 3 bytes constructor

One of the current annoyances with the bytes type in Python 3 is the way the constructor handles integers:
>>> bytes(3)
b'\x00\x00\x00'
It would be far more consistent with the behaviour of other bytes interfaces if the result of that call was instead b'\x03'. Instead, to get that behaviour, you currently have to wrap it in a list or other iterable:
>>> bytes([3])
b'\x03'
The other consequence of this is that there's currently no neat way to convert the integers produced by various bytes APIs back to a length one bytes object - we have no binary equivalent of "chr" to convert an integer in the range 0-255 inclusive to its bytes counterpart. The acceptance of PEP 461 means we'll get another option (b"%c".__mod__) but that's hardly what anyone would call obvious.

However, during a conversation today, a possible solution occurred to me: a "bytes.chr" class method, that served as an alternate constructor. That idea results in the following 3 part proposal:

1. Add "bytes.chr" such that "bytes.chr(x)" is equivalent to the PEP 461 defined "b'%c' % x"
2. Add "bytearray.allnull" and "bytes.allnull" such that "bytearray.allnull(x)" is equivalent to the current "bytearray(x)" int handling
3. Deprecate the current "bytes(x)" and "bytearray(x)" int handling as not only ambiguous, but actually a genuine bug magnet (it's way too easy to accidentally pass a large integer and try to allocate a ridiculously large bytes object)

For point 2, I also considered the following alternative names before settling on "allnull":

- bytes.null sounds too much like an alias for b"\x00"
- bytes.nulls just sounded too awkward to say (too many sibilants)
- bytes.zeros I can never remember how to spell (bytes.zeroes?)
- bytearray.cleared sort of worked, but bytes.cleared?
- ditto for bytearray.prealloc and bytes.prealloc (latter makes no sense)

That last is also a very C-ish name (although it is a rather C-ish operation).

Anyway, what do people think? Does anyone actually *like* the way the bytes constructor in Python 3 currently handles integers and want to keep it forever? Does the above proposal sound like a reasonable suggestion for improvement in 3.5? Does this hit PEP territory, since it's changing the signature and API of a builtin?

Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Fri, Mar 28, 2014 at 08:27:33PM +1000, Nick Coghlan wrote:
1. Add "bytes.chr" such that "bytes.chr(x)" is equivalent to the PEP 461 defined "b'%c' % x"
+1 on the concept, but please not "chr". How about bytes.byte(x)? That emphasises that you are dealing with a single byte, and avoids conflating chars (strings of length 1) with bytes.
+1 on bytearray.allnull, with a mild preference for spelling it "zeroes" instead. I'm indifferent about bytes.allnull. I'm not sure why I would use it in preference to b'\0'*x, or for that matter why I would use it at all. I suppose if there's a bytearray.allnull, for symmetry there should be a bytes.allnull as well.
+1 [...]
Not me!
On the vanishingly small chance that there is universal agreement on this, and a minimum of bike-shedding, I think it would still be useful to write up a brief half page PEP linking to the discussion here. -- Steven

On 28 March 2014 20:50, Steven D'Aprano <steve@pearwood.info> wrote:
If I could consistently remember whether to include the "e" or not, "zeroes" would be my preference as well, but almost every time I go to type it, I have to pause...
It's actually in the proposal solely because "bytes(x)" already works that way, and "this is deprecated, use bytes.allnull instead" is a much easier deprecation to sell.
Yeah, that makes sense, so regardless of what happens in this thread, I'll still be putting a PEP together and asking Guido for his verdict. Heaps of time before 3.5 though... Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 29Mar2014 01:16, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Please get this right. Zeroes. Two 'e's. Otherwise I will feel pain every time I encounter zeros. Looks Greek. -- Cameron Simpson <cs@zip.com.au> Yes, we plot no less than the destruction of the West. Just the other day a friend and I came up with the most pernicious academic scheme to date for toppling the West: He will kneel behind the West on all fours. I will push it backwards over him. - Michael Berube

On Fri, Mar 28, 2014 at 2:59 PM, Cameron Simpson <cs@zip.com.au> wrote:
zeros is way more popular than zeroes: https://books.google.com/ngrams/graph?content=zeroes%2Czeros&year_start=1900&year_end=2013&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Czeroes%3B%2Cc0%3B.t1%3B%2Czeros%3B%2Cc0

On 29 March 2014 17:52, Gregory P. Smith <greg@krypto.org> wrote:
This subthread is confirming my original instinct to just avoid the problem entirely because *either* choice is inevitably confusing for a non-trivial subset of users :) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Nick Coghlan wrote:
How about just calling it "zero"? There's really no need for it to be plural. After all, we refer to int(0) as just "zero", even though there may be more than one zero bit in its representation. I really don't like "allnull", it just sounds wrong, too ascii-centric. -- Greg

On 30Mar2014 10:31, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Unfortunately, "zero" reads like a verb. One might zero an array, for example. As a noun it would imply exactly one. It's a good idea though; I guess I'm -0 purely for verb/noun reasons. If numpy already has zeros() with a very similar meaning I would live with the misspelling for the sake of consistency. If the numpy zeros does something different then it would carry no weight with me. Hoping for something more betterer, -- Cameron Simpson <cs@zip.com.au> For those who understand, NO explanation is needed, for those who don't understand, NO explanation will be given! - Davey D <decoster@vnet.ibm.com>

Cameron Simpson wrote:
Please get this right. Zeroes. Two 'e's.
Otherwise I will feel pain every time I encounter zeros. Looks Greek.
I sympathi{s,z}e, but I feel that having it inconsistent with numpy would cause more pain overall. Ah, I know -- Python should treat identifiers ending in "-os" and "-oes" as equivalent! That would solve all problems of this kind once and for all. "import oes"-ly y'rs, Greg

On Mar 28, 2014 5:01 AM, "Nick Coghlan" <ncoghlan@gmail.com> wrote:
However, it reads better, which trumps the possible minor spelling inconvenience. (all else equal, readability trumps write-ability, etc.)
Fair point, though I also agree with Steven here. Could the replacement (b'\0'*x) be a doc note instead of a new method? -eric

On 2014-03-28, Steven D'Aprano wrote:
I think numpy uses 'zeros' so we should use that.
It's odd, IMHO. When looking to implement %-interpolation for bytes I did some searching to see how widely it is used. There are a few uses in the standard library but it really should have been a special constructor. Accidental uses probably outnumber legitimate ones, quite a bad API design. Neil

On Fri, Mar 28, 2014 at 09:21:15AM -0600, Neil Schemenauer wrote:
Why would we want to duplicate their spelling error? *wink* I think I'd rather Nick's original suggestion allnull than "zeros". "Zeros" sounds like it ought to be the name of a Greek singer :-) I'm aware that zeros/zeroes are both considered acceptable variant spellings. Not acceptable to me *wink* -- Steven

On Fri, 28 Mar 2014 20:27:33 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
Which other bytes interfaces are you talking about?
You mean bytes.chr(x) is equivalent to bytes([x]). The intent is slightly more obvious indeed, so I'd be inclined to be +0, but Python isn't really a worse language if it doesn't have that alternative spelling.
I don't like it, but I also don't think it's enough of a nuisance to be deprecated. (the indexing behaviour of bytes objects is far more annoying) Regards Antoine.

On 28 March 2014 21:22, Antoine Pitrou <solipsis@pitrou.net> wrote:
The ones where a length 1 bytes object and the corresponding integer are interchangeable. For example, containment testing:
Compare:
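A minimal illustration of that contrast, using the same values as the examples in the draft PEP further down the thread. It only exercises current Python 3 behaviour (3.3 or later for the integer containment test):

```python
data = bytes([1, 2, 3])

# Containment testing treats a length 1 bytes object and the
# corresponding integer as interchangeable.
bytes_in_data = b"\x03" in data
int_in_data = 3 in data

# The constructor does not: a length 1 bytes object is copied,
# while the integer is interpreted as a length.
from_bytes_obj = bytes(b"\x03")   # one byte with value 3
from_integer = bytes(3)           # three zero bytes
```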
That's the inconsistency that elevates the current constructor behaviour from weird to actively wrong for me - it doesn't match the way other bytes interfaces have evolved over the course of the Python 3 series.
Oops, I forgot to explain the context where this idea came up: I was trying to figure out how to iterate over a bytes object or wrap an indexing operation to get a length 1 byte sequence rather than an integer.

Currently: probably muck about with lambda or a comprehension

With this change (using Steven D'Aprano's suggested name):

    for x in map(bytes.byte, data):
        # x is a length 1 bytes object, not an int

    x = bytes.byte(data[0]) # ditto

bytes.byte could actually apply the same policy as some other APIs and also accept ASCII text code points in addition to length 1 bytes objects and integers below 256.

Since changing the iteration and indexing behaviour of bytes and bytearray within the Python 3 series isn't feasible, this idea is about making the current behaviour easier to deal with.

And yes, this is definitely going to need a PEP :)

Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
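For reference, the "lambda or a comprehension" workarounds mentioned in the message above might look like the following. This is only a sketch of what already works today; the variable names are illustrative and not taken from the thread:

```python
data = b"abc"

# Workaround 1: wrap each integer in a list before calling bytes().
via_lambda = list(map(lambda i: bytes([i]), data))

# Workaround 2: a comprehension over length 1 slices.
via_slices = [data[i:i + 1] for i in range(len(data))]

# Indexing yields an int, while a length 1 slice yields a bytes object.
first_as_int = data[0]
first_as_bytes = data[0:1]
```

Both workarounds produce the same list of length 1 bytes objects; the proposed `bytes.byte` would simply give the first one a name.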

On Fri, Mar 28, 2014, at 13:54, Benjamin Peterson wrote:
Should this be purely an iteration function, or something that can also index and slice [into objects that can subsequently be iterated or indexed]? What does six do about that?

2014-03-28 18:17 GMT+01:00 Guido van Rossum <guido@python.org>:
even if some people don’t like it, iterating over bytes objects makes perfect sense. in every statically typed language, a byte is most commonly represented as an integer in the range [0,255]. the annoyance has two sources:

1. python 2 byte strings (aka “native strings”) behaved like that (and unicode strings behave like that because python has no char type)
2. byte strings are represented like strings, including ascii-compatible parts as if they were ASCII text.

therefore, people think single bytes would be something akin to chars instead of simply integers from 0 to 255. b'a'[0] == 97 looks strange, but let’s not forget that it’s actually b'\x61'[0] == 0x61

On Fri, 28 Mar 2014 18:31:58 +0100 "Philipp A." <flying-sheep@web.de> wrote:
The actual source isn't merely cultural. It's that most of the time, bytes objects are used as containers of arbitrary binary data, not as arrays of integers (on which you would do element-wise arithmetic calculations, for instance). So, even if the current behaviour makes sense, it's not optimal for the common uses of bytes objects. Regards Antoine.

On Fri, Mar 28, 2014 at 12:42 PM, Antoine Pitrou <solipsis@pitrou.net>wrote:
I don't see bytes as integers, but as representations of integers, that typically come in a stream or array. Using an integer to represent that byte sometimes makes sense, but sometimes you wish to refer to the representation of the integer (the byte), not the integer itself.

Nick Coghlan writes:
+1

~ 13:49$ python3.3
Python 3.3.5 (default, Mar 9 2014, 08:10:50)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
The behavior of str(3) is not exactly an argument *for* your claim, but it's not an argument against since there's no reason to expect a byte to have an ASCII interpretation in general.
bytes.chr(accidental_large_integer) presumably raises. I don't really see a bigger problem with the current behavior than that (in the accidentally large case). Both are likely to distress users.
No. bytearray I'd have to think about.
Does the above proposal sound like a reasonable suggestion for improvement in 3.5?
Yes.
Does this hit PEP territory, since it's changing the signature and API of a builtin?
No opinion. Steve

On 28 March 2014 10:27, Nick Coghlan <ncoghlan@gmail.com> wrote:
I hate it, I'd love to see a change. It bit me when I wrote my first Python-3-only project, and while it wasn't hard to debug it was utterly baffling for quite a while. Writing bytes([integer]) is pretty ugly, and it makes me sad each time I see it. I've even commented occurrences of it to remind myself that this is really the way I meant to do it.
I'd like to see it improved, even if I can't take advantage of that improvement for quite some time. Sounds like a PEP to me.

On 03/28/2014 06:27 AM, Nick Coghlan wrote:
I'd expect to use it this way... bytes.chr('3') Which wouldn't be correct. We have this...
>>> (3).to_bytes(1, "little")
b'\x03'
OR...
Both of those seem like way too much work to me. Bytes is a container object, so it makes sense to do...

    bytes([3])

The other interface you're suggesting is more along the lines of converting an object to bytes. That would be a nice feature if it can be made to work with more objects, and then convert back again. (even if only a few types were supported)

    b_three = bytes.from_obj(3)
    three = b_three.to_obj(int)

That wouldn't be limited to values between 0 and 255 like the hex version above. (it doesn't need a length, or byteorder args.)
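The `from_obj`/`to_obj` pair above is hypothetical. For integers specifically, a round trip is already possible with the existing `int.to_bytes` and `int.from_bytes` methods (Python 3.2+), at the cost of spelling out the length and byte order. A small sketch:

```python
# Integer -> bytes: length and byte order must be given explicitly.
one_byte = (3).to_bytes(1, "little")
two_bytes = (1000).to_bytes(2, "little")

# bytes -> integer: the reverse conversion.
three = int.from_bytes(one_byte, "little")
thousand = int.from_bytes(two_bytes, "little")

# A value that does not fit in the requested length is rejected.
try:
    (256).to_bytes(1, "little")
except OverflowError:
    too_big_rejected = True
else:
    too_big_rejected = False
```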
Numpy uses zeros and shape...
(Without the e.) I'm not sure if it's a good idea to use zeros with a different signature. It has been mentioned that multidimensional slicing may be added to Python, so it may be a good idea to follow numpy's usage.
Agree Cheers, Ron

On Fri, Mar 28, 2014 at 9:41 AM, Ron Adam <ron3200@gmail.com> wrote:
np.zeros() can be called with an integer (length) argument as well:
>>> np.zeros(3)
array([ 0., 0., 0.])
To be completely analogous to the proposed bytes constructor, you would have to specify the data type, though:
>>> np.zeros(3, 'B')
array([0, 0, 0], dtype=uint8)
+1 for bytes.zeros(n)

Great idea Nick. If I may dip my brush in some paint buckets. On Mar 28, 2014, at 08:27 PM, Nick Coghlan wrote:
I agree with Steven that bytes.byte() is a better spelling.
I like bytearray.fill() for this. The first argument would be the fill count, but it could take an optional second argument for the byte value to fill it with, which would of course default to zero. E.g.
+1
Does the above proposal sound like a reasonable suggestion for improvement in 3.5?
Very much so.
Does this hit PEP territory, since it's changing the signature and API of a builtin?
I don't much care either way. A PEP is not *wrong* (especially if we all start painting), but I think a tracker issue would be fine too. Cheers, -Barry

On Mar 28, 2014, at 07:31 AM, Ethan Furman wrote:
You mean like http://bugs.python.org/issue20895 ? :)
Step *away* from the time machine. -Barry

On 28/03/2014 14:28, Barry Warsaw wrote:
I was under the impression that Ethan Furman had raised an issue, or at least commented on one, but I couldn't find such a thing, am I simply mistaken? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence

On Fri Mar 28 2014 at 10:29:36 AM, Barry Warsaw <barry@python.org> wrote:
If we are going to add a classmethod, then yes, otherwise just stick with bytes([x]).
I agree that if we are going to keep a classmethod to do this it should be like str.zfill(). Otherwise just rely on the multiply operator.
+1
PEP wouldn't hurt since it's a core type, but I'm not going to scream for it if Guido just says "fine" at the language summit or something (should we discuss this there?). Overall I think I'm +1 on the "simplify the API" by dropping the constructor issue and relying on the multiply operator instead of coming up with a new method. Same with bytes([x]) if you need to convert a single int. I'm also +1 on Benjamin's idea of the iterate-as-bytes addition. That just leaves the inability to index into bytes and get back a single byte as the biggest annoyance in the bytes API.

The issue remains that there is no single byte type, but what is returned is bytes type over which, again, you can iterate. This is different from a tuple of integers: when you iterate over that, you get integers, not length-one tuples of integers. The issue, I think, originates from the string concept where iteration also returns a string, not a character - which python does not have. A logical extension would be to have character and byte types that are returned when iterating over strings and bytes. This is like taking a giraffe out of a zoo and still calling the result a (small) zoo. Or calling a can of tuna a grocery store.

On 03/29/2014 01:59 PM, Alexander Heger wrote:
Nice examples. The problem is that one *can* say: b'a' and get a 'bytes' object of length one, and then quite naturally one would then try something like b'abc'[0] == b'a' and that will fail, even though it *looks* correct. -- ~Ethan~
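The trap described above can be checked directly. The sketch below only uses existing behaviour; the variable names are illustrative:

```python
data = b"abc"

# Looks correct, but compares the integer 97 with a bytes object.
looks_right = data[0] == b"a"

# Either slice to get a length 1 bytes object, or compare with an int.
slice_works = data[0:1] == b"a"
int_works = data[0] == ord("a")
```

Note that the first comparison silently evaluates to `False` rather than raising, which is what makes it easy to miss.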

On 30 March 2014 07:09, Ethan Furman <ethan@stoneleaf.us> wrote:
FWIW, this is part of why Python 3 *didn't* originally have a bytes literal - the affordances of the str-like literal syntax are wrong for a tuple-of-integers type. We only added it back (pre 3.0) because people wanted it for working with wire protocols that contain ASCII segments (and I think that was a reasonable decision, even though it has contributed directly to people several years later still expecting them to behave more like Python 2 strings) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

28.03.14 12:27, Nick Coghlan написав(ла):
AFAIK currently the fastest way is:

    packB = struct.Struct('B').pack

But lambda x: bytes([x]) looks most obvious.
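Both spellings mentioned above give the same result for values in range(256). The sketch below only checks equivalence and makes no claim about relative speed; `pack_byte` and `obvious` are illustrative names:

```python
import struct

# Precompiled packer for a single unsigned char ("B").
pack_byte = struct.Struct("B").pack

# The more obvious spelling.
obvious = lambda x: bytes([x])

same_result = all(pack_byte(i) == obvious(i) for i in range(256))

# Out-of-range values are rejected, with different exception types.
try:
    pack_byte(256)
except struct.error:
    struct_rejects = True
else:
    struct_rejects = False
```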
-0. This method is not really needed and is not worth the language complication. bytes([x]) works pretty well in most cases.
-1. b'\0' * x does this.
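The equivalence being relied on here, as a quick check against current behaviour (`x` is an arbitrary illustrative length):

```python
x = 4

# Sequence repetition and the integer constructor give the same result.
via_repetition = b"\0" * x
via_constructor = bytes(x)

# The same holds for the mutable type.
mutable_via_repetition = bytearray(b"\0") * x
mutable_via_constructor = bytearray(x)
```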
+0.

On Sat, Mar 29, 2014 at 9:49 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
I agree they didn't need to be. But that's already happened. Consistency *is* convenient when reading and re-factoring code. Regardless, the real issue is that it is easy to inadvertently pass an integer to the bytes() constructor [and thus the bytearray() constructor] when you really intended to pass a single element sequence.

+1 Deprecating the single integer argument to the constructor in 3.5 (DeprecationWarning) to be considered for removal in 3.6 or 3.7 sounds like a good idea.

+1 At the same time, adding class methods: bytes.byte(97) and bytearray.byte(0x65) sound great. Those are *always explicit* about the programmer's intent rather than being an overloaded constructor that does different things based on the type passed in which is more prone to bugs.

+1 that bytearray() does need a constructor that pre-allocates a bunch of empty (zero) space as that is a common need for a mutable type. That should be .zfill(n) for consistency. Filling with other values is way less common and doesn't deserve a .fill(n, value) method with potentially ambiguous parameters (which one is the count and which one is the value again? that'll be a continual question just as it is for C's memset and similar functions).

-0 on the idea of a .zeros(n), .zeroes(n), .fill(n, value) or .zfill(n) class methods for the bytes() type. That is fine written as bytes.byte(0) * n as it is expected to be an uncommon operation. But if you want to add it for consistency, fine by me, change the sign of my preference. :)

I don't think this is worthy of a PEP but won't object if you go that route.

-gps

On 30 March 2014 04:03, Gregory P. Smith <greg@krypto.org> wrote:
We can't use .zfill(), as that is already used for the same purposes that it is used for with str and bytes objects (i.e. an ASCII zero-fill). I'm currently leaning towards the more explicit "from_len()" (with the fill value being optional, and defaulting to zero).
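To make the name clash concrete: the existing `zfill()` method pads with the ASCII digit "0", not with NUL bytes. The `from_len` function below is a hypothetical module-level stand-in for the class method under discussion, written under the assumption that it behaves like sequence repetition:

```python
# Existing behaviour: zfill() pads with ASCII b"0", after any sign.
padded = b"42".zfill(5)
padded_signed = bytearray(b"-7").zfill(4)


def from_len(length, value=0):
    """Hypothetical stand-in for the proposed bytearray.from_len().

    Returns a bytearray of `length` bytes, each set to `value`
    (zero by default).
    """
    return bytearray([value]) * length


zeroed = from_len(3)
sixes = from_len(3, 6)
```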
I already have a draft PEP written that covers the constructor issue, iteration and adding acceptance of integer inputs to the remaining methods that don't currently handle them. There was some background explanation of the text/binary domain split in the Python 2->3 transition that I wanted Guido's feedback on before posting, but I just realised I can cut that out for now, and then add it back after Guido has had a chance to review it. So I'll tidy that up and get the draft posted later today. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 30 March 2014 07:07, Nick Coghlan <ncoghlan@gmail.com> wrote:
Guido pointed out most of the stuff I had asked him to look at wasn't actually relevant to the PEP, so I just cut most of it entirely. Suffice to say, after stepping back and reviewing them systematically for the first time in years, I believe the APIs for the core binary data types in Python 3 could do with a little sprucing up :)

Web version: http://www.python.org/dev/peps/pep-0467/

======================================
PEP: 467
Title: Improved API consistency for bytes and bytearray
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan <ncoghlan@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2014-03-30
Python-Version: 3.5
Post-History: 2014-03-30

Abstract
========

During the initial development of the Python 3 language specification, the core ``bytes`` type for arbitrary binary data started as the mutable type that is now referred to as ``bytearray``. Other aspects of operating in the binary domain in Python have also evolved over the course of the Python 3 series.

This PEP proposes a number of small adjustments to the APIs of the ``bytes`` and ``bytearray`` types to make their behaviour more internally consistent and to make it easier to operate entirely in the binary domain for use cases that actually involve manipulating binary data directly, rather than converting it to a more structured form with additional modelling semantics (such as ``str``) and then converting back to binary format after processing.

Background
==========

Over the course of Python 3's evolution, a number of adjustments have been made to the core ``bytes`` and ``bytearray`` types as additional practical experience was gained with using them in code beyond the Python 3 standard library and test suite. However, to date, these changes have been made on a relatively ad hoc tactical basis as specific issues were identified, rather than as part of a systematic review of the APIs of these types.
This approach has allowed inconsistencies to creep into the API design as to which input types are accepted by different methods. Additional inconsistencies linger from an earlier pre-release design where there was *no* separate ``bytearray`` type, and instead the core ``bytes`` type was mutable (with no immutable counterpart), as well as from the origins of these types in the text-like behaviour of the Python 2 ``str`` type.

This PEP aims to provide the missing systematic review, with the goal of ensuring that wherever feasible (given backwards compatibility constraints) these current inconsistencies are addressed for the Python 3.5 release.

Proposals
=========

As a "consistency improvement" proposal, this PEP is actually about a number of smaller micro-proposals, each aimed at improving the self-consistency of the binary data model in Python 3.

Proposals are motivated by one of three factors:

* removing remnants of the original design of ``bytes`` as a mutable type
* more consistently accepting length 1 ``bytes`` objects as input where an integer between ``0`` and ``255`` inclusive is expected, and vice-versa
* allowing users to easily convert integer output to a length 1 ``bytes`` object

Alternate Constructors
----------------------

The ``bytes`` and ``bytearray`` constructors currently accept an integer argument, but interpret it to mean a zero-filled object of the given length. This is a legacy of the original design of ``bytes`` as a mutable type, rather than a particularly intuitive behaviour for users. It has become especially confusing now that other ``bytes`` interfaces treat integers and the corresponding length 1 bytes instances as equivalent input.
Compare::

    >>> b"\x03" in bytes([1, 2, 3])
    True
    >>> 3 in bytes([1, 2, 3])
    True

    >>> bytes(b"\x03")
    b'\x03'
    >>> bytes(3)
    b'\x00\x00\x00'

This PEP proposes that the current handling of integers in the bytes and bytearray constructors be deprecated in Python 3.5 and removed in Python 3.6, being replaced by two more type appropriate alternate constructors provided as class methods. The initial python-ideas thread [ideas-thread1]_ that spawned this PEP was specifically aimed at deprecating this constructor behaviour.

For ``bytes``, a ``byte`` constructor is proposed that converts integers (as indicated by ``operator.index``) in the appropriate range to a ``bytes`` object, converts objects that support the buffer API to bytes, and also passes through length 1 byte strings unchanged::

    >>> bytes.byte(3)
    b'\x03'
    >>> bytes.byte(bytearray(bytes([3])))
    b'\x03'
    >>> bytes.byte(memoryview(bytes([3])))
    b'\x03'
    >>> bytes.byte(bytes([3]))
    b'\x03'
    >>> bytes.byte(512)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: bytes must be in range(0, 256)
    >>> bytes.byte(b"ab")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: bytes.byte() expected a byte, but buffer of length 2 found

One specific use case for this alternate constructor is to easily convert the result of indexing operations on ``bytes`` and other binary sequences from an integer to a ``bytes`` object. The documentation for this API should note that its counterpart for the reverse conversion is ``ord()``.
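The semantics described for ``bytes.byte`` can be approximated in pure Python. The function below is a hypothetical sketch, not the PEP's implementation: the name `byte`, the fallback order (integer first, then buffer) and the exact error message are assumptions based on the examples above.

```python
import operator


def byte(x):
    """Hypothetical stand-in for the proposed bytes.byte(x)."""
    try:
        value = operator.index(x)
    except TypeError:
        # Not an integer: fall back to the buffer protocol.
        # memoryview() raises TypeError for non-buffer objects.
        buf = bytes(memoryview(x))
        if len(buf) != 1:
            raise TypeError(
                "byte() expected a byte, but buffer of length %d found"
                % len(buf)
            )
        return buf
    # bytes([value]) raises ValueError outside range(0, 256).
    return bytes([value])
```

With this in place, `map(byte, data)` iterates over any container of integers in range(256) as length 1 bytes objects, which is the use the PEP has in mind.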
For ``bytearray``, a ``from_len`` constructor is proposed that preallocates the buffer filled with a particular value (default to ``0``) as a direct replacement for the current constructor behaviour, rather than having to use sequence repetition to achieve the same effect in a less intuitive way::

    >>> bytearray.from_len(3)
    bytearray(b'\x00\x00\x00')
    >>> bytearray.from_len(3, 6)
    bytearray(b'\x06\x06\x06')

This part of the proposal was covered by an existing issue [empty-buffer-issue]_ and a variety of names have been proposed (``empty_buffer``, ``zeros``, ``zeroes``, ``allnull``, ``fill``). The specific name currently proposed was chosen by analogy with ``dict.fromkeys()`` and ``itertools.chain.from_iterable()`` to be completely explicit that it is an alternate constructor rather than an in-place mutation, as well as how it differs from the standard constructor.

Open questions
^^^^^^^^^^^^^^

* Should ``bytearray.byte()`` also be added? Or is ``bytearray(bytes.byte(x))`` sufficient for that case?
* Should ``bytes.from_len()`` also be added? Or is sequence repetition sufficient for that case?
* Should ``bytearray.from_len()`` use a different name?
* Should ``bytes.byte()`` raise ``TypeError`` or ``ValueError`` for binary sequences with more than one element? The ``TypeError`` currently proposed is copied (with slightly improved wording) from the behaviour of ``ord()`` with sequences containing more than one code point, while ``ValueError`` would be more consistent with the existing handling of out-of-range integer values.
* ``bytes.byte()`` is defined above as accepting length 1 binary sequences as individual bytes, but this is currently inconsistent with the main ``bytes`` constructor::

      >>> bytes([b"a", b"b", b"c"])
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      TypeError: 'bytes' object cannot be interpreted as an integer

  Should the ``bytes`` constructor be changed to accept iterables of length 1 bytes objects in addition to iterables of integers? If so, should it allow a mixture of the two in a single iterable?

Iteration
---------

Iteration over ``bytes`` objects and other binary sequences produces integers. Rather than proposing a new method that would need to be added not only to ``bytes``, ``bytearray`` and ``memoryview``, but potentially to third party types as well, this PEP proposes that iteration to produce length 1 ``bytes`` objects instead be handled by combining ``map`` with the new ``bytes.byte()`` alternate constructor proposed above::

    for x in map(bytes.byte, data):
        # x is a length 1 ``bytes`` object, rather than an integer
        # This works with *any* container of integers in the range
        # 0 to 255 inclusive

Consistent support for different input types
--------------------------------------------

In Python 3.3, the binary search operations (``in``, ``count()``, ``find()``, ``index()``, ``rfind()`` and ``rindex()``) were updated to accept integers in the range 0 to 255 (inclusive) as their first argument (in addition to the existing support for binary sequences).

This PEP proposes extending that behaviour of accepting integers as being equivalent to the corresponding length 1 binary sequence to several other ``bytes`` and ``bytearray`` methods that currently expect a ``bytes`` object for certain parameters.
In essence, if a value is an acceptable input to the new ``bytes.byte`` constructor defined above, then it would be acceptable in the roles defined here (in addition to any other already supported inputs):

* ``startswith()`` prefix(es)
* ``endswith()`` suffix(es)
* ``center()`` fill character
* ``ljust()`` fill character
* ``rjust()`` fill character
* ``strip()`` character to strip
* ``lstrip()`` character to strip
* ``rstrip()`` character to strip
* ``partition()`` separator argument
* ``rpartition()`` separator argument
* ``split()`` separator argument
* ``rsplit()`` separator argument
* ``replace()`` old value and new value

In addition to the consistency motive, this approach also makes it easier to work with the indexing behaviour, as the result of an indexing operation can more easily be fed back in to other methods.

For ``bytearray``, some additional changes are proposed to the current integer based operations to ensure they remain consistent with the proposed constructor changes:

* ``append()``: updated to be consistent with ``bytes.byte()``
* ``remove()``: updated to be consistent with ``bytes.byte()``
* ``+=``: updated to be consistent with ``bytes()`` changes (if any)
* ``extend()``: updated to be consistent with ``bytes()`` changes (if any)

Acknowledgement of surprising behaviour of some ``bytearray`` methods
---------------------------------------------------------------------

Several of the ``bytes`` and ``bytearray`` methods have their origins in the Python 2 ``str`` API. As ``str`` is an immutable type, all of these operations are defined as returning a *new* instance, rather than operating in place. This contrasts with methods on other mutable types like ``list``, where ``list.sort()`` and ``list.reverse()`` operate in-place and return ``None``, rather than creating a new object.
Backwards compatibility constraints make it impractical to change this behaviour at this point, but it may be appropriate to explicitly call out this quirk in the documentation for the ``bytearray`` type. It affects the following methods that could reasonably be expected to operate in-place on a mutable type:

* ``center()``
* ``ljust()``
* ``rjust()``
* ``strip()``
* ``lstrip()``
* ``rstrip()``
* ``replace()``
* ``lower()``
* ``upper()``
* ``swapcase()``
* ``title()``
* ``capitalize()``
* ``translate()``
* ``expandtabs()``
* ``zfill()``

Note that the following ``bytearray`` operations *do* operate in place, as they're part of the mutable sequence API in ``bytearray``, rather than being inspired by the immutable Python 2 ``str`` API:

* ``+=``
* ``append()``
* ``extend()``
* ``reverse()``
* ``remove()``
* ``pop()``

References
==========

.. [ideas-thread1] https://mail.python.org/pipermail/python-ideas/2014-March/027295.html
.. [empty-buffer-issue] http://bugs.python.org/issue20895

Copyright
=========

This document has been placed in the public domain.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
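The distinction drawn in the PEP's final section, between str-derived methods that return a new object and mutable sequence methods that work in place, can be seen with current behaviour. A short sketch with illustrative variable names:

```python
buf = bytearray(b"abc")

# str-derived method: returns a new bytearray, original is untouched.
upper = buf.upper()
unchanged_after_upper = bytes(buf)

# Mutable sequence method: works in place and returns None.
reverse_result = buf.reverse()
changed_after_reverse = bytes(buf)
```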

On Sat, Mar 29, 2014 at 7:17 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
I prefer keeping them consistent across the types myself.
* Should ``bytearray.from_len()`` use a different name?
This name works for me.
I don't like that bytes.byte() would accept anything other than an int. It should not accept length 1 binary sequences at all. I'd prefer to see bytes.byte(b"X") raise a TypeError.
Where was a change to += behavior mentioned? I don't see that above (or did I miss something?).

On 30 March 2014 16:10, Gregory P. Smith <greg@krypto.org> wrote:
Unfortunately, it's not that simple, because accepting both is the only way I see of rendering the current APIs coherent. The problem is that the str-derived APIs expect bytes objects, the bytearray mutating methods expect integers, and in Python 3.3, the substring search APIs were updated to accept both. This means we currently have:
Since some APIs work one way, some work the other, the only backwards compatible path I see to consistency is to always treat a length 1 byte string as an acceptable input for the APIs that currently accept an integer and vice-versa. That said, I think this hybrid nature accurately reflects the fact that indexing and slicing bytes objects in Python 3 return different types - the individual elements are integers, but the subsequences are bytes objects, and several of these APIs are either "element-or-subsequence" APIs (in which case they should accept both), or else they *should* have been element APIs, but currently expect a subsequence due to their Python 2 str heritage. If we had the opportunity to redesign these APIs from scratch, we'd likely make a much clearer distinction between element based APIs (that would use integers) and subsequence APIs (that would accept buffer implementing objects). As it is, I think the situation is inherently ambiguous, and providing hybrid APIs to help deal with that ambiguity is our best available option.
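The split described above can be seen in a short sketch. This is an editorial illustration of behaviour on current CPython 3 releases, not the set of examples from the original message; the variable names are arbitrary.

```python
data = bytes([1, 2, 3])

# Indexing yields an integer element; slicing yields a bytes subsequence.
element = data[0]
subsequence = data[0:1]

# The search APIs (updated in Python 3.3) accept both forms.
found_int = 3 in data
found_bytes = b"\x03" in data
index_from_int = data.find(3)

# The str-derived APIs still insist on bytes-like input.
try:
    data.replace(3, 4)
    replace_accepts_int = True
except TypeError:
    replace_accepts_int = False

# The bytearray mutable-sequence APIs insist on integers.
buf = bytearray()
buf.append(3)
try:
    buf.append(b"\x03")
    append_accepts_bytes = True
except TypeError:
    append_accepts_bytes = False
```

Here ``replace()`` rejects the integer and ``append()`` rejects the length 1 bytes object, while ``in`` and ``find()`` accept either.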
It was an open question against the constructors - if bytes.byte() is defined as the PEP suggests, then the case can be made that the iterables accepted by the bytes() constructor should also be made more permissive in terms of the contents of the iterables it accepts. If *that* happens, then extending an existing bytearray should also become more permissive. Note that I'm not sold on actually changing that - that's why it's an open question, rather than something the PEP is currently proposing. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sat, Mar 29, 2014 at 11:31 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Okay I see where you're going with this. So long as we limit this to APIs specifically surrounding bytes and bytearray I guess I'm "fine" with it (given the status quo and existing mess of never knowing which behaviors will be allowed where). Other APIs that accept numbers outside of bytes() and bytearray() related methods should *never* accept a bytes() of any length as valid numeric input. Thanks for exploring all of the APIs, we do have quite a mess that can be made better.

On 30 March 2014 17:03, Gregory P. Smith <greg@krypto.org> wrote:
Yeah, agreed. I've also reworked the relevant section of the PEP to give the examples I posted in my reply to you - I think they do a good job of showing why the current behaviour is problematic, and why "accept both" is the most plausible backwards compatible remedy available to us.
Thanks for exploring all of the APIs, we do have quite a mess that can be made better.
Thank Brandon Rhodes, too - he recently pointed me to bytes.replace() not accepting integers as a specific example of the current behaviour being confusing for users, and after I started down that rabbithole... well, this thread and the PEP were the end result :) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sat, Mar 29, 2014 at 7:17 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Thanks for cutting it down, it's easier to concentrate on the essentials now.
I hope you don't mind I cut the last 60% of this sentence (everything after "binary domain").
I'm not sure you can claim that. We probably have more information based on experience now than when we did the redesign. (At that time most experience was based on using str() for binary data.)
You make it sound as if modeling bytes() after Python 2's str() was an accident. It wasn't.
I would like to convince you to aim lower, drop the "systematic review", and just focus on some changes that are likely to improve users' experience (which includes porting Python 2 code).
Yes. * more consistently accepting length 1 ``bytes`` objects as input where an
integer between ``0`` and ``255`` inclusive is expected, and vice-versa
Not sure I like this as a goal. OK, stronger: I don't like this goal.
* allowing users to easily convert integer output to a length 1 ``bytes`` object
I think you meant integer values instead of output? In Python 2 we did this with the global function chr(), but in Python 3 that creates a str(). (The history of chr() and ord() as built-in functions is that they long predate the notion of methods (class- or otherwise), and their naming comes straight from Pascal.) Anyway, I don't know that the use case is so common that it needs more than bytes([i]) or bytearray([i]) -- if there is an argument to be made for bytes.byte(i) and bytearray.byte(i) it would be that the [i] in the constructor is somewhat hard to grasp.
This is one of the two legacies of the original "mutable bytes" design, and I agree we should strive to replace it -- although I think one round of deprecation may be too quick. (The other legacy is of course that b[i] is an int, not a bytes -- it's the worse problem, but I don't think we can fix it without breaking more than the fix would be worth.)
I know why you reference this, but it feels confusing to me. At this point in the narrative it's better to just say "integer" and explain how it decides "integer-ness" later.
I think the second half (accepting bytes instances of length 1) is wrong here and doesn't actually have a practical use case. I'll say more below.
However, in a pinch, b[0] will do as well, assuming you don't need the length check implied by ord().
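A minimal sketch of the difference between the two spellings, using only existing built-in behaviour (the variable names are illustrative):

```python
single = b"a"
double = b"ab"

# Both spellings give the integer value of a length 1 bytes object.
via_ord = ord(single)
via_index = single[0]

# Indexing silently ignores any trailing bytes...
first_of_two = double[0]

# ...while ord() checks that exactly one byte was supplied.
try:
    ord(double)
    ord_rejected = False
except TypeError:
    ord_rejected = True
```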
I think you need to brainstorm more on the name; from_len() looks pretty awkward. And I think it's better to add it to bytes() as well, since the two classes intentionally try to be as similar as possible.
It should be added.
* Should ``bytes.from_len()`` also be added? Or is sequence repetition sufficient for that case?
It should be added.
* Should ``bytearray.from_len()`` use a different name?
Yes.
It should not accept any bytes arguments. But if somehow you convince me otherwise, it should be ValueError (and honestly, ord() is wrong there).
Noooooooooooooooooooooooooo!!!!!
I can see why you don't like a new method, but this idiom is way too verbose and unintuitive to ever gain traction. Let's just add a new method to all three types, 3rd party types will get the message.
I wonder if that wasn't a bit over-zealous. While 'in', count() and index() are sequence methods (looking for elements) that have an extended meaning (looking for substrings) for string types, the find() and r*() variants are only defined for strings.
I think herein lies madness. The intention seems to be to paper over as much as possible the unfortunate behavior of b[i]. But how often does any of these methods get called with such a construct? And how often will that be in a context where this is the *only* thing that is affected by b[i] returning an int in Python 3 but a string in Python 2? (In my experience these are mostly called with literal arguments, except inside wrapper functions that are themselves intended to be called with a literal argument.) Weakening the type checking here seems a bad idea -- it would accept integers in *any* context, and that would just cause more nasty debugging issues.
Eew again. These are operations from the MutableSequence ABC and there is no reason to make their signatures fuzzier.
You make it sound as if this is a bad thing or an accident.
So does bytestring.reverse(). And if you really insist we can add bytestring.sort(). :-)
That all feels like hypercorrection. These are string methods and it would be completely wrong if bytearray changed them to modify the object in-place. I also don't see why anyone would think these would modify the object, given that everybody encounters these first for the str() type, then for bytes(), then finally (by extension) for bytearray(). The *only* place where there should be any confusion about whether the value is mutated or the variable is updated with a new object would be the += operator (and *=) but that's due to that operator's ambiguity.
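For readers who want to see the distinction concretely, the following sketch contrasts a str-style method with the mutable-sequence operations on a ``bytearray``. It only exercises existing behaviour; the names are illustrative.

```python
buf = bytearray(b"abc")
alias = buf  # second reference to the same object

# str-style method: returns a new bytearray, leaves buf untouched.
upper = buf.upper()
unchanged_after_upper = bytes(buf)

# Mutable-sequence method: modifies buf in place and returns None.
reverse_result = buf.reverse()

# Augmented assignment on a bytearray also mutates in place,
# so the alias still refers to the same (updated) object.
buf += b"!"
same_object = alias is buf
```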
Right. And there's nothing wrong with this.
-- --Guido van Rossum (python.org/~guido)

On 31 March 2014 02:05, Guido van Rossum <guido@python.org> wrote:
No worries - the shorter version is better. I've been spending too much time in recent months explaining the significance of the text model changes to different people, so now I end up trying to explain it any time I write about it :)
Yeah, I was mostly thinking of the change to make the search APIs accept both integers and subsequences when I wrote that. I'll try to think of a better way of wording it. I also realised in reviewing the docs that a key part of the problem may actually be a shortcut I took in the sequence docs rewrite that I did quite a while ago now - the bytes/bytearray docs are currently written in terms of *how they differ from str*. They don't really cover the "container of integers" aspect particularly well. I now believe a better way to tackle that would be to be upfront that these types basically have two APIs on a single class: their core tuple/list of integers "arbitrary binary data" API, and then the str-inspired "binary data with ASCII segments" API on top of that.
Sorry, didn't mean to imply that. More that we hadn't previously sat down and thought through how best to clearly articulate this in the docs, and hence some of the current inconsistencies hadn't become clear.
After re-reading the current docs (which I also wrote, but this aspect was a mere afterthought at the time), I'm thinking a useful course of action in parallel with this PEP will be for me to work on improving the Python 3.4 docs for these types. The bits I find too hard to explain then become fodder for tweaks in 3.5.
Proposals =========
Roger. As noted above, I now think we can address this by splitting the API documentation instead, so that there's a clear "tuple/list of ints" section and a "binary data with ASCII segments" section. Some hybrid APIs (like the search ones) may appear in both. In terms of analogies to other types:

    Behaviour is common to tuple + str: hybrid API for bytes + bytearray
    Behaviour is list-only: int-only API for bytearray
    Behaviour is str-only: str-like only API for bytes + bytearray

Now that I've framed the question that way, I think I can not only make it make sense in the docs, but I believe the 3.4 behaviour is already pretty close to consistent with it. The proposed bytes.byte() constructor would then more cleanly handle the few cases where it may be desirable to pass an int to a str-like API (such as replace()).
Sort of - I was thinking of reversing the effects of indexing here. That is, replacing the current:

    x = data[0:1]

with:

    x = bytes.byte(data[0])
Since I was mostly thinking about an alternative to slicing to convert an index lookup back to a bytes object, this doesn't seem appealing to me:

    x = bytes([data[0]])

The other one is that "bytes([i])" doesn't play nice with higher order functions like map. I don't expect wanting this to be *hugely* common, but I do think there's value in having the primitive conversion operation implied by the constructor behaviour available as a Python level operation.
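The point about ``map`` can be sketched with a small helper. ``to_byte`` is a hypothetical stand-in for the proposed ``bytes.byte()`` (which does not exist at the time of this thread); it assumes integer-only input, checked via ``operator.index``.

```python
import operator


def to_byte(value):
    """Hypothetical helper: convert an int in range(256) to a length 1 bytes object."""
    # operator.index() rejects non-integers such as b"a" with TypeError;
    # bytes([...]) rejects out-of-range integers with ValueError.
    return bytes([operator.index(value)])


data = b"abc"

# A named callable composes directly with map(), unlike the bytes([i]) idiom,
# which needs a lambda or helper function to be passed around.
pieces = list(map(to_byte, data))
```

With the helper in place, ``b"".join(map(to_byte, data))`` round-trips back to the original data.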
Postponing removal to 3.7 or indefinitely is fine by me. While I think it should go away, I'm in no hurry to get rid of it - it started bothering me less once I realised you can already safely call bytes on arbitrary objects by passing them through memoryview first (as that doesn't have the mutable legacy that causes problems with integer input).
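A minimal sketch of the ``memoryview`` workaround mentioned above. The helper name ``bytes_from_buffer`` is hypothetical; note that it also rejects plain iterables of integers, since those do not export the buffer protocol.

```python
def bytes_from_buffer(obj):
    """Hypothetical helper: copy obj into bytes only if it exports a buffer."""
    # memoryview() raises TypeError for integers (and anything else that
    # lacks the buffer protocol), so no zero-filled object is ever allocated.
    return bytes(memoryview(obj))


copied = bytes_from_buffer(bytearray(b"abc"))

# The plain constructor interprets an integer as a length.
zero_filled = bytes(3)

try:
    bytes_from_buffer(3)
    int_rejected = False
except TypeError:
    int_rejected = True
```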
Sounds good (I actually meant to double check that we *do* currently accept arbitrary integer-like objects in the bytes constructor).
Agreed, I now think "the binary equivalent of chr()" would be much better behaviour here.
I initially liked Barry's "fill" suggestion, but then realised it read too much like an in-place operation (at least to my mind). Here are some examples (using Brett's suggestion of a keyword only second parameter):

    bytearray.zeros(3)  # NumPy spelling, no configurable fill value
    bytearray.fill(3)
    bytearray.fill(3, fillvalue=6)
    bytearray.filled(3)
    bytearray.filled(3, fillvalue=6)

To be honest, I'm actually coming around to the "just copy the 'zeros' name from NumPy and call it done" view on this one. I don't have a concrete use case for a custom fill value, and I think I'll learn quickly enough that it uses the shorter spelling.
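For comparison, here is a sketch of what a keyword-only fill value could look like as a plain function built on sequence repetition. ``filled`` is a hypothetical name taken from the candidates above, not an existing ``bytearray`` method, and the negative-length check is an assumption modelled on the current constructor.

```python
def filled(length, *, fillvalue=0):
    """Hypothetical helper: a bytearray of `length` copies of `fillvalue`."""
    if length < 0:
        raise ValueError("negative count")
    # Sequence repetition avoids relying on the integer handling of the
    # bytearray constructor that the PEP proposes to deprecate.
    return bytearray([fillvalue]) * length


zeroed = filled(3)
sixes = filled(3, fillvalue=6)
```

Because ``fillvalue`` is keyword-only, a call such as ``filled(3, 6)`` raises ``TypeError`` rather than leaving the reader to guess what the second integer means.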
Yeah, it bothered me, too :) As you suggest, I think it makes sense to extrapolate this the other way and change the definition of bytes.byte() to be a true inverse of ord() for binary data.
Fair enough. Is "iterbytes()" OK as the name?:

    for x in data.iterbytes():
        # x is a length 1 ``bytes`` object, rather than an integer
        # This works with *any* container of integers in the range
        # 0 to 255 inclusive
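Until such a method exists, the same iteration can be approximated with a free function. This is a sketch only: ``iterbytes`` here is a hypothetical module-level generator, not the proposed method.

```python
def iterbytes(data):
    """Hypothetical helper: yield each element of data as a length 1 bytes object."""
    for value in data:
        # bytes([value]) raises ValueError for integers outside range(256).
        yield bytes([value])


from_bytes = list(iterbytes(b"ab"))
from_list = list(iterbytes([0, 255]))
from_view = list(iterbytes(memoryview(b"ab")))
```

It accepts ``bytes``, ``bytearray``, byte-oriented ``memoryview`` objects, and any other iterable of integers in the range 0 to 255 inclusive.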
I suspect they're using the same underlying search code, although I haven't actually checked.
Yeah, I agree with this now. I was already starting to get "What's the actual use case here?" vibes while writing the PEP, and your reaction makes it easy to change my mind :) "replace()" seems like the only one where a reasonable case might be made to allowing integer input (and that's actually the one Brandon was asking about that got me thinking along these lines in the first place).
Again, not my intention. I think that impression will be easier to avoid once the PEP is recast as treating the issue as primarily a documentation problem, with just a few minor API tweaks.
Yeah, this all becomes substantially *less* surprising once these types are documented as effectively exposing two mostly distinct APIs (their underlying "container of ints" API for arbitrary binary data, and then the additional str-like API for binary data with ASCII compatible segments).

I'm not sure when I'll get the PEP updated (since this isn't an urgent problem, I just wanted to get the initial draft of the PEP written while the problem was fresh in my mind), but I think the end result should be relatively non-controversial once I incorporate your feedback. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Tue, Apr 1, 2014, at 9:26, Nick Coghlan wrote:
I have a mostly unrelated question. Is there a _general_ subsequence search function [that works on lists or even iterables], or is it only available for str and bytes?

On Tue, Apr 01, 2014 at 09:41:40AM -0400, random832@fastmail.us wrote:
There's nothing built-in, but here is a naive version: http://code.activestate.com/recipes/577850-search-sequences-for-sub-sequence... -- Steven
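For illustration, a naive search over arbitrary indexable sequences might look like the sketch below. It is an independent example, not the code from the linked recipe, and ``find_subsequence`` is a hypothetical name.

```python
def find_subsequence(haystack, needle, start=0):
    """Return the lowest index >= start where needle occurs in haystack, or -1.

    Works on any indexable sequences; elements are compared pairwise, so a
    tuple needle can be found inside a list haystack.
    """
    n, m = len(haystack), len(needle)
    for i in range(start, n - m + 1):
        if all(haystack[i + j] == needle[j] for j in range(m)):
            return i
    return -1


hit = find_subsequence([1, 2, 3, 4], [2, 3])
mixed_types = find_subsequence([1, 2, 3], (3,))
miss = find_subsequence([1, 2], [3])
```

The running time is O(len(haystack) * len(needle)) in the worst case, which is fine for short inputs but slower than the specialised search used by ``str`` and ``bytes``.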

Nice come-back! Responses inline. On Tue, Apr 1, 2014 at 6:26 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
[...]
That might have felt ad-hoc, but it was in line with the idea that bytes follow the patterns of both tuple and string. (And similar for bytearray.)
Right, and then you can follow up with details about how the APIs differ from their equivalents in tuple and string. I imagine there are three categories here (some may be empty): APIs that are the union of the corresponding tuple and string APIs; APIs that are like one of the "base" classes with some restrictions or extensions that can't be explained by referring to the other "base"; and APIs that are unique to bytes.
Heh, I get defensive when you say bad things about the language. Not so much about the docs (too often I don't know what's in the docs myself because my knowledge predates the docs :-).
I am very much in favor of this approach. Many API improvements I've made myself have come from attempts to write documentation, and I imagine I'm not unique.
Heh, I think I just accidentally reinvented that same categorization above. :-)
Hm. I don't find that very attractive. You can't write Python 2/3 code using that idiom, and it's a lot longer than the original. The only redeeming feature is that it clearly fails when data is empty, and possibly that you don't have to compute the second index (which could be awkward if the first index is an expression). I'm not denying that we need bytes.byte(), but this doesn't sound like much of a motivation. Just pointing to the need of bytes/bytestring equivalents for chr() makes more sense to me.
Fair enough.
The other one is that "bytes([i])" doesn't play nice with higher order functions like map.
Also fair enough; having to define a helper function feels bad. All in all, I do think we need bytes.byte() and bytearray.byte(). We may just have to fine-tune the motivation a bit. :-)
Yes.
I'm not sure I quite see the use case. memoryview() doesn't take "arbitrary objects" -- it takes objects that implement the buffer protocol (if that's still the name :-). Are you saying that the advantage of going through memoryview() is that it fails fast when you accidentally pass it an integer(-like object)? [...]
+1 [...]
+1
+1
OK, it's water under the bridge anyway. [...]
I think not. It really works on substrings, length-one strings are just a common case. [...]
No hurries. And you're welcome! -- --Guido van Rossum (python.org/~guido)

On Apr 2, 2014, at 7:40 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
I don’t like byte(), way too much potential for confusion with bytes(), but maybe bchr() is a reasonable thing. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On 2 Apr 2014 22:01, "Donald Stufft" <donald@stufft.io> wrote:
There's no need for it to be a builtin at all. The class method alternative constructor approach handles the problem just fine. Cheers, Nick.

02.04.14 14:40, Nick Coghlan написав(ла):
I thought of that, but it seems like a recipe for typos and confusion. bytes.byte and bytearray.byte seem clearer and safer.
bytearray.byte looks deceptive. It returns not a byte, but a 1-element bytearray. I doubt that creating a 1-element bytearray is a common enough case to justify adding a new special method (unlike bytes.byte).

On 4 Apr 2014 05:03, "Serhiy Storchaka" <storchaka@gmail.com> wrote:
I doubt that creating a 1-element bytearray is a common enough case to justify adding a new special method (unlike bytes.byte).

I actually agree, but Guido preferred the greater API consistency. Since I'm only -0 on bytearray.byte, I don't have much motivation to argue about it. Cheers, Nick.

Can we make bytes.from_len's second argument be keyword-only? Passing in two integer literals looks ambiguous if you don't know what the second argument is for.

On Saturday, March 29, 2014 10:18:04 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 30 March 2014 07:07, Nick Coghlan <ncoghlan@gmail.com> wrote:
Guido pointed out most of the stuff I had asked him to look at wasn't actually relevant to the PEP, so I just cut most of it entirely. Suffice to say, after stepping back and reviewing them systematically for the first time in years, I believe the APIs for the core binary data types in Python 3 could do with a little sprucing up :)

Web version: http://www.python.org/dev/peps/pep-0467/

======================================

PEP: 467
Title: Improved API consistency for bytes and bytearray
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan <ncoghlan@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2014-03-30
Python-Version: 3.5
Post-History: 2014-03-30

Abstract
========

During the initial development of the Python 3 language specification, the core ``bytes`` type for arbitrary binary data started as the mutable type that is now referred to as ``bytearray``. Other aspects of operating in the binary domain in Python have also evolved over the course of the Python 3 series.

This PEP proposes a number of small adjustments to the APIs of the ``bytes`` and ``bytearray`` types to make their behaviour more internally consistent and to make it easier to operate entirely in the binary domain for use cases that actually involve manipulating binary data directly, rather than converting it to a more structured form with additional modelling semantics (such as ``str``) and then converting back to binary format after processing.

Background
==========

Over the course of Python 3's evolution, a number of adjustments have been made to the core ``bytes`` and ``bytearray`` types as additional practical experience was gained with using them in code beyond the Python 3 standard library and test suite. However, to date, these changes have been made on a relatively ad hoc tactical basis as specific issues were identified, rather than as part of a systematic review of the APIs of these types. This approach has allowed inconsistencies to creep into the API design as to which input types are accepted by different methods. Additional inconsistencies linger from an earlier pre-release design where there was *no* separate ``bytearray`` type, and instead the core ``bytes`` type was mutable (with no immutable counterpart), as well as from the origins of these types in the text-like behaviour of the Python 2 ``str`` type.

This PEP aims to provide the missing systematic review, with the goal of ensuring that wherever feasible (given backwards compatibility constraints) these current inconsistencies are addressed for the Python 3.5 release.

Proposals
=========

As a "consistency improvement" proposal, this PEP is actually about a number of smaller micro-proposals, each aimed at improving the self-consistency of the binary data model in Python 3. Proposals are motivated by one of three factors:

* removing remnants of the original design of ``bytes`` as a mutable type
* more consistently accepting length 1 ``bytes`` objects as input where an integer between ``0`` and ``255`` inclusive is expected, and vice-versa
* allowing users to easily convert integer output to a length 1 ``bytes`` object

Alternate Constructors
----------------------

The ``bytes`` and ``bytearray`` constructors currently accept an integer argument, but interpret it to mean a zero-filled object of the given length. This is a legacy of the original design of ``bytes`` as a mutable type, rather than a particularly intuitive behaviour for users.
It has become especially confusing now that other ``bytes`` interfaces treat integers and the corresponding length 1 bytes instances as equivalent input. Compare::

    >>> b"\x03" in bytes([1, 2, 3])
    True
    >>> 3 in bytes([1, 2, 3])
    True

    >>> bytes(b"\x03")
    b'\x03'
    >>> bytes(3)
    b'\x00\x00\x00'

This PEP proposes that the current handling of integers in the bytes and bytearray constructors be deprecated in Python 3.5 and removed in Python 3.6, being replaced by two more type appropriate alternate constructors provided as class methods. The initial python-ideas thread [ideas-thread1]_ that spawned this PEP was specifically aimed at deprecating this constructor behaviour.

For ``bytes``, a ``byte`` constructor is proposed that converts integers (as indicated by ``operator.index``) in the appropriate range to a ``bytes`` object, converts objects that support the buffer API to bytes, and also passes through length 1 byte strings unchanged::

    >>> bytes.byte(3)
    b'\x03'
    >>> bytes.byte(bytearray(bytes([3])))
    b'\x03'
    >>> bytes.byte(memoryview(bytes([3])))
    b'\x03'
    >>> bytes.byte(bytes([3]))
    b'\x03'
    >>> bytes.byte(512)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: bytes must be in range(0, 256)
    >>> bytes.byte(b"ab")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: bytes.byte() expected a byte, but buffer of length 2 found

One specific use case for this alternate constructor is to easily convert the result of indexing operations on ``bytes`` and other binary sequences from an integer to a ``bytes`` object. The documentation for this API should note that its counterpart for the reverse conversion is ``ord()``.

For ``bytearray``, a ``from_len`` constructor is proposed that preallocates the buffer filled with a particular value (defaulting to ``0``) as a direct replacement for the current constructor behaviour, rather than having to use sequence repetition to achieve the same effect in a less intuitive way::

    >>> bytearray.from_len(3)
    bytearray(b'\x00\x00\x00')
    >>> bytearray.from_len(3, 6)
    bytearray(b'\x06\x06\x06')

This part of the proposal was covered by an existing issue [empty-buffer-issue]_ and a variety of names have been proposed (``empty_buffer``, ``zeros``, ``zeroes``, ``allnull``, ``fill``). The specific name currently proposed was chosen by analogy with ``dict.fromkeys()`` and ``itertools.chain.from_iterable()`` to be completely explicit that it is an alternate constructor rather than an in-place mutation, as well as how it differs from the standard constructor.

Open questions
^^^^^^^^^^^^^^

* Should ``bytearray.byte()`` also be added? Or is ``bytearray(bytes.byte(x))`` sufficient for that case?
* Should ``bytes.from_len()`` also be added? Or is sequence repetition sufficient for that case?
* Should ``bytearray.from_len()`` use a different name?
* Should ``bytes.byte()`` raise ``TypeError`` or ``ValueError`` for binary sequences with more than one element? The ``TypeError`` currently proposed is copied (with slightly improved wording) from the behaviour of ``ord()`` with sequences containing more than one code point, while ``ValueError`` would be more consistent with the existing handling of out-of-range integer values.
* ``bytes.byte()`` is defined above as accepting length 1 binary sequences as individual bytes, but this is currently inconsistent with the main ``bytes`` constructor::

      >>> bytes([b"a", b"b", b"c"])
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      TypeError: 'bytes' object cannot be interpreted as an integer

  Should the ``bytes`` constructor be changed to accept iterables of length 1 bytes objects in addition to iterables of integers?
  If so, should it allow a mixture of the two in a single iterable?

Iteration
---------

Iteration over ``bytes`` objects and other binary sequences produces integers. Rather than proposing a new method that would need to be added not only to ``bytes``, ``bytearray`` and ``memoryview``, but potentially to third party types as well, this PEP proposes that iteration to produce length 1 ``bytes`` objects instead be handled by combining ``map`` with the new ``bytes.byte()`` alternate constructor proposed above::

    for x in map(bytes.byte, data):
        # x is a length 1 ``bytes`` object, rather than an integer
        # This works with *any* container of integers in the range
        # 0 to 255 inclusive

Consistent support for different input types
--------------------------------------------

In Python 3.3, the binary search operations (``in``, ``count()``, ``find()``, ``index()``, ``rfind()`` and ``rindex()``) were updated to accept integers in the range 0 to 255 (inclusive) as their first argument (in addition to the existing support for binary sequences).

This PEP proposes extending that behaviour of accepting integers as being equivalent to the corresponding length 1 binary sequence to several other ``bytes`` and ``bytearray`` methods that currently expect a ``bytes`` object for certain parameters.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Fri, Mar 28, 2014 at 08:27:33PM +1000, Nick Coghlan wrote:
1. Add "bytes.chr" such that "bytes.chr(x)" is equivalent to the PEP 461 defined "b'%c' % x"
+1 on the concept, but please not "chr". How about bytes.byte(x)? That emphasises that you are dealing with a single byte, and avoids conflating chars (strings of length 1) with bytes.
+1 on bytearray.allnull, with a mild preference for spelling it "zeroes" instead. I'm indifferent about bytes.allnull. I'm not sure why I would use it in preference to b'\0'*x, or for that matter why I would use it at all. I suppose if there's a bytearray.allnull, for symmetry there should be a bytes.allnull as well.
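For reference, a minimal sketch of the equivalence alluded to here, assuming a Python 3 interpreter: the constructor's current integer handling and the repetition spelling produce the same zero-filled result. The variable names are illustrative only and are not part of any proposal.

```python
n = 4

# Current behaviour: an integer argument is treated as a length.
zeros_via_constructor = bytes(n)

# The spelling mentioned above, which needs no special constructor.
zeros_via_repeat = b"\0" * n

assert zeros_via_constructor == zeros_via_repeat == b"\x00\x00\x00\x00"

# The mutable counterpart, where preallocation is the more common use.
mutable_zeros = bytearray(n)
assert mutable_zeros == bytearray(b"\0" * n)

# Getting the single byte with value n currently needs an iterable.
single = bytes([n])
assert single == b"\x04"
```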
+1 [...]
Not me!
On the vanishingly small chance that there is universal agreement on this, and a minimum of bike-shedding, I think it would still be useful to write up a brief half page PEP linking to the discussion here. -- Steven

On 28 March 2014 20:50, Steven D'Aprano <steve@pearwood.info> wrote:
If I could consistently remember whether to include the "e" or not, "zeroes" would be my preference as well, but almost every time I go to type it, I have to pause...
It's actually in the proposal solely because "bytes(x)" already works that way, and "this is deprecated, use bytes.allnull instead" is a much easier deprecation to sell.
Yeah, that makes sense, so regardless of what happens in this thread, I'll still be putting a PEP together and asking Guido for his verdict. Heaps of time before 3.5 though... Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 29Mar2014 01:16, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Please get this right. Zeroes. Two 'e's. Otherwise I will feel pain every time I encounter zeros. Looks Greek. -- Cameron Simpson <cs@zip.com.au> Yes, we plot no less than the destruction of the West. Just the other day a friend and I came up with the most pernicious academic scheme to date for toppling the West: He will kneel behind the West on all fours. I will push it backwards over him. - Michael Berube

On Fri, Mar 28, 2014 at 2:59 PM, Cameron Simpson <cs@zip.com.au> wrote:
zeros is way more popular than zeroes: https://books.google.com/ngrams/graph?content=zeroes%2Czeros&year_start=1900&year_end=2013&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Czeroes%3B%2Cc0%3B.t1%3B%2Czeros%3B%2Cc0

On 29 March 2014 17:52, Gregory P. Smith <greg@krypto.org> wrote:
This subthread is confirming my original instinct to just avoid the problem entirely because *either* choice is inevitably confusing for a non-trivial subset of users :) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Nick Coghlan wrote:
How about just calling it "zero"? There's really no need for it to be plural. After all, we refer to int(0) as just "zero", even though there may be more than one zero bit in its representation. I really don't like "allnull", it just sounds wrong, too ascii-centric. -- Greg

On 30Mar2014 10:31, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Unfortunately, "zero" reads like a verb. One might zero an array, for example. As a noun it would imply exactly one. It's a good idea though; I guess I'm -0 purely for verb/noun reasons. If numpy already has zeros() with a very similar meaning I would live with the misspelling for the sake of consistency. If the numpy zeros does something different then it would carry no weight with me. Hoping for something more betterer, -- Cameron Simpson <cs@zip.com.au> For those who understand, NO explanation is needed, for those who don't understand, NO explanation will be given! - Davey D <decoster@vnet.ibm.com>

Cameron Simpson wrote:
Please get this right. Zeroes. Two 'e's.
Otherwise I will feel pain every time I encounter zeros. Looks Greek.
I sympathi{s,z}e, but I feel that having it inconsistent with numpy would cause more pain overall. Ah, I know -- Python should treat identifiers ending in "-os" and "-oes" as equivalent! That would solve all problems of this kind once and for all. "import oes"-ly y'rs, Greg

On Mar 28, 2014 5:01 AM, "Nick Coghlan" <ncoghlan@gmail.com> wrote:
However, it reads better, which trumps the possible minor spelling inconvenience. (all else equal, readability trumps write-ability, etc.)
Fair point, though I also agree with Steven here. Could the replacement (b'\0'*x) be a doc note instead of a new method? -eric

On 2014-03-28, Steven D'Aprano wrote:
I think numpy uses 'zeros' so we should use that.
It's odd, IMHO. When looking to implement %-interpolation for bytes I did some searching to see how widely it is used. There are a few uses in the standard library but it really should have been a special constructor. Accidental uses probably outnumber legitimate ones, quite a bad API design. Neil

On Fri, Mar 28, 2014 at 09:21:15AM -0600, Neil Schemenauer wrote:
Why would we want to duplicate their spelling error? *wink* I think I'd rather have Nick's original suggestion, allnull, than "zeros". "Zeros" sounds like it ought to be the name of a Greek singer :-) I'm aware that zeros/zeroes are both considered acceptable variant spellings. Not acceptable to me *wink* -- Steven

On Fri, 28 Mar 2014 20:27:33 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
Which other bytes interfaces are you talking about?
You mean bytes.chr(x) is equivalent to bytes([x]). The intent is slightly more obvious indeed, so I'd be inclined to be +0, but Python isn't really a worse language if it doesn't have that alternative spelling.
I don't like it, but I also don't think it's enough of a nuisance to be deprecated. (the indexing behaviour of bytes objects is far more annoying) Regards Antoine.
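The equivalence under discussion can be sketched as follows. The helper `byte` is a hypothetical stand-in for the proposed class method, not an existing API, and the `b"%c" % x` spelling assumes Python 3.5 or later, where %-formatting for bytes is available.

```python
def byte(x):
    """Hypothetical stand-in for the proposed bytes.chr / bytes.byte."""
    # Wrapping the integer in a list selects the "iterable of ints"
    # constructor path instead of the "length" path.
    return bytes([x])


assert byte(3) == b"\x03"
assert byte(97) == b"a"

# Assumes Python 3.5+: %c accepts an integer in range(256).
assert b"%c" % 3 == byte(3)

# Values outside 0-255 are rejected rather than silently allocated.
try:
    byte(256)
except ValueError:
    out_of_range_rejected = True
else:
    out_of_range_rejected = False

assert out_of_range_rejected
```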

On 28 March 2014 21:22, Antoine Pitrou <solipsis@pitrou.net> wrote:
The ones where a length 1 bytes object and the corresponding integer are interchangeable. For example, containment testing:
Compare:
That's the inconsistency that elevates the current constructor behaviour from weird to actively wrong for me - it doesn't match the way other bytes interfaces have evolved over the course of the Python 3 series.
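A short sketch of the contrast being drawn, assuming Python 3.3 or later. The examples are illustrative and are not a reconstruction of the snippets from the original message.

```python
data = b"abc"

# Containment testing treats a length 1 bytes object and the
# corresponding integer interchangeably.
assert b"a" in data
assert 97 in data

# So does counting occurrences.
assert data.count(b"a") == data.count(97) == 1

# The constructor does not: the two arguments mean different things.
from_bytes_obj = bytes(b"a")   # copies the single byte
from_int = bytes(97)           # allocates 97 zero bytes

assert len(from_bytes_obj) == 1
assert len(from_int) == 97
assert from_int == b"\x00" * 97
```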
Oops, I forgot to explain the context where this idea came up: I was trying to figure out how to iterate over a bytes object or wrap an indexing operation to get a length 1 byte sequence rather than an integer.

Currently: probably muck about with lambda or a comprehension

With this change (using Steven D'Aprano's suggested name):

    for x in map(bytes.byte, data):
        # x is a length 1 bytes object, not an int

    x = bytes.byte(data[0]) # ditto

bytes.byte could actually apply the same policy as some other APIs and also accept ASCII text code points in addition to length 1 bytes objects and integers below 256.

Since changing the iteration and indexing behaviour of bytes and bytearray within the Python 3 series isn't feasible, this idea is about making the current behaviour easier to deal with.

And yes, this is definitely going to need a PEP :)

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
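For comparison, a minimal sketch of how this can be done today without any new API, assuming a Python 3 interpreter. The helper name `iterbytes` is hypothetical and chosen only for illustration.

```python
def iterbytes(data):
    """Yield each element of data as a length 1 bytes object.

    Slicing (unlike indexing) returns a bytes object, so a one-element
    slice at every position avoids the integer round trip.
    """
    return (data[i:i + 1] for i in range(len(data)))


data = b"abc"

# Plain iteration and indexing produce integers.
assert list(data) == [97, 98, 99]
assert data[0] == 97

# Slicing, or re-wrapping each integer, produces length 1 bytes objects.
assert data[0:1] == b"a"
assert list(iterbytes(data)) == [b"a", b"b", b"c"]
assert [bytes([b]) for b in data] == [b"a", b"b", b"c"]
```

The slice-based version also works unchanged on bytearray input, although it then yields length 1 bytearray objects rather than bytes.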

On Fri, Mar 28, 2014, at 13:54, Benjamin Peterson wrote:
Should this be purely an iteration function, or something that can also index and slice [into objects that can subsequently be iterated or indexed]? What does six do about that?

2014-03-28 18:17 GMT+01:00 Guido van Rossum <guido@python.org>:
even if some people don’t like it, iterating over bytes objects makes perfect sense. in every statically typed language, a byte is most commonly represented as an integer in the range [0,255]. the annoyance has two sources: 1. python 2 byte strings (aka “native strings”) behaved like that (and unicode strings behave like that because python has no char type) 2. byte strings are represented like strings, including ascii-compatible parts as if they were ASCII text. therefore, people think single bytes would be something akin to chars instead of simply integers from 0 to 255. b'a'[0] == 97 looks strange, but let’s not forget that it’s actually b'\x61'[0] == 0x61

On Fri, 28 Mar 2014 18:31:58 +0100 "Philipp A." <flying-sheep@web.de> wrote:
The actual source isn't merely cultural. It's that most of the time, bytes objects are used as containers of arbitrary binary data, not as arrays of integers (on which you would do element-wise arithmetic calculations, for instance). So, even if the current behaviour makes sense, it's not optimal for the common uses of bytes objects. Regards Antoine.

On Fri, Mar 28, 2014 at 12:42 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
I don't see bytes as integers, but as representations of integers, that typically come in a stream or array. Using an integer to represent that byte sometimes makes sense, but sometimes you wish to refer to the representation of the integer (the byte), not the integer itself.

Nick Coghlan writes:
+1

~ 13:49$ python3.3
Python 3.3.5 (default, Mar 9 2014, 08:10:50)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
The behavior of str(3) is not exactly an argument *for* your claim, but it's not an argument against since there's no reason to expect a byte to have an ASCII interpretation in general.
bytes.chr(accidental_large_integer) presumably raises. I don't really see a bigger problem with the current behavior than that (in the accidentally large case). Both are likely to distress users.
No. bytearray I'd have to think about.
Does the above proposal sound like a reasonable suggestion for improvement in 3.5?
Yes.
Does this hit PEP territory, since it's changing the signature and API of a builtin?
No opinion. Steve

On 28 March 2014 10:27, Nick Coghlan <ncoghlan@gmail.com> wrote:
I hate it, I'd love to see a change. It bit me when I wrote my first Python-3-only project, and while it wasn't hard to debug it was utterly baffling for quite a while. Writing bytes([integer]) is pretty ugly, and it makes me sad each time I see it. I've even commented occurrences of it to remind myself that this is really the way I meant to do it.
I'd like to see it improved, even if I can't take advantage of that improvement for quite some time. Sounds like a PEP to me.

On 03/28/2014 06:27 AM, Nick Coghlan wrote:
I'd expect to use it this way... bytes.chr('3') Which wouldn't be correct. We have this...
>>> (3).to_bytes(1, "little")
b'\x03'
OR...
Both of those seem like way too much work to me. Bytes is a container object, so it makes sense to do... bytes([3]) The other interface you're suggesting is more along the lines of converting an object to bytes. That would be a nice feature if it can be made to work with more objects, and then convert back again. (even if only a few types were supported) b_three = bytes.from_obj(3) three = b_three.to_obj(int) That wouldn't be limited to values between 0 and 255 like the hex version above. (it doesn't need a length, or byteorder args.)
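As a rough illustration of that round trip using only what exists today, the sketch below builds on `int.to_bytes` and `int.from_bytes`. The helper names are hypothetical, the byte order is fixed to big-endian as an assumption, and only non-negative integers are handled.

```python
def int_to_bytes(n):
    """Encode a non-negative int in the fewest bytes needed (big-endian)."""
    # bit_length() is 0 for n == 0, so force at least one byte.
    length = max(1, (n.bit_length() + 7) // 8)
    return n.to_bytes(length, "big")


def bytes_to_int(b):
    """Decode bytes produced by int_to_bytes back to an int."""
    return int.from_bytes(b, "big")


b_three = int_to_bytes(3)
three = bytes_to_int(b_three)

assert b_three == b"\x03"
assert three == 3

# Not limited to 0-255: larger values simply take more bytes.
assert int_to_bytes(1000) == b"\x03\xe8"
assert bytes_to_int(int_to_bytes(1000)) == 1000
```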
Numpy uses zeros and shape...
(Without the e.) I'm not sure if it's a good idea to use zeros with a different signature. It has been mentioned that multidimensional slicing may be added to Python. So it may be a good idea to follow numpy's usage.
Agree Cheers, Ron

On Fri, Mar 28, 2014 at 9:41 AM, Ron Adam <ron3200@gmail.com> wrote:
np.zeros() can be called with an integer (length) argument as well:
>>> np.zeros(3)
array([ 0., 0., 0.])
To be completely analogous to the proposed bytes constructor, you would have to specify the data type, though:
>>> np.zeros(3, 'B')
array([0, 0, 0], dtype=uint8)
+1 for bytes.zeros(n)

Great idea Nick. If I may dip my brush in some paint buckets. On Mar 28, 2014, at 08:27 PM, Nick Coghlan wrote:
I agree with Steven that bytes.byte() is a better spelling.
I like bytearray.fill() for this. The first argument would be the fill count, but it could take an optional second argument for the byte value to fill it with, which would of course default to zero. E.g.
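A minimal sketch of what such a fill constructor might do, written as a free function because no `bytearray.fill` exists. The name `filled` and its signature are assumptions made for illustration.

```python
def filled(count, value=0):
    """Hypothetical stand-in for the suggested bytearray.fill(count, value).

    Returns a bytearray of length count with every byte set to value,
    which defaults to zero.
    """
    return bytearray([value]) * count


assert filled(3) == bytearray(b"\x00\x00\x00")
assert filled(2, 0xFF) == bytearray(b"\xff\xff")
assert filled(0) == bytearray()
```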
+1
Does the above proposal sound like a reasonable suggestion for improvement in 3.5?
Very much so.
Does this hit PEP territory, since it's changing the signature and API of a builtin?
I don't much care either way. A PEP is not *wrong* (especially if we all start painting), but I think a tracker issue would be fine too. Cheers, -Barry

On Mar 28, 2014, at 07:31 AM, Ethan Furman wrote:
You mean like http://bugs.python.org/issue20895 ? :)
Step *away* from the time machine. -Barry

On 28/03/2014 14:28, Barry Warsaw wrote:
I was under the impression that Ethan Furman had raised an issue, or at least commented on one, but I couldn't find such a thing, am I simply mistaken? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence