A possible transition plan to bytes-based iteration and indexing for binary data

At PyCon earlier this year, Guido (and others) persuaded me that the integer based indexing and iteration for bytes and bytearray in Python 3 was a genuine design mistake based on the initial Python 3 design which lacked an immutable bytes type entirely (so producing integers was originally the only reasonable choice).

The earlier design discussions around PEP 467 (which proposes to clean up a few other bits and pieces of that original legacy which PEP 3137 left in place) all treated "bytes indexing returns an integer" as an unchangeable aspect of Python 3, since there wasn't an obvious way to migrate to instead returning length 1 bytes objects with a reasonable story to handle the incompatibility for Python 3 users, even if everyone was in favour of the end result.

A few weeks ago I had an idea for a migration strategy that seemed feasible, and I now have a very, very preliminary proof of concept up at https://bitbucket.org/ncoghlan/cpython_sandbox/branch/bytes_migration_experi...

The general principle involved would be to return an integer *subtype* from indexing and iteration operations on bytes, bytearray and memoryview objects using the "default" format character. That subtype would then be detected in various locations and handled the way a length 1 bytes object would be handled, rather than the way an integer would be handled. The current proof of concept adds such handling to ord(), bytes() and bytearray() (with appropriate test cases in test_bytes) giving the following results:
    >>> b'hello'[0]
    104
    >>> ord(b'hello'[0])
    104
    >>> bytes(b'hello'[0])
    b'h'
    >>> bytearray(b'hello'[0])
    bytearray(b'h')
(the subtype is currently visible at the Python level as "types._BytesInt")

The proof of concept doesn't override any normal integer behaviour, but a more complete solution would be in a position to emit a warning when the result of binary indexing is used as an integer (either always, or controlled by a command line switch, depending on the performance impact).

With this integer subtype in place for Python 3.5 to provide a transition period where both existing integer-compatible operations (like int() and arithmetic operations) and selected bytes-compatible operations (like ord(), bytes() and bytearray()) are supported, these operations could then be switched to producing a normal length 1 bytes object in Python 3.6.

It wouldn't be pretty, and it would be a pain to document, but it seems feasible. The alternative is for PEP 367 to add a separate bytes iteration method, which strikes me as further entrenching a design we aren't currently happy with.

Regards,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
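The proof of concept itself lives in C and is not reproduced here. As a rough pure-Python approximation of the idea, the sketch below uses two hypothetical names, BytesInt and TransitionalBytes, which do not appear in the proposal. It relies on bytes() consulting __bytes__ before treating its argument as a length, and it cannot change ord() or bytearray(), which the real proof of concept does.

```python
class BytesInt(int):
    """Hypothetical pure-Python stand-in for the proposed int subtype."""

    def __bytes__(self):
        # bytes() consults __bytes__ before treating its argument as a
        # length, so the value is rendered as a single byte here.
        return int.to_bytes(self, 1, "big")


class TransitionalBytes(bytes):
    """Hypothetical bytes subclass whose items are BytesInt instances."""

    def __getitem__(self, index):
        item = super().__getitem__(index)
        # Integer indexing yields an int; slicing still yields bytes.
        return BytesInt(item) if isinstance(item, int) else item

    def __iter__(self):
        return (BytesInt(value) for value in super().__iter__())


first = TransitionalBytes(b"hello")[0]
print(first, first + 1, bytes(first))
```

The item still compares and adds like an integer, while bytes() treats it as a single byte rather than as a buffer length.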

On 15/06/2014 08:33, Nick Coghlan wrote:
The general principle involved would be to return an integer *subtype* from indexing and iteration operations on bytes, bytearray and memoryview objects using the "default" format character. That subtype would then be detected in various locations and handled the way a length 1 bytes object would be handled, rather than the way an integer would be handled. The current proof of concept adds such handling to ord(), bytes() and bytearray() (with appropriate test cases in test_bytes) giving the following results:
    >>> b'hello'[0]
    104
    >>> ord(b'hello'[0])
    104
    >>> bytes(b'hello'[0])
    b'h'
    >>> bytearray(b'hello'[0])
    bytearray(b'h')

That sounds terribly confusing to me. I'd rather live with the current behaviour.

Regards

Antoine.

On Sun, Jun 15, 2014 at 10:33:14PM +1000, Nick Coghlan wrote:
At PyCon earlier this year, Guido (and others) persuaded me that the integer based indexing and iteration for bytes and bytearray in Python 3 was a genuine design mistake based on the initial Python 3 design which lacked an immutable bytes type entirely (so producing integers was originally the only reasonable choice). [...] The general principle involved would be to return an integer *subtype*
Have you considered subclassing bytes, rather than int?

    for i in b"foo":
        assert isinstance(i, int)

    for b in sensible_bytes(b"foo"):
        assert isinstance(b, bytes)

I'm not wedded to the name :-)

And then, perhaps some time in the distant future when porting from Python 2.7 is no longer a priority, we can add

    from __future__ import bytes_iteration_yields_bytes

There are at least two obvious downsides: the b'' syntax will still refer to the less useful type, and it will be a violation of the Liskov substitution principle (but then I've always considered that to be a guideline rather than a hard law).
It wouldn't be pretty, and it would be a pain to document, but it seems feasible. The alternative is for PEP 367 to add a separate bytes iteration method, which strikes me as further entrenching a design we aren't currently happy with.
Unless you have a strategy to deprecate *and remove* the magic int subclass some time in the foreseeable future, you're still entrenching the design. I think whatever we do, we're going to end up with something ugly in the language. Possibly the least ugly, and certainly the least magic, is a separate bytes iteration method.

Keeping-an-open-mind-but-leaning-towards-minus-one-on-the-idea-ly y'rs,

-- Steven
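For illustration only, the sensible_bytes suggestion above could be sketched in pure Python roughly as follows. The method bodies are assumptions about how such a subclass might behave; the message gives only the name and the isinstance() expectations.

```python
class sensible_bytes(bytes):
    """Sketch of a bytes subclass that yields length 1 bytes objects."""

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Keep slices in the subclass so the behaviour is preserved.
            return type(self)(super().__getitem__(index))
        # Let bytes validate the index (raising IndexError or TypeError),
        # then wrap the resulting integer back into a single byte.
        return bytes([super().__getitem__(index)])

    def __iter__(self):
        return (bytes([value]) for value in super().__iter__())


for i in b"foo":
    assert isinstance(i, int)

for b in sensible_bytes(b"foo"):
    assert isinstance(b, bytes)
```

The Liskov concern is visible here: code that accepts any bytes instance and expects integers from indexing would break when handed a sensible_bytes.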

On 15 Jun 2014 16:25, "Steven D'Aprano" <steve@pearwood.info> wrote:
On Sun, Jun 15, 2014 at 10:33:14PM +1000, Nick Coghlan wrote:
At PyCon earlier this year, Guido (and others) persuaded me that the integer based indexing and iteration for bytes and bytearray in Python 3 was a genuine design mistake based on the initial Python 3 design which lacked an immutable bytes type entirely (so producing integers was originally the only reasonable choice). [...] The general principle involved would be to return an integer *subtype*
Have you considered subclassing bytes, rather than int?
Isn't the obvious answer to subclass both? This would require a bit of fiddling to ensure memory layout compatibility, but seems feasible to me [1]. So b"abcd" would give a bytes object, and b"abcd"[0] would give an inty_bytes object, which acts like an int in int contexts and like a bytes in bytes contexts. E.g.,

    inty_bytes + int -> int (and warns)
    inty_bytes + bytes -> bytes

Bonus points if we can make isinstance(inty_bytes, int) warn too.

The main obstacle I see is that there are a small number of operations that are well defined for both bytes and int objects with different semantics:

    inty_bytes * int -> ?
    inty_bytes + inty_bytes -> ?

I suspect these will be a major challenge for any transition scheme.

(Is it even viable to make bytes method behaviour dependent on a __future__ import? I guess this would require stack frame inspection?)

-n

[1] Specifically, I envision adding an unexposed base class that has the struct fields required by int but no methods, making int and bytes both inherit from it, and the inty_bytes would inherit from both. This wastes a bit of memory in each bytes object, but only during the transition.
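The ambiguity described above can be seen with today's types. The short sketch below compares the integer and bytes readings of the same operators, and also shows why the shared base class from footnote [1] would be needed: CPython currently rejects a class that inherits from both int and bytes. The variable names are illustrative only.

```python
# The same operators mean different things for the two parent types.
as_int = b"h"[0]        # 104 under current Python 3 semantics
as_bytes = b"h"[0:1]    # b'h'

assert as_int * 3 == 312
assert as_bytes * 3 == b"hhh"
assert as_int + as_int == 208
assert as_bytes + as_bytes == b"hh"

# Without the shared base class described in footnote [1], CPython refuses
# to combine the two built-in instance layouts at all.
try:
    class inty_bytes(int, bytes):
        pass
except TypeError:
    layout_conflict = True
else:
    layout_conflict = False
```

Any inty_bytes type would have to pick one of the two results for each of these expressions, which is the obstacle the message points at.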

A further thought comes to mind...

On Sun, Jun 15, 2014 at 10:33:14PM +1000, Nick Coghlan wrote:
[...]
The general principle involved would be to return an integer *subtype*
bytes(b'hello'[0]) b'h'
Hmmm. This is, I think, worrying. Now you have two sorts of ints:

    a = b'hello'[0]
    b = 104
    assert a == b  # succeeds
    assert bytes(a) == bytes(b)  # fails

I can see problems where one of these _BytesInt objects gets used where you're expecting a regular int, or vice versa, and you're left with a silent failure and perplexing, hard-to-diagnose behaviour.

-- Steven

On Mon, Jun 16, 2014 at 1:36 AM, Steven D'Aprano <steve@pearwood.info> wrote:
Hmmm. This is, I think, worrying. Now you have two sorts of ints:
    a = b'hello'[0]
    b = 104
    assert a == b  # succeeds
    assert bytes(a) == bytes(b)  # fails

ISTM the problem here is the bytes(104) constructor, which is of marginal utility anyway. If that could be configured to produce a warning, that would solve the problem, right? You might get that assertion failing, but you'd get a warning that explains why.

ChrisA
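For reference, this is what the integer form of the constructor does today, next to the spellings that produce a single byte. The warn_on_int_bytes wrapper is a hypothetical illustration of the warning idea, not an existing API.

```python
import warnings

zero_filled = bytes(3)                   # three NUL bytes, not b'\x03'
single_byte = bytes([104])               # b'h'
also_single = (104).to_bytes(1, "big")   # b'h'


def warn_on_int_bytes(source):
    """Hypothetical wrapper: warn when bytes() is given a bare integer."""
    if isinstance(source, int):
        warnings.warn(
            "bytes(int) creates a zero-filled buffer of that length",
            stacklevel=2,
        )
    return bytes(source)
```

With such a warning in place, the failing assertion in the quoted example would at least come with an explanation of why the two sides differ.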

On Sun, Jun 15, 2014 at 5:33 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
It wouldn't be pretty, and it would be a pain to document, but it seems feasible. The alternative is for PEP 367 to add a separate bytes
I believe you mean PEP 467.
iteration method, which strikes me as further entrenching a design we aren't currently happy with.
Regards, Nick.
We just got rid of the mess of having multiple integer types (int vs long); it'd be a shame to recreate that problem in any form.

The ship has sailed. Python 3 means bytes indexing returns ints. It's well defined and code has started to depend on it. People who want a b'A' instead of 0x41 know to use slice notation [n:n+1] instead of [n] to get a one byte bytes(), as that is what is required in code that works in 2.6 through 3.4 today. Anything we do to change it is going to be messier and more mysterious.

Entertaining the idea anyway: If there is going to be a new type for bytes indexing, it needs to multiply inherit from both int and bytes so that isinstance() checks work. We'd need to make sure all C API calls that check for a specific type actually work with the new one as well (at first glance I count 57 uses of PyBytes_CheckExact and PyLong_CheckExact in CPython). The ambiguous operator * and + cases, and any similar ones that Nathaniel Smith pointed out, would still be a problem and a potential source of confusion for users.

If anything, a new iteration method in PEP 467 that yields length 1 bytes() makes *some* sense for convenience, but I don't personally see much use for single byte iteration of any form in a high level language.

It is odd to me that str and bytes *ever* supported iteration. How many times have we each written code to check that a passed argument was "a sequence but, oh, wait, not a string, because you didn't *really* mean to do that". That was a Python 1 decision. Oops. :)

-gps
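The slicing idiom and the "sequence but not a string" check mentioned above look roughly like this in Python 3. The helper names iter_single_bytes and is_nonstring_sequence are made up for this sketch.

```python
from collections.abc import Sequence

data = b"ABC"
assert data[0] == 0x41       # indexing yields an int in Python 3
assert data[0:1] == b"A"     # slicing yields a length 1 bytes object


def iter_single_bytes(data):
    """Yield each byte of data as a length 1 bytes object via slicing."""
    return (data[i:i + 1] for i in range(len(data)))


def is_nonstring_sequence(obj):
    """Accept lists, tuples and similar, but reject str and binary types."""
    return isinstance(obj, Sequence) and not isinstance(
        obj, (str, bytes, bytearray)
    )
```

The slice form returns the same kind of value on Python 2 and Python 3, which is why it is the portable spelling.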

Why do we need a fancy subtype when a future statement could get us the semantics we want without breaking anything? I realize it won't work with 2.7 but at least it gives us some way forward that isn't quite so delicate.

_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

On Sun, Jun 15, 2014 at 2:42 PM, Dr. Brett Cannon <bcannon@gmail.com> wrote:
Why do we need a fancy subtype when a future statement could get us the semantics we want without breaking anything? I realize it won't work with 2.7 but at least it gives us some way forward that isn't quite so delicate.
How could it? Within a single file where such a statement applies, there is no knowledge of what the types are. In order for this to work you would need to have your __future__ statement alter the behavior of *all* [] and iteration done within the file to conditionally take a code path that does something different iff the type being operated on is determined at runtime to be bytes.

-gps

Gregory P. Smith wrote:
In order for this to work you would need to have your __future__ statement alter the behavior of *all* [] and iteration done within the file to conditionally take a code path that does something different iff the type being operated on is determined at runtime to be bytes.
It *could* be done. When the future statement is in effect, different bytecodes could be generated for indexing and iteration that look out for bytes and work differently.

-- Greg
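One way to picture what such alternative bytecodes would have to do at runtime is the pair of helper functions below. This is only a sketch: future_getitem and future_iter are hypothetical names, and no such __future__ feature exists. The point is that every subscript and every iteration in the affected module would pay for the type check.

```python
def future_getitem(obj, index):
    """Hypothetical runtime behaviour of obj[index] under the future import."""
    item = obj[index]
    # Only integer results from binary types are converted; slices and
    # all other container types pass through unchanged.
    if isinstance(obj, (bytes, bytearray)) and isinstance(item, int):
        return bytes([item])
    return item


def future_iter(obj):
    """Hypothetical runtime behaviour of iter(obj) under the future import."""
    if isinstance(obj, (bytes, bytearray)):
        return (bytes([value]) for value in obj)
    return iter(obj)
```

For example, future_getitem(b"hello", 0) gives b"h", while future_getitem([1, 2], 0) still gives 1.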

On 15.06.2014 23:42, Dr. Brett Cannon wrote:
Why do we need a fancy subtype when a future statement could get us the semantics we want without breaking anything? I realize it won't work with 2.7 but at least it gives us some way forward that isn't quite so delicate.
Whatever the solution, +100 on making the change default in Python 3.6 :-)
-- Marc-Andre Lemburg, eGenix.com

+1 on "the ship has sailed". Let's live with the consequences rather than introduce yet another change. The change will cause more friction than getting used to the current behavior. On Sun, Jun 15, 2014 at 10:03 AM, Gregory P. Smith <greg@krypto.org> wrote:
On Sun, Jun 15, 2014 at 5:33 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
At PyCon earlier this year, Guido (and others) persuaded me that the integer based indexing and iteration for bytes and bytearray in Python 3 was a genuine design mistake based on the initial Python 3 design which lacked an immutable bytes type entirely (so producing integers was originally the only reasonable choice).
The earlier design discussions around PEP 467 (which proposes to clean up a few other bits and pieces of that original legacy which PEP 3137 left in place) all treated "bytes indexing returns an integer" as an unchangeable aspect of Python 3, since there wasn't an obvious way to migrate to instead returning length 1 bytes objects with a reasonable story to handle the incompatibility for Python 3 users, even if everyone was in favour of the end result.
A few weeks ago I had an idea for a migration strategy that seemed feasible, and I now have a very, very preliminary proof of concept up at https://bitbucket.org/ncoghlan/cpython_sandbox/branch/bytes_migration_experi...
The general principle involved would be to return an integer *subtype* from indexing and iteration operations on bytes, bytearray and memoryview objects using the "default" format character. That subtype would then be detected in various locations and handled the way a length 1 bytes object would be handled, rather than the way an integer would be handled. The current proof of concept adds such handling to ord(), bytes() and bytearray() (with appropriate test cases in test_bytes) giving the following results:
b'hello'[0] 104 ord(b'hello'[0]) 104 bytes(b'hello'[0]) b'h' bytearray(b'hello'[0]) bytearray(b'h')
(the subtype is currently visible at the Python level as "types._BytesInt")
The proof of concept doesn't override any normal integer behaviour, but a more complete solution would be in a position to emit a warning when the result of binary indexing is used as an integer (either always, or controlled by a command line switch, depending on the performance impact).
With this integer subtype in place for Python 3.5 to provide a transition period where both existing integer-compatible operations (like int() and arithmetic operations) and selected bytes-compatible operations (like ord(), bytes() and bytearray()) are supported, these operations could then be switched to producing a normal length 1 bytes object in Python 3.6.
It wouldn't be pretty, and it would be a pain to document, but it seems feasible. The alternative is for PEP 367 to add a separate bytes
I believe you mean PEP 467.
iteration method, which strikes me as further entrenching a design we aren't currently happy with.
Regards, Nick.
We just got rid of the mess of having multiple integer types (int vs long), it'd be a shame to recreate that problem in any form.
The ship has sailed. Python 3 means bytes indexing returns ints. It's well defined and code has started to depend on it. People who want a b'A' instead of 0x41 know to use slice notation [n:n+1] instead of [n] to get a one byte bytes() as that is what is required in code that works in 2.6 through 3.4 today. Anything we do to change it is going to be messier and more mysterious.
Entertaining the idea anyways: If there is going to be a new type for bytes indexing, it needs to multiply inherit from both int and bytes so that isinstance() checks work. We'd need to make sure all C API calls that check for a specific type actually work with the new one as well (at first glance I count 57 uses of PyBytes_CheckExact and PyLong_CheckExact in CPython). The ambiguious operator * and + cases and any similar that Nathaniel Smith pointed out would still be a problem and a potential source of confusion for users.
If anything, a new iteration method in PEP 467 that yields length 1 bytes() makes *some* sense for convenience, but I don't personally see much use for single byte iteration of any form in a high level language.
It is odd to me that str and bytes *ever* supported iteration. How many times have we each written code to check that a passed argument was "a sequence but, oh, wait, not a string, because you didn't *really* mean to do that". That was a Python 1 decision. Oops. :)
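The kind of check being described usually ends up looking something like the following. It is one common way to write it, not a canonical one, and `is_nonstring_sequence` is a hypothetical helper name.

```python
from collections.abc import Sequence


def is_nonstring_sequence(obj):
    # Accept lists, tuples and other sequences, but reject the
    # string-like types that are technically sequences too.
    return isinstance(obj, Sequence) and not isinstance(
        obj, (str, bytes, bytearray)
    )


print(is_nonstring_sequence(["a", "b"]))
print(is_nonstring_sequence("ab"))
print(is_nonstring_sequence(b"ab"))
```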
-gps
_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
-- --Guido van Rossum (python.org/~guido)

On 16 Jun 2014 09:10, "Guido van Rossum" <guido@python.org> wrote:
+1 on "the ship has sailed". Let's live with the consequences rather than introduce yet another change. The change will cause more friction than getting used to the current behavior.
OK by me - I thought your reaction might be along those lines, which is why I posted the idea for feedback as soon as the proof of concept was even vaguely functional. I'll go back to the approach of improving the Python 3 bytes & bytearray docs before updating PEP 467 again.
Cheers, Nick.
On Sun, Jun 15, 2014 at 10:03 AM, Gregory P. Smith <greg@krypto.org> wrote:
On Sun, Jun 15, 2014 at 5:33 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
participants (10)
- Antoine Pitrou
- Chris Angelico
- Dr. Brett Cannon
- Greg Ewing
- Gregory P. Smith
- Guido van Rossum
- M.-A. Lemburg
- Nathaniel Smith
- Nick Coghlan
- Steven D'Aprano