[Python-ideas] A possible transition plan to bytes-based iteration and indexing for binary data

Sun Jun 15 14:33:14 CEST 2014

At PyCon earlier this year, Guido (and others) persuaded me that the
integer-based indexing and iteration for bytes and bytearray in Python
3 is a genuine design mistake. It stems from the initial Python 3
design, which lacked an immutable bytes type entirely (so producing
integers was originally the only reasonable choice).

The earlier design discussions around PEP 467 (which proposes to clean
up a few other bits and pieces of the original legacy that PEP 3137
left in place) all treated "bytes indexing returns an integer" as an
unchangeable aspect of Python 3. Even if everyone was in favour of the
end result, there wasn't an obvious way to migrate to returning length
1 bytes objects with a reasonable story for handling the
incompatibility for existing Python 3 users.

A few weeks ago I had an idea for a migration strategy that seemed
feasible, and I now have a very, very preliminary proof of concept up
at https://bitbucket.org/ncoghlan/cpython_sandbox/branch/bytes_migration_experiment

The general principle involved would be to return an integer *subtype*
from indexing and iteration operations on bytes, bytearray and
memoryview objects using the "default" format character. That subtype
would then be detected in various locations and handled the way a
length 1 bytes object would be handled, rather than the way an integer
would be handled. The current proof of concept adds such handling to
ord(), bytes() and bytearray() (with appropriate test cases in
test_bytes) giving the following results:

>>> b'hello'[0]
104
>>> ord(b'hello'[0])
104
>>> bytes(b'hello'[0])
b'h'
>>> bytearray(b'hello'[0])
bytearray(b'h')

(the subtype is currently visible at the Python level as "types._BytesInt")
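
As a rough illustration of the detection idea (a pure Python sketch,
not the C code in the sandbox branch), the behaviour above can be
approximated as follows. The names BytesInt, index_bytes, as_bytes and
as_ord are hypothetical stand-ins: the real proof of concept changes
the builtins themselves, which plain Python code cannot do.

```python
class BytesInt(int):
    """Hypothetical stand-in for the proof of concept's types._BytesInt."""

    def __new__(cls, value):
        if not 0 <= value <= 255:
            raise ValueError("byte must be in range(0, 256)")
        return super().__new__(cls, value)


def index_bytes(data, i):
    """Mimic the proposed indexing: return the int subtype, not a plain int."""
    return BytesInt(data[i])


def as_bytes(obj):
    """Mimic the proposed bytes(): treat the subtype like a length 1 bytes object."""
    if isinstance(obj, BytesInt):
        return bytes([obj])
    return bytes(obj)  # unchanged behaviour for every other argument


def as_ord(obj):
    """Mimic the proposed ord(): accept the subtype as well as length 1 objects."""
    if isinstance(obj, BytesInt):
        return int(obj)
    return ord(obj)


first = index_bytes(b'hello', 0)
print(first)            # still behaves as the integer 104
print(as_ord(first))    # 104, where ord() on a plain int raises TypeError
print(as_bytes(first))  # b'h', where bytes(104) is 104 zero bytes
```

The point of the subtype is that existing integer-style code keeps
working (first + 1 is still 105), while the bytes-aware entry points
can tell the value came from binary indexing.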

The proof of concept doesn't override any normal integer behaviour,
but a more complete solution would be in a position to emit a warning
when the result of binary indexing is used as an integer (either
always, or controlled by a command line switch, depending on the
performance impact).
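
A minimal sketch of what such a warning might look like, again in pure
Python and with hypothetical names (WarningBytesInt is not part of the
proof of concept, and only addition is covered here; a complete
version would need to cover the other integer operations too):

```python
import warnings


class WarningBytesInt(int):
    """Hypothetical int subtype that warns when used in integer arithmetic."""

    def __add__(self, other):
        warnings.warn(
            "result of bytes indexing used as an integer",
            DeprecationWarning,
            stacklevel=2,
        )
        # Fall back to ordinary integer addition after warning.
        return int(self) + other

    __radd__ = __add__


value = WarningBytesInt(104)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    total = value + 1  # integer-style use triggers the warning

print(total, len(caught))
```

Whether the warning is always emitted or only under a command line
switch would then come down to the filter configuration and the cost
of the extra dispatch.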

This integer subtype would be in place for Python 3.5, providing a
transition period in which both existing integer-compatible operations
(like int() and arithmetic operations) and selected bytes-compatible
operations (like ord(), bytes() and bytearray()) are supported on the
result. Binary indexing and iteration could then be switched to
producing a normal length 1 bytes object in Python 3.6.

It wouldn't be pretty, and it would be a pain to document, but it
seems feasible. The alternative is for PEP 467 to add a separate bytes
iteration method, which strikes me as further entrenching a design we
aren't currently happy with.

Regards,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia