[Python-ideas] A possible transition plan to bytes-based iteration and indexing for binary data

Sun Jun 15 23:57:12 CEST 2014

On Sun, Jun 15, 2014 at 2:42 PM, Dr. Brett Cannon <bcannon at gmail.com> wrote:

> Why do we need a fancy subtype when a future statement could get us the
> semantics we want without breaking anything? I realize it won't work with
> 2.7 but at least it gives us some way forward that isn't quite so delicate.

How could it? Within a single file where such a statement applies, the
compiler has no knowledge of what the types are.

In order for this to work you would need to have your __future__ statement
alter the behavior of *all* [] and iteration done within the file to
conditionally take a code path that does something different iff the type
being operated on is determined at runtime to be bytes.
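
To illustrate, here is a minimal sketch (first_item and future_getitem are
hypothetical names made up for this example, not anything proposed in this
thread). The same subscript expression is compiled once and only learns the
operand type at runtime, so a per-file switch would have to wrap every []
in a check along these lines:

```python
def first_item(data):
    # The compiler only sees "subscript data with 0"; it cannot know
    # whether data will turn out to be bytes, str, a list, or anything else.
    return data[0]


def future_getitem(obj, index):
    # Hypothetical helper: roughly what every [] in an affected file would
    # have to do if a __future__ statement changed bytes indexing.
    result = obj[index]
    if isinstance(obj, (bytes, bytearray)) and isinstance(index, int):
        # Integer indexing of binary data: wrap the int in a length 1 bytes.
        return bytes([result])
    # Slices and every other type keep their normal behaviour.
    return result


results = [first_item(b"hello"), first_item("hello"), first_item([10, 20])]
print(results)                        # [104, 'h', 10]
print(future_getitem(b"hello", 0))    # b'h'
print(future_getitem([10, 20], 1))    # 20
```

Every subscript and every for loop in the file would pay for that runtime
type check, whether or not it ever touches bytes.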

-gps

>
> On Sun, Jun 15, 2014, 10:11, Gregory P. Smith <greg at krypto.org> wrote:
>
>>  On Sun, Jun 15, 2014 at 5:33 AM, Nick Coghlan <ncoghlan at gmail.com>
>> wrote:
>>
>>> At PyCon earlier this year, Guido (and others) persuaded me that the
>>> integer based indexing and iteration for bytes and bytearray in Python
>>> 3 was a genuine design mistake based on the initial Python 3 design
>>> which lacked an immutable bytes type entirely (so producing integers
>>> was originally the only reasonable choice).
>>>
>>> The earlier design discussions around PEP 467 (which proposes to clean
>>> up a few other bits and pieces of that original legacy which PEP 3137
>>> left in place) all treated "bytes indexing returns an integer" as an
>>> unchangeable aspect of Python 3, since there wasn't an obvious way to
>>> migrate to instead returning length 1 bytes objects with a reasonable
>>> story to handle the incompatibility for Python 3 users, even if
>>> everyone was in favour of the end result.
>>>
>>> A few weeks ago I had an idea for a migration strategy that seemed
>>> feasible, and I now have a very, very preliminary proof of concept up
>>> at
>>> https://bitbucket.org/ncoghlan/cpython_sandbox/branch/bytes_migration_experiment
>>>
>>> The general principle involved would be to return an integer *subtype*
>>> from indexing and iteration operations on bytes, bytearray and
>>> memoryview objects using the "default" format character. That subtype
>>> would then be detected in various locations and handled the way a
>>> length 1 bytes object would be handled, rather than the way an integer
>>> would be handled. The current proof of concept adds such handling to
>>> ord(), bytes() and bytearray() (with appropriate test cases in
>>> test_bytes) giving the following results:
>>>
>>> >>> b'hello'[0]
>>> 104
>>> >>> ord(b'hello'[0])
>>> 104
>>> >>> bytes(b'hello'[0])
>>> b'h'
>>> >>> bytearray(b'hello'[0])
>>> bytearray(b'h')
>>>
>>> (the subtype is currently visible at the Python level as
>>> "types._BytesInt")
>>>
>>> The proof of concept doesn't override any normal integer behaviour,
>>> but a more complete solution would be in a position to emit a warning
>>> when the result of binary indexing is used as an integer (either
>>> always, or controlled by a command line switch, depending on the
>>> performance impact).
>>>
>>> With this integer subtype in place for Python 3.5 to provide a
>>> transition period where both existing integer-compatible operations
>>> (like int() and arithmetic operations) and selected bytes-compatible
>>> operations (like ord(), bytes() and bytearray()) are supported, these
>>> operations could then be switched to producing a normal length 1 bytes
>>> object in Python 3.6.
>>>
>>> It wouldn't be pretty, and it would be a pain to document, but it
>>> seems feasible. The alternative is for PEP 367 to add a separate bytes
>>>
>>
>> I believe you mean PEP 467.
>>
>>
>>> iteration method, which strikes me as further entrenching a design we
>>> aren't currently happy with.
>>>
>>> Regards,
>>> Nick.
>>
>>
>> We just got rid of the mess of having multiple integer types (int vs
>> long); it'd be a shame to recreate that problem in any form.
>>
>> The ship has sailed. Python 3 means bytes indexing returns ints. It's
>> well defined and code has started to depend on it. People who want a b'A'
>> instead of 0x41 know to use slice notation [n:n+1] instead of [n] to get a
>> one byte bytes() as that is what is required in code that works in 2.6
>> through 3.4 today. Anything we do to change it is going to be messier and
>> more mysterious.
>>
>> Entertaining the idea anyway: If there is going to be a new type for
>> bytes indexing, it needs to multiply inherit from both int and bytes so
>> that isinstance() checks work. We'd need to make sure all C API calls that
>> check for a specific type actually work with the new one as well (at first
>> glance I count 57 uses of PyBytes_CheckExact and PyLong_CheckExact in
>> CPython). The ambiguous * and + operator cases, and any similar ones that
>> Nathaniel Smith pointed out, would still be a problem and a potential
>> source of confusion for users.
>>
>> If anything, a new iteration method in PEP 467 that yields length 1
>> bytes() makes *some* sense for convenience, but I don't personally see
>> much use for single byte iteration of any form in a high level language.
>>
>> It is odd to me that str and bytes *ever* supported iteration. How many
>> times have we each written code to check that a passed argument was "a
>> sequence but, oh, wait, not a string, because you didn't *really* mean
>> to do that"? That was a Python 1 decision. Oops. :)
>>
>> -gps