[Python-ideas] A possible transition plan to bytes-based iteration and indexing for binary data

Sun Jun 15 23:42:02 CEST 2014

Why do we need a fancy subtype when a future statement could get us the
semantics we want without breaking anything? I realize it won't work with
2.7 but at least it gives us some way forward that isn't quite so delicate.

On Sun, Jun 15, 2014, 10:11, Gregory P. Smith <greg at krypto.org> wrote:

> On Sun, Jun 15, 2014 at 5:33 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
>
>> At PyCon earlier this year, Guido (and others) persuaded me that the
>> integer based indexing and iteration for bytes and bytearray in Python
>> 3 was a genuine design mistake based on the initial Python 3 design
>> which lacked an immutable bytes type entirely (so producing integers
>> was originally the only reasonable choice).
>>
>> The earlier design discussions around PEP 467 (which proposes to clean
>> up a few other bits and pieces of that original legacy which PEP 3137
>> left in place) all treated "bytes indexing returns an integer" as an
>> unchangeable aspect of Python 3, since there wasn't an obvious way to
>> migrate to instead returning length 1 bytes objects with a reasonable
>> story to handle the incompatibility for Python 3 users, even if
>> everyone was in favour of the end result.
>>
>> A few weeks ago I had an idea for a migration strategy that seemed
>> feasible, and I now have a very, very preliminary proof of concept up
>> at
>> https://bitbucket.org/ncoghlan/cpython_sandbox/branch/bytes_migration_experiment
>>
>> The general principle involved would be to return an integer *subtype*
>> from indexing and iteration operations on bytes, bytearray and
>> memoryview objects using the "default" format character. That subtype
>> would then be detected in various locations and handled the way a
>> length 1 bytes object would be handled, rather than the way an integer
>> would be handled. The current proof of concept adds such handling to
>> ord(), bytes() and bytearray() (with appropriate test cases in
>> test_bytes) giving the following results:
>>
>> >>> b'hello'[0]
>> 104
>> >>> ord(b'hello'[0])
>> 104
>> >>> bytes(b'hello'[0])
>> b'h'
>> >>> bytearray(b'hello'[0])
>> bytearray(b'h')
>>
>> (the subtype is currently visible at the Python level as
>> "types._BytesInt")
>>
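To make the idea above concrete, here is a rough pure-Python sketch. It is not the code from the sandbox branch: the names BytesInt, index_bytes and as_bytes are made up for illustration, and the real proposal would do the detection inside ord(), bytes() and bytearray() themselves rather than in a helper function.

```python
class BytesInt(int):
    """Hypothetical stand-in for the int subtype returned by bytes indexing."""

    def __new__(cls, value):
        if not 0 <= value <= 255:
            raise ValueError("byte value must be in range(256)")
        return super().__new__(cls, value)


def index_bytes(data, i):
    """Emulate data[i] under the proposal: an int that remembers its origin."""
    return BytesInt(data[i])


def as_bytes(value):
    """Emulate the proposed bytes() handling (hypothetical helper)."""
    if isinstance(value, BytesInt):
        # Treat the subtype like a length 1 bytes object.
        return bytes([value])
    # Plain ints keep today's meaning: that many zero bytes.
    return bytes(value)


item = index_bytes(b"hello", 0)
print(item)                 # still displays as an integer
print(item + 1)             # ordinary int arithmetic is untouched
print(as_bytes(item))       # handled like a length 1 bytes object
print(len(as_bytes(104)))   # a plain int still yields 104 zero bytes
```

The point of the subtype is the last two lines: the same numeric value is treated differently depending on whether it came from indexing binary data.
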
>> The proof of concept doesn't override any normal integer behaviour,
>> but a more complete solution would be in a position to emit a warning
>> when the result of binary indexing is used as an integer (either
>> always, or controlled by a command line switch, depending on the
>> performance impact).
>>
>> With this integer subtype in place for Python 3.5 to provide a
>> transition period where both existing integer-compatible operations
>> (like int() and arithmetic operations) and selected bytes-compatible
>> operations (like ord(), bytes() and bytearray()) are supported, these
>> operations could then be switched to producing a normal length 1 bytes
>> object in Python 3.6.
>>
>> It wouldn't be pretty, and it would be a pain to document, but it
>> seems feasible. The alternative is for PEP 367 to add a separate bytes
>>
>
> I believe you mean PEP 467.
>
>
>> iteration method, which strikes me as further entrenching a design we
>> aren't currently happy with.
>>
>> Regards,
>> Nick.
>
>
> We just got rid of the mess of having multiple integer types (int vs
> long), it'd be a shame to recreate that problem in any form.
>
> The ship has sailed. Python 3 means bytes indexing returns ints. It's well
> defined and code has started to depend on it. People who want a b'A'
> instead of 0x41 know to use slice notation [n:n+1] instead of [n] to get a
> one byte bytes() as that is what is required in code that works in 2.6
> through 3.4 today. Anything we do to change it is going to be messier and
> more mysterious.
>
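For anyone who hasn't run into the slice idiom mentioned above, a minimal example of the difference on Python 3 (variable names are just for illustration):

```python
data = b"hello"

first_int = data[0]      # indexing gives an integer on Python 3
first_byte = data[0:1]   # slicing gives a length 1 bytes object

# Slicing one position at a time yields length 1 bytes objects throughout.
pieces = [data[i:i + 1] for i in range(len(data))]

print(first_int, first_byte, pieces)
```

On Python 2 the slice gives a length 1 str, which is the same type as the data there, so the slicing form behaves consistently across versions.
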
> Entertaining the idea anyways: If there is going to be a new type for
> bytes indexing, it needs to multiply inherit from both int and bytes so
> that isinstance() checks work. We'd need to make sure all C API calls that
> check for a specific type actually work with the new one as well (at first
> glance I count 57 uses of PyBytes_CheckExact and PyLong_CheckExact in
> CPython). The ambiguous * and + operator cases, and any similar ones that
> Nathaniel Smith pointed out, would still be a problem and a potential source
> of confusion for users.
>
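The quoted text doesn't spell those cases out, so the following is only my reading of the ambiguity: the same operators mean arithmetic on ints and repetition or concatenation on bytes, and a value that is both an int and a bytes would have to pick one.

```python
as_int = 97       # what b"a"[0] returns today
as_byte = b"a"    # what b"a"[0:1] returns

print(as_int * 3)          # 291, multiplication
print(as_byte * 3)         # b'aaa', repetition
print(as_int + as_int)     # 194, addition
print(as_byte + as_byte)   # b'aa', concatenation
```
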
> If anything, a new iteration method in PEP 467 that yields length 1
> bytes() makes *some* sense for convenience, but I don't personally see
> much use for single byte iteration of any form in a high level language.
>
> It is odd to me that str and bytes *ever* supported iteration. How many
> times have we each written code to check that a passed argument was "a
> sequence but, oh, wait, not a string, because you didn't *really* mean to
> do that"? That was a Python 1 decision. Oops. :)
>
> -gps