[Python-ideas] Bytestrings in Python 2

Nick Coghlan ncoghlan at gmail.com
Sun Apr 26 05:02:32 CEST 2015


On 26 April 2015 at 05:27, Markus Unterwaditzer
<markus at unterwaditzer.net> wrote:
> On Sat, Apr 25, 2015 at 07:14:57PM +0300, Serhiy Storchaka wrote:
>> Here is an idea that perhaps will help to prepare Python 2 code for
>> converting to Python 3.
>>
>> Currently bytes is just an alias of str in Python 2, and the "b" prefix of
>> string literals is ignored. There are no differences between natural strings
>> and bytes. I propose to add a special bit to str instances and set it for
>> bytes literals and strings created from binary sources (read from binary
>> files, received from sockets, the result of unicode.encode() and
>> struct.pack(), etc.). With the -3 flag, operations on binary strings that
>> aren't allowed for bytes in Python 3 (e.g. encoding or coercing to unicode)
>> will emit a warning. Unfortunately we can't change the bytes constructor in
>> a minor version; it should remain an alias of str in 2.7. So the result of
>> bytes() will not be tagged as a binary string.
>>
>> Maybe it is too late for this.
>
> You can get similar kinds of warnings with unicode-nazi
> (https://github.com/mitsuhiko/unicode-nazi), so I'm not sure if this would be
> that helpful.

Mentioning that utility in the porting guide could potentially be
useful, but I don't think it's a substitute for Serhiy's suggestion
here.

Serhiy's suggestion covers a slightly different situation, which is
that we can't warn about the following code snippet in Python 2, even
though we know bytes objects don't have an encode method in Python 3:

    "str".encode(encoding)

The reason is that we can't easily tell the difference between
something that is correct in both Python 2 & 3 like:

    "text".encode("utf-8")

(str->str encoding in Python 2, str->bytes encoding in Python 3)

and something that will break in Python 3 like:

    "data".encode("hex")

The single-source version of the latter is actually
'codecs.encode(b"data", "hex")', but it's quite hard for an analyser
or test suite to pick that up and recommend the change, because Python 2
gives it no way to tell "str-as-text-object" apart from
"str-as-binary-data-object".
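
As a rough illustration of the three cases above, here is a minimal
sketch. It assumes Python 2.7 or Python 3.4 and later (where, as far as I
know, the "hex" alias is accepted by codecs.encode); the helper name
method_form_works is made up for this example:

```python
import codecs
import sys

# Text -> bytes encoding: valid in both Python 2 and Python 3.
encoded_text = u"text".encode("utf-8")

# Binary -> binary transform spelled via the codecs module: the
# single-source form, assumed to work on Python 2.7 and Python 3.4+.
hex_digits = codecs.encode(b"data", "hex")


def method_form_works():
    """Return True if the method spelling "data".encode("hex") is accepted.

    Hypothetical helper for illustration only. On Python 2 the call
    succeeds; on Python 3 str.encode() rejects "hex" with a LookupError
    because it is not a text encoding.
    """
    try:
        "data".encode("hex")
    except LookupError:
        return False
    return True


if __name__ == "__main__":
    print(sys.version_info[0], encoded_text, hex_digits, method_form_works())
```

Nothing in the source text of the two .encode() calls distinguishes them
apart from the codec name, which is why a static check struggles here.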

Looking at the way string objects are stored in Python 2, the
ob_sstate field (which tracks the interning status of string
instances) could potentially be co-opted to hold this additional
flag. If the new flag were *only* set when "-3" was specified, then
there'd only be a potential compatibility risk in that context (the
PyString_CHECK_INTERNED macro currently assumes that a non-zero value
in ob_sstate always indicates an interned string).

There'd be a more general performance risk, however, as
PyString_CHECK_INTERNED would also need to be updated to either mask
out the new "this is probably binary data" state flag unconditionally,
or else to check the Py3k warning flag and mask out the new flag
conditionally. Either way, we'd be making "is this interned or not?"
checks slightly more expensive, and the interpreter does a *lot* of
those.
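
To make the masking question concrete, here is a small Python model of
the two options. It is only a sketch: the three interning states match
the values in Python 2.7's stringobject.h, but SSTATE_BINARY_FLAG,
SSTATE_INTERN_MASK and the function names are hypothetical and don't
exist in CPython (the real check is a C macro, not a function):

```python
# Interning states as defined in Python 2.7's stringobject.h.
SSTATE_NOT_INTERNED = 0
SSTATE_INTERNED_MORTAL = 1
SSTATE_INTERNED_IMMORTAL = 2

# Hypothetical additions for this sketch; not part of CPython.
SSTATE_BINARY_FLAG = 4   # "this is probably binary data"
SSTATE_INTERN_MASK = 3   # covers the two interning bits only


def check_interned_current(sstate):
    """Model of the existing check: any non-zero state means interned."""
    return sstate != 0


def check_interned_masked(sstate):
    """Option 1: always mask out the binary flag before testing."""
    return (sstate & SSTATE_INTERN_MASK) != 0


def check_interned_conditional(sstate, py3k_warning_enabled):
    """Option 2: only pay for the mask when the -3 flag is active."""
    if py3k_warning_enabled:
        return (sstate & SSTATE_INTERN_MASK) != 0
    return sstate != 0


# A string that is tagged as binary but has never been interned.
tagged_not_interned = SSTATE_NOT_INTERNED | SSTATE_BINARY_FLAG

if __name__ == "__main__":
    print(check_interned_current(tagged_not_interned))
    print(check_interned_masked(tagged_not_interned))
    print(check_interned_conditional(tagged_not_interned, True))
```

The unmodified check misreports the tagged string as interned, while
both updated variants add either a mask or a branch to every check.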

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

