[Python-ideas] Bytestrings in Python 2

Sun Apr 26 11:22:05 CEST 2015

On Apr 25, 2015, at 20:02, Nick Coghlan <ncoghlan at gmail.com> wrote:
> 
> On 26 April 2015 at 05:27, Markus Unterwaditzer
> <markus at unterwaditzer.net> wrote:
>> On Sat, Apr 25, 2015 at 07:14:57PM +0300, Serhiy Storchaka wrote:
>>> Here is an idea that perhaps will help to prepare Python 2 code for
>>> converting to Python 3.
>>> 
>>> Currently bytes is just an alias of str in Python 2, and the "b" prefix of
>>> string literals is ignored. There are no differences between native strings
>>> and bytes. I propose to add a special bit to str instances and set it for
>>> bytes literals and strings created from binary sources (read from binary
>>> files, received from sockets, the result of unicode.encode() and
>>> struct.pack(), etc). With the -3 flag, operations on binary strings that
>>> aren't allowed for bytes in Python 3 (e.g. encoding or coercing to unicode)
>>> will emit a warning. Unfortunately we can't change the bytes constructor in
>>> a minor version; it should remain an alias of str in 2.7. So the result of
>>> bytes() will not be tagged as a binary string.
>>> 
>>> Maybe it is too late for this.
>> 
>> You can get similar kinds of warnings with unicode-nazi
>> (https://github.com/mitsuhiko/unicode-nazi), so I'm not sure if this would be
>> that helpful.
> 
> Mentioning that utility in the porting guide could potentially be
> useful, but I don't think it's a substitute for Serhiy's suggestion
> here.
> 
> Serhiy's suggestion covers a slightly different situation, which is
> that we can't warn about the following code snippet in Python 2, even
> though we know bytes objects don't have an encode method in Python 3:
> 
>    "str".encode(encoding)
> 
> The reason is that we can't easily tell the difference between
> something that is correct in both Python 2 & 3 like:
> 
>    "text".encode("utf-8")
> 
> (str->str encoding in Python 2, str->bytes encoding in Python 3)
> 
> and something that will break in Python 3 like:
> 
>    "data".encode("hex")

I don't think that's a problem. The former is legal in both 2.x and 3.x, but it has a different meaning--in 2.x it means "decode with the default encoding (sys.getdefaultencoding(), ASCII unless someone has changed it), then re-encode to UTF-8". Unless it's called on a pure-printable-ASCII literal, there's no reason to expect that code to work without changes in 3.x.
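
To make the "decode, then re-encode" point concrete, here is a rough sketch that emulates the 2.x behaviour on 3.x bytes objects. The helper name py2_style_encode is made up for illustration, and the ASCII default is an assumption about a stock 2.x configuration:

```python
def py2_style_encode(raw, encoding, default_encoding="ascii"):
    """Emulate Python 2's str.encode() for a text codec (hypothetical helper).

    Python 2 first decodes the byte string with the default encoding,
    then encodes the resulting unicode object with the requested codec.
    """
    text = raw.decode(default_encoding)  # the implicit step in 2.x
    return text.encode(encoding)


# Pure-ASCII data survives the round trip unchanged.
ascii_result = py2_style_encode(b"text", "utf-8")

# Non-ASCII bytes (UTF-8 for "caf" plus e-acute) fail in the implicit decode.
try:
    py2_style_encode(b"caf\xc3\xa9", "utf-8")
    non_ascii_failed = False
except UnicodeDecodeError:
    non_ascii_failed = True
```

So the call only looks portable because most literals that people test with happen to be ASCII.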

(And that's even ignoring the fact that the vast majority of calls to str.encode are bugs, either called on a variable you think is a unicode but is actually a str, or just introduced by a novice throwing in random calls to encode, decode, and str until some exception goes away.)

Whether the encoding is a literal for a unicode->bytes encoding, a literal for a non-3.x-compatible encoding, or a variable whose value can't be guessed statically doesn't really matter; in every case, it's something you need to look at for porting to 3.x. So a warning still seems both doable and worth doing, whether you're talking about a static linter tool or a -3 mode that tracks probably-bytes.
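
For the static-linter side, something like the following would probably be enough as a first pass. It is only a sketch: find_encode_calls is a made-up name, it uses the Python 3 ast module (3.8 or later for ast.Constant), and it deliberately flags every .encode() method call rather than trying to guess which ones are safe:

```python
import ast


def find_encode_calls(source):
    """Return sorted (line, note) pairs for every .encode(...) method call."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "encode"):
            first = node.args[0] if node.args else None
            if isinstance(first, ast.Constant) and isinstance(first.value, str):
                note = "literal encoding %r" % first.value
            else:
                note = "no literal encoding"
            hits.append((node.lineno, note))
    return sorted(hits)


sample = (
    'a = "text".encode("utf-8")\n'
    'b = "data".encode("hex")\n'
    'c = s.encode(enc)\n'
)
report = find_encode_calls(sample)
```

All three lines of the sample get reported, which is the point: each one needs a human to look at it when porting.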

> The single source version of the latter is actually
> 'codecs.encode(b"data", "hex")', but it's quite hard for an analyser
> or test suite to pick that up and recommend the change, as it's hard
> to tell the difference between "str-as-text-object" and
> "str-as-binary-data-object" in Python 2.
> 
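
For reference, this is roughly what the two spellings do on 3.x (3.4 or later, where the "hex" alias is available to codecs.encode). The variable names are just for illustration:

```python
import codecs

# The single-source spelling works on bytes.
hex_bytes = codecs.encode(b"data", "hex")

# The method spelling is rejected on 3.x because "hex" is not a text encoding.
try:
    "data".encode("hex")
    str_encode_error = None
except LookupError as exc:
    str_encode_error = exc
```
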
> Looking at the way string objects are stored in Python 2, it's
> possible that the ob_sstate field (which tracks the interning status
> of string instances) could potentially be co-opted to hold this
> additional flag. If the new flag was *only* set when "-3" was
> specified, then there'd only be a potential compatibility risk in that
> context (the PyString_CHECK_INTERNED macro currently assumes that a
> non-zero value in ob_sstate always indicates an interned string).
> 
> There'd be a more general performance risk however, as
> PyString_CHECK_INTERNED would also need to be updated to either mask
> out the new "this is probably binary data" state flag unconditionally,
> or else to check the Py3k warning flag and mask out the new flag
> conditionally. Either way, we'd be making "is this interned or not?"
> checks slightly more expensive and the interpreter does a *lot* of
> those.
> 
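
A toy model of the masking issue, in Python rather than C. The three SSTATE_* interning values match the ones in 2.x's stringobject.h as far as I remember; SSTATE_PROBABLY_BINARY and INTERN_MASK are hypothetical names for the proposed flag and the mask it would require:

```python
SSTATE_NOT_INTERNED = 0
SSTATE_INTERNED_MORTAL = 1
SSTATE_INTERNED_IMMORTAL = 2

SSTATE_PROBABLY_BINARY = 4  # hypothetical new flag bit
INTERN_MASK = 3             # hypothetical mask covering the interning bits


def check_interned_current(sstate):
    """Stand-in for today's macro: any non-zero state counts as interned."""
    return sstate


def check_interned_masked(sstate):
    """Stand-in for an updated macro that ignores the binary flag."""
    return sstate & INTERN_MASK


# A non-interned string that has been tagged as probably binary.
tagged = SSTATE_NOT_INTERNED | SSTATE_PROBABLY_BINARY

looks_interned_today = bool(check_interned_current(tagged))
looks_interned_masked = bool(check_interned_masked(tagged))
```

The unmasked check misreports the tagged string as interned, so the extra AND (or the extra check of the Py3k warning flag) is the cost being discussed.
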
> Cheers,
> Nick.
> 
> -- 
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia