On 26 April 2015 at 05:27, Markus Unterwaditzer email@example.com wrote:
On Sat, Apr 25, 2015 at 07:14:57PM +0300, Serhiy Storchaka wrote:
Here is an idea that perhaps will help to prepare Python 2 code for converting to Python 3.
Currently bytes is just an alias of str in Python 2, and the "b" prefix of string literals is ignored. There is no difference between natural strings and bytes. I propose adding a special bit to str instances and setting it for bytes literals and strings created from binary sources (read from binary files, received from sockets, the result of unicode.encode() and struct.pack(), etc.). With the -3 flag, operations on binary strings that aren't allowed for bytes in Python 3 (e.g. encoding or coercing to unicode) would emit a warning. Unfortunately, we can't change the bytes constructor in a minor version; it has to remain an alias of str in 2.7. So the result of bytes() would not be tagged as a binary string.
Maybe it is too late for this.
You can get similar kinds of warnings with unicode-nazi (https://github.com/mitsuhiko/unicode-nazi), so I'm not sure if this would be that helpful.
Mentioning that utility in the porting guide could potentially be useful, but I don't think it's a substitute for Serhiy's suggestion here.
Serhiy's suggestion covers a slightly different situation, which is that we can't warn about the following code snippet in Python 2, even though we know bytes objects don't have an encode method in Python 3:
The reason is that we can't easily tell the difference between something that is correct in both Python 2 & 3 like:
(str->str encoding in Python 2, str->bytes encoding in Python 3)
and something that will break in Python 3 like:
The single source version of the latter is actually 'codecs.encode(b"data", "hex")', but it's quite hard for an analyser or test suite to pick that up and recommend the change, because Python 2 offers no easy way to tell "str-as-text-object" apart from "str-as-binary-data-object".
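As a rough sketch of the distinction being drawn here (the literals "text" and "data" and the "utf-8" codec are illustrative assumptions, not taken from the original message), the two kinds of call look like this:

```python
import codecs

# Text encoding: the same spelling is valid in Python 2 and Python 3
# (str -> str in Python 2, str -> bytes in Python 3).
encoded_text = "text".encode("utf-8")

# Binary transform spelled through the codecs module: the single-source
# form that works on both Python 2 and Python 3.
hex_data = codecs.encode(b"data", "hex")

# Python 2's str has an encode() method, so b"data".encode("hex") works
# there. Python 3's bytes has no encode() method, so that spelling breaks.
bytes_has_encode = hasattr(b"data", "encode")

print(encoded_text, hex_data, bytes_has_encode)
```

In Python 2 both calls look like a method call on a plain str, which is why a warning cannot be attached to just one of them without extra information about the object.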
Looking at the way string objects are stored in Python 2, the ob_sstate field (which tracks the interning status of string instances) could potentially be co-opted to hold this additional flag. If the new flag were *only* set when "-3" was specified, then there'd only be a potential compatibility risk in that context (the PyString_CHECK_INTERNED macro currently assumes that a non-zero value in ob_sstate always indicates an interned string).
There'd be a more general performance risk, however, as PyString_CHECK_INTERNED would also need to be updated to either mask out the new "this is probably binary data" state flag unconditionally, or else check the Py3k warning flag and mask out the new flag conditionally. Either way, we'd be making "is this interned or not?" checks slightly more expensive, and the interpreter does a *lot* of those.
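To make the two masking options concrete, here is a small Python model of the check. It is only a sketch: the real check is a C macro that yields the raw ob_sstate value, and the flag name SSTATE_BINARY_FLAG, its bit value, and the three function names are hypothetical, chosen purely for illustration.

```python
# Interning states as defined in CPython 2's stringobject.h.
SSTATE_NOT_INTERNED = 0
SSTATE_INTERNED_MORTAL = 1
SSTATE_INTERNED_IMMORTAL = 2

# Hypothetical extra bit for "this is probably binary data". The name and
# value are assumptions for illustration, not part of CPython.
SSTATE_BINARY_FLAG = 0x04
SSTATE_INTERN_MASK = 0x03


def check_interned_current(ob_sstate):
    """Model of the existing check: any non-zero state means interned."""
    return ob_sstate != 0


def check_interned_masked(ob_sstate):
    """Option 1: always mask out the binary flag before testing."""
    return (ob_sstate & SSTATE_INTERN_MASK) != 0


def check_interned_conditional(ob_sstate, py3k_warnings_enabled):
    """Option 2: only mask when the -3 flag is active."""
    if py3k_warnings_enabled:
        return (ob_sstate & SSTATE_INTERN_MASK) != 0
    return ob_sstate != 0


# A string tagged as binary but not interned.
tagged_not_interned = SSTATE_NOT_INTERNED | SSTATE_BINARY_FLAG

print(check_interned_current(tagged_not_interned))
print(check_interned_masked(tagged_not_interned))
print(check_interned_conditional(tagged_not_interned, True))
```

The unmodified check misreports the tagged string as interned, while both updated variants get it right at the cost of an extra mask or branch on every check.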