On 26.04.15 06:02, Nick Coghlan wrote:
> Serhiy's suggestion covers a slightly different situation, which is that we can't warn about the following code snippet in Python 2, even though we know bytes objects don't have an encode method in Python 3:
> "str".encode(encoding)
> The reason is that we can't easily tell the difference between something that is correct in both Python 2 & 3 like:
> "text".encode("utf-8")
> (str->str encoding in Python 2, str->bytes encoding in Python 3)
> and something that will break in Python 3 like:
> "data".encode("hex")
> The single source version of the latter is actually 'codecs.encode(b"data", "hex")', but it's quite hard for an analyser or test suite to pick that up and recommend the change, as it's hard to tell the difference between "str-as-text-object" and "str-as-binary-data-object" in Python 2.
A warning about "data".encode("hex") can be implemented without this change. "hex" is not a text encoding, so we can add special flags for text and binary encodings and emit a warning if a binary encoding is used in str.encode() (but not in codecs.encode()). My suggestion covers the case of b"str".encode(encoding).
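As a rough illustration of the check described above, here is a minimal Python sketch. The set BINARY_ENCODINGS and the helper warn_if_binary_encoding() are hypothetical names invented for this example, and the warning text is made up; the real change would live in the C implementation of str.encode() and would presumably be driven by a flag stored with each codec rather than by a hard-coded set.

```python
import codecs
import warnings

# Hypothetical table for illustration only: canonical names of a few
# bytes-to-bytes codecs. A real implementation would use a flag stored
# with each codec instead of a hard-coded set.
BINARY_ENCODINGS = {"hex", "base64", "zlib", "bz2"}


def warn_if_binary_encoding(encoding):
    """Warn when `encoding` names a binary codec; return True if it did."""
    try:
        # Resolve aliases such as "hex_codec" to the canonical codec name.
        canonical = codecs.lookup(encoding).name
    except LookupError:
        return False
    if canonical in BINARY_ENCODINGS:
        warnings.warn(
            "str.encode(%r) will not work in 3.x; "
            "use codecs.encode() instead" % encoding,
            DeprecationWarning,
            stacklevel=2,
        )
        return True
    return False


# The single-source spelling, which is not meant to trigger the warning.
hex_digits = codecs.encode(b"data", "hex")
```

Under this sketch, warn_if_binary_encoding("hex") warns and warn_if_binary_encoding("utf-8") stays silent, while codecs.encode() is left alone.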
> Looking at the way string objects are stored in Python 2, it's possible that the ob_sstate field (which tracks the interning status of string instances) could potentially be co-opted to hold this additional flag. If the new flag was *only* set when "-3" was specified, then there'd only be a potential compatibility risk in that context (the PyString_CHECK_INTERNED macro currently assumes that a non-zero value in ob_sstate always indicates an interned string).
> There'd be a more general performance risk however, as PyString_CHECK_INTERNED would also need to be updated to either mask out the new "this is probably binary data" state flag unconditionally, or else to check the Py3k warning flag and mask out the new flag conditionally. Either way, we'd be making "is this interned or not?" checks slightly more expensive and the interpreter does a *lot* of those.
PyString_CHECK_INTERNED is used in only 8 places in CPython, and unconditional masking shouldn't be too expensive compared with the surrounding code. But PyString_CHECK_INTERNED can also be used in third-party extensions, and this change would break binary compatibility. So we can't change PyString_CHECK_INTERNED in a bugfix release. But we can use a byte after the terminating null byte in a string.
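To make the ob_sstate masking problem concrete, here is a small Python model of the macro's behaviour. The three SSTATE_* interning values match the ones in Python 2's stringobject.h; the SSTATE_BINARY_HINT bit and both helper functions are hypothetical and exist only for this sketch.

```python
# Interning states stored in ob_sstate (as in Python 2's stringobject.h).
SSTATE_NOT_INTERNED = 0
SSTATE_INTERNED_MORTAL = 1
SSTATE_INTERNED_IMMORTAL = 2

# Hypothetical "this is probably binary data" bit, invented for this sketch.
SSTATE_BINARY_HINT = 0x80


def check_interned_current(ob_sstate):
    """Model of the existing macro: the raw field value is the answer."""
    return ob_sstate


def check_interned_masked(ob_sstate):
    """Model of a patched macro that ignores the hypothetical flag bit."""
    return ob_sstate & ~SSTATE_BINARY_HINT


# A string that is not interned but carries the hypothetical flag.
flagged = SSTATE_NOT_INTERNED | SSTATE_BINARY_HINT
```

An extension compiled against the existing macro behaves like check_interned_current(), so it would treat the flagged string as interned; only code recompiled with the masked version would see it as not interned.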