[Python-ideas] Bytestrings in Python 2

Sun Apr 26 07:38:23 CEST 2015

On 26.04.15 06:02, Nick Coghlan wrote:
> Serhiy's suggestion covers a slightly different situation, which is
> that we can't warn about the following code snippet in Python 2, even
> though we know bytes objects don't have an encode method in Python 3:
>
>      "str".encode(encoding)
>
> The reason is that we can't easily tell the difference between
> something that is correct in both Python 2 & 3 like:
>
>      "text".encode("utf-8")
>
> (str->str encoding in Python 2, str->bytes encoding in Python 3)
>
> and something that will break in Python 3 like:
>
>      "data".encode("hex")
>
> The single source version of the latter is actually
> 'codecs.encode(b"data", "hex")', but it's quite hard for an analyser
> or test suite to pick that up and recommend the change, as it's hard
> to tell the difference between "str-as-text-object" and
> "str-as-binary-data-object" in Python 2.

A warning about "data".encode("hex") can be implemented without this
change. "hex" is not a text encoding, so we can add special flags that
mark codecs as text or binary encodings, and emit a warning if a binary
encoding is used in str.encode() (but not in codecs.encode()).
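
As a rough sketch of that idea (written in Python 3 syntax; the
BINARY_TRANSFORMS table and both helper names are made up for
illustration and are not an existing CPython API), the check could look
something like this:

```python
import codecs
import warnings

# Hypothetical flag table: canonical names of codecs that transform
# binary data rather than encode text. Not an existing CPython API.
BINARY_TRANSFORMS = frozenset(["hex", "base64", "zlib", "bz2"])


def is_binary_transform(encoding):
    """Return True if the codec name resolves to a binary transform."""
    # codecs.lookup() normalizes aliases such as "hex_codec" to "hex".
    return codecs.lookup(encoding).name in BINARY_TRANSFORMS


def check_str_encode(encoding):
    """Emulate the check str.encode() might perform (an assumption).

    Returns True and emits a DeprecationWarning when the codec is a
    binary transform; returns False for ordinary text encodings.
    """
    if is_binary_transform(encoding):
        warnings.warn(
            "str.encode(%r) is a binary transform; "
            "use codecs.encode() instead" % encoding,
            DeprecationWarning,
            stacklevel=2,
        )
        return True
    return False
```

Here check_str_encode("hex") would warn while check_str_encode("utf-8")
would not, and codecs.encode(b"data", "hex") would stay silent because
the check lives only on the str.encode() path.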

My suggestion covers the case of

      b"str".encode(encoding)

> Looking at the way string objects are stored in Python 2, it's
> possible that the ob_sstate field (which tracks the interning status
> of string instances) could potentially be co-opted to hold this
> additional flag. If the new flag was *only* set when "-3" was
> specified, then there'd only be a potential compatibility risk in that
> context (the PyString_CHECK_INTERNED macro currently assumes that a
> non-zero value in ob_sstate always indicates an interned string).
>
> There'd be a more general performance risk however, as
> PyString_CHECK_INTERNED would also need to be updated to either mask
> out the new "this is probably binary data" state flag unconditionally,
> or else to check the Py3k warning flag and mask out the new flag
> conditionally. Either way, we'd be making "is this interned or not?"
> checks slightly more expensive and the interpreter does a *lot* of
> those.

PyString_CHECK_INTERNED is used in only 8 places in CPython, and
unconditionally masking out the flag shouldn't be too expensive compared
with the surrounding code. But PyString_CHECK_INTERNED can also be used
in third-party extensions, and changing it would break binary
compatibility, so we can't change it in a bugfix release. Instead, we
can use a byte after the terminating null byte of the string.
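
To illustrate the layout I have in mind, here is a toy model in Python
(not CPython code; the function names and flag values are hypothetical,
and it assumes every string allocation reserves one extra byte):

```python
# Hypothetical flag values stored in the extra byte.
FLAG_TEXT = 0
FLAG_BINARY = 1


def alloc_string(payload, flag=FLAG_TEXT):
    """Model a buffer laid out as: payload, NUL terminator, flag byte."""
    return bytearray(payload) + bytearray([0, flag])


def string_payload(buf, size):
    """Existing readers look only at the first `size` bytes."""
    return bytes(buf[:size])


def string_flag(buf, size):
    """New code reads the byte stored just past the terminating NUL."""
    assert buf[size] == 0  # the terminator itself is left untouched
    return buf[size + 1]
```

Code that knows only the string's size and its terminating null byte
never reads the extra byte, so ob_sstate and PyString_CHECK_INTERNED
stay as they are. The cost is one more byte per string.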