[Python-ideas] RFC: bytestring as a str representation [was: a new bytestring type?]

Wed Jan 8 07:04:44 CET 2014

I'm responding here rather than directly to Steven because Andrew
explains it as well as I could.  In all cases where I don't comment,
Andrew is 100% correct as to my intended semantics.

The critical point is just that in cases where "the ASCII characters
are themselves" and an 8-bit representation is theoretically possible,
an 8-bit representation is used.  More precisely, if the identities of
128-255 as characters are not important to the programmer, these bytes
are not interpreted as characters, in the same way that surrogate-
escaped bytes are uninterpreted in the current representation.

Andrew Barnert writes:

 > I think Stephen's name "7-bit" is confusing people.

Indeed, and I apologize for confusing Steven in particular, which is
entirely due to that poor choice.

 > If you try to interpret the name sensibly, you get Steven's broken
 > interpretation. But if you read it as a nonsense word and work
 > through the logic, it all makes sense.

Maybe "ascii-compatible" is better.  It's a union type, including all
encodings where octets 0-127 receive the standard mapping to the ASCII
characters, but octets 128-255 are ambiguous.

 > > Suppose we take a byte-string with a non-ASCII byte:
 > > 
 > >    b'abc\xFF'.decode('ascii-compatible')
 > > 
 > > This will return... what? I think it returns a so-called 7-bit 
 > > representation, but I'm not sure what it is a representation of.
 > 
 > The representation is the bytes 61 62 63 FF with the floobl flag
 > set. It's a representation of an 'a' char, a 'b' char, a 'c' char,
 > and a smuggled FF byte--identical to 'abc\uDCFF'.

Yes, except that it's an 8-bit internal representation, invisible to
Python code except perhaps through the timeit module.
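
For concreteness, here is a minimal sketch of the str value Andrew
describes, using only what exists today.  The 'ascii-compatible' codec
and the "floobl" flag are proposals, not real APIs, so this substitutes
the existing 'ascii' codec with the surrogateescape error handler and
says nothing about the internal representation:

```python
# Assumption: the existing 'ascii' codec plus surrogateescape stands in
# for the proposed (nonexistent) 'ascii-compatible' codec.
raw = b'abc\xFF'
text = raw.decode('ascii', errors='surrogateescape')

# The non-ASCII byte 0xFF is smuggled as the lone surrogate U+DCFF.
assert text == 'abc\udcff'

# Encoding with the same handler restores the original bytes.
roundtrip = text.encode('ascii', errors='surrogateescape')
assert roundtrip == raw
```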

 > (This last bit is the part I'm a bit wary of, as it promoted
 > surrogate-escape to being an inherent part of the meaning of
 > Unicode strings in Python.

They're already part of the inherent meaning of Unicode strings.  The
alternative is to read ASCII-compatible streams as latin1, which
*changes their meaning*.
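
A small sketch of what "changes their meaning" looks like in current
Python, again with the existing 'ascii' codec standing in for the
proposed one.  The variable names are only illustrative:

```python
raw = b'abc\xFF'

as_latin1 = raw.decode('latin-1')
as_escaped = raw.decode('ascii', errors='surrogateescape')

# latin-1 asserts that byte 0xFF *is* the character U+00FF ...
assert as_latin1 == 'abc\u00ff'

# ... so writing it back out as UTF-8 silently produces different bytes.
latin1_utf8 = as_latin1.encode('utf-8')
assert latin1_utf8 == b'abc\xc3\xbf'

# surrogateescape keeps the byte uninterpreted, so it round-trips intact.
escaped_utf8 = as_escaped.encode('utf-8', errors='surrogateescape')
assert escaped_utf8 == raw
```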

 > > Your description confuses me. The "7-bit string" is already text, how do 
 > > you decode it to the 16-bit internal representation? 
 > 
 > By decoding its representation as if it were bytes, using surrogate-escape.

Strictly speaking, it's not a "decoding", it's a change of internal
representation.

 > >> 5.  String methods that would raise or produce undefined results if
 > >>    used on str containing surrogate-encoded bytes need to be taught
 > >>    to do the same on non-ASCII bytes in 7-bit str objects.
 > > 
 > > Do you have an example of such string methods?

No, I don't, but I imagined there might be some.  (My original example
was case conversion, but that doesn't work: Python doesn't even check
whether a code point can actually be a character.  It just notices
that surrogate-encoded bytes have no alternative cases in the database
and passes them through.)
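
The pass-through behavior can be checked on an ordinary str in current
Python.  This is only a sketch of today's behavior, not of the
proposed 7-bit representation:

```python
text = b'abc\xFF'.decode('ascii', errors='surrogateescape')

# The ASCII letters are upper-cased; the escaped byte has no case
# mapping and is passed through unchanged.
upper = text.upper()
assert upper == 'ABC\udcff'

# The smuggled byte is still recoverable after the conversion.
upper_bytes = upper.encode('ascii', errors='surrogateescape')
assert upper_bytes == b'ABC\xFF'
```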

 > >> 7.  On output other codecs raise on a 7-bit str, unless the
 > >>    surrogateescape handler is in use.
 > > 
 > > What do you mean by "on output"? Do you mean when encoding?

Yes.  You (all, but Steven in particular) have my apology for the
imprecision.
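
As a rough illustration of "raise unless the surrogateescape handler
is in use", this is how an ordinary str containing an escaped byte
already behaves when encoded today.  Whether a 7-bit str would behave
identically is part of the proposal, not something this demonstrates:

```python
text = 'abc\udcff'

# With the default strict handler, the lone surrogate cannot be encoded.
try:
    text.encode('utf-8')
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True
assert strict_raised

# With surrogateescape, the original byte comes back out.
recovered = text.encode('utf-8', errors='surrogateescape')
assert recovered == b'abc\xff'
```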

 > However, I think there's a mistake in the design of 6 here. Surely
 > encoding 'abc\uDCFF' should give you the bytes 61 62 63 FF, not an
 > exception, right? (Unless the idea is that such a string is
 > guaranteed to have a floobl-flagged 8-bit representation, not a
 > 16-bit one, no matter how you try to create it in Python or in C,
 > and I don't think the other rules make that guarantee.)

Andrew is correct; that is a mistake in the design.  I thought an 8-bit
representation was guaranteed in that case, with the "floobl" flag
set.  I think that Andrew's idea is correct, but this oversight makes
me nervous about the coherence of the concept.