[Python-ideas] RFC: bytestring as a str representation [was: a new bytestring type?]
Stephen J. Turnbull
stephen at xemacs.org
Wed Jan 8 07:04:44 CET 2014
I'm responding here rather than directly to Steven because Andrew
explains it as well as I could. In all cases where I don't comment,
Andrew is 100% correct as to my intended semantics.
The critical point is just that in cases where "the ASCII characters
are themselves" and an 8-bit representation is theoretically possible,
an 8-bit representation is used. More precisely, if the identities of
128-255 as characters are not important to the programmer, these bytes
are not interpreted as characters, in the same way that
surrogate-escaped bytes are uninterpreted in the current representation.
Andrew Barnert writes:
> I think Stephen's name "7-bit" is confusing people.
Indeed, and I apologize for confusing Steven in particular, which is
entirely due to that poor choice.
> If you try to interpret the name sensibly, you get Steven's broken
> interpretation. But if you read it as a nonsense word and work
> through the logic, it all makes sense.
Maybe "ascii-compatible" is better. It's a union type, including all
encodings where octets 0-127 receive the standard mapping to the ASCII
characters, but octets 128-255 are ambiguous.
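To make the "union type" idea concrete, here is a small sketch in
current Python. The three codecs are just examples I picked of
encodings that would fall under "ascii-compatible"; nothing here is
the proposed codec itself.

```python
# Illustration only: three real codecs that all count as "ASCII-compatible".
CODECS = ["latin-1", "cp1251", "koi8_r"]

# Octets 0-127 receive the standard ASCII mapping under every one of them.
ascii_octets = bytes(range(128))
ascii_views = {codec: ascii_octets.decode(codec) for codec in CODECS}

# An octet in 128-255 is ambiguous: which character it is depends on the codec.
high_views = {codec: b"\xe9".decode(codec) for codec in CODECS}

print(len(set(ascii_views.values())))  # distinct readings of octets 0-127
print(len(set(high_views.values())))   # distinct readings of octet 0xE9
```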
> > Suppose we take a byte-string with a non-ASCII byte:
> >
> > b'abc\xFF'.decode('ascii-compatible')
> >
> > This will return... what? I think it returns a so-called 7-bit
> > representation, but I'm not sure what it is a representation of.
>
> The representation is the bytes 61 62 63 FF with the floobl flag
> set. It's a representation of an 'a' char, a 'b' char, a 'c' char,
> and a smuggled FF byte--identical to 'abc\uDCFF'.
Yes, except that the 8-bit representation is invisible to Python code,
other than perhaps through the timeit package.
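The 'ascii-compatible' codec named above does not exist, but the
visible result can be approximated today with the ascii codec plus the
surrogateescape handler. A minimal sketch, assuming that approximation
is acceptable for illustration:

```python
raw = b"abc\xff"

# Stand-in for the hypothetical 'ascii-compatible' codec: ASCII octets become
# characters, and the 0xFF byte is smuggled through as the lone surrogate U+DCFF.
text = raw.decode("ascii", errors="surrogateescape")
print(ascii(text))

# Encoding with the same handler recovers the original bytes unchanged.
round_trip = text.encode("ascii", errors="surrogateescape")
print(round_trip)
```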
> (This last bit is the part I'm a bit wary of, as it promoted
> surrogate-escape to being an inherent part of the meaning of
> Unicode strings in Python.
Surrogate escapes are already part of the inherent meaning of Unicode strings. The
alternative is to read ASCII-compatible streams as latin1, which
*changes their meaning*.
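A short sketch of what "changes their meaning" looks like in current
Python. The variable names are only for illustration.

```python
raw = b"abc\xff"

as_latin1 = raw.decode("latin-1")
as_escaped = raw.decode("ascii", errors="surrogateescape")

# latin-1 asserts that 0xFF *is* the letter U+00FF, so text methods treat it
# as one; the surrogate-escaped byte stays an uninterpreted non-character.
latin1_is_letter = as_latin1[-1].isalpha()
escaped_is_letter = as_escaped[-1].isalpha()

# Re-encoding to UTF-8 rewrites the latin-1 reading into two different bytes,
# while the escaped reading hands back the original byte.
latin1_out = as_latin1.encode("utf-8")
escaped_out = as_escaped.encode("utf-8", errors="surrogateescape")

print(latin1_is_letter, escaped_is_letter)
print(latin1_out, escaped_out)
```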
> > Your description confuses me. The "7-bit string" is already text, how do
> > you decode it to the 16-bit internal representation?
>
> By decoding its representation as if it were bytes, using surrogate-escape.
Strictly speaking, it's not a "decoding", it's a change of internal
representation.
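As a rough model of that change of representation, written in pure
Python: widen() is a hypothetical name, and the real operation would
live inside the C implementation of str, not in a function like this.

```python
def widen(raw: bytes) -> str:
    """Hypothetical model of widening a flagged 8-bit buffer to wide storage.

    Octets 0-127 keep their ASCII identity; octets 128-255 become the lone
    surrogates U+DC80..U+DCFF, the same mapping surrogateescape uses.
    """
    return "".join(chr(b) if b < 0x80 else chr(0xDC00 + b) for b in raw)


every_octet = bytes(range(256))
widened = widen(every_octet)

# No codec table is consulted, yet the visible result matches what the
# surrogateescape handler produces for the same bytes.
same_as_handler = widened == every_octet.decode("ascii", errors="surrogateescape")
print(same_as_handler)
```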
> >> 5. String methods that would raise or produce undefined results if
> >> used on str containing surrogate-encoded bytes need to be taught
> >> to do the same on non-ASCII bytes in 7-bit str objects.
> >
> > Do you have an example of such string methods?
No, I don't, but I imagined there might be some. (My original example
was case conversion, but that doesn't work: Python doesn't even check
whether something is actually a code point that can be a character. It
just notices that surrogate-encoded bytes don't have alternative cases
in the database and passes them through.)
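The case-conversion behaviour is easy to check in current Python; a
minimal sketch:

```python
text = "abc\udcff"

# The ASCII letters change case; the surrogate-escaped byte has no case
# mapping in the database, so it is passed through untouched.
upper = text.upper()
swapped = text.swapcase()
lower_again = upper.lower()

print(ascii(upper), ascii(swapped), ascii(lower_again))
```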
> >> 7. On output other codecs raise on a 7-bit str, unless the
> >> surrogateescape handler is in use.
> >
> > What do you mean by "on output"? Do you mean when encoding?
Yes. You (all, but Steven in particular) have my apology for the
imprecision.
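Current str objects holding surrogate-escaped bytes already behave this
way when encoding, which is the behaviour the rule is modelled on. A
sketch using three existing codecs (chosen only as examples):

```python
text = "abc\udcff"
results = {}

for codec in ("ascii", "latin-1", "utf-8"):
    # With the default strict handler, every codec refuses the lone surrogate.
    try:
        text.encode(codec)
        strict_raises = False
    except UnicodeEncodeError:
        strict_raises = True

    # With surrogateescape, the smuggled byte is written back out as 0xFF.
    escaped = text.encode(codec, errors="surrogateescape")
    results[codec] = (strict_raises, escaped)

print(results)
```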
> However, I think there's a mistake in the design of 6 here. Surely
> encoding 'abc\uDCFF' should give you the bytes 61 62 63 FF, not an
> exception, right? (Unless the idea is that such a string is
> guaranteed to have a floobl-flagged 8-bit representation, not a
> 16-bit one, no matter how you try to create it in Python or in C,
> and I don't think the other rules make that guarantee.)
Andrew is correct, that is a mistake in design. I thought an 8-bit
representation was guaranteed in that case, with the "floobl" flag
set. I think that Andrew's idea is correct, but this oversight makes me
nervous about the coherence of the concept.