[Python-ideas] RFC: bytestring as a str representation [was: a new bytestring type?]

Tue Jan 7 18:46:15 CET 2014

I think Stephen's name "7-bit" is confusing people. If you try to interpret the name sensibly, you get Steven's broken interpretation. But if you read it as a nonsense word and work through the logic, it all makes sense.

On Jan 7, 2014, at 7:44, Steven D'Aprano <steve at pearwood.info> wrote:

> On Tue, Jan 07, 2014 at 03:37:36AM +0900, Stephen J. Turnbull wrote:
> 
>> So ... now that we have the flexible string representation (PEP 393),
>> let's add a 7-bit representation!  (Don't take that too seriously,
>> there are interesting more general variants I'm not going to talk
>> about tonight.)
>> 
>> The 7-bit representation satisfies the following requirements:
>> 
>> 1.  It is only produced on input by a new 'ascii-compatible' codec,
>>    which sets the "7-bit representation" flag in the str object on
>>    input if it encounters any non-ASCII bytes (if pure ASCII, it
>>    produces an 8-bit str object).  This will be slower than just
>>    reading in the bytes in many cases, but I hope not unacceptably so.
> 
> I'm confused by your suggestion here. It seems to me that you've got the 
> conditions backwards. (Or I don't understand them.) Perhaps a couple of 
> examples will make it clear.
> 
> Suppose we take a pure-ASCII byte-string and decode it:
> 
>    b'abcd'.decode('ascii-compatible')
> 
> According to the above, this will produce a regular str object, 'abcd', 
> using the regular 8-bit internal representation, and the "7-bit repr" 
> flag cleared. Correct? (So the flag is *cleared* when all the chars in 
> the string are 7-bit, and *set* when at least one is not. Yes?)

Correct. The "floobl" representation (my nonsense-word stand-in for "7-bit") is not used because there are no non-ASCII bytes.

> Suppose we take a byte-string with a non-ASCII byte:
> 
>    b'abc\xFF'.decode('ascii-compatible')
> 
> This will return... what? I think it returns a so-called 7-bit 
> representation, but I'm not sure what it is a representation of.

The representation is the bytes 61 62 63 FF with the floobl flag set. It's a representation of an 'a' char, a 'b' char, a 'c' char, and a smuggled FF byte--identical to 'abc\uDCFF'.
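
For comparison, here's roughly what you can already do in current Python. There is no 'ascii-compatible' codec, so this sketch assumes the existing 'ascii' codec with the surrogateescape handler as the nearest equivalent; the result is the same abstract string, just stored in a wider representation:

```python
# Nearest existing equivalent: ASCII decoding with the surrogateescape handler.
smuggled = b'abc\xff'.decode('ascii', errors='surrogateescape')

print(ascii(smuggled))         # 'abc\udcff'
print(len(smuggled))           # 4
print(hex(ord(smuggled[-1])))  # 0xdcff
```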

(This last bit is the part I'm a bit wary of, as it promotes surrogate-escape to being an inherent part of the meaning of Unicode strings in Python. But maybe Stephen has an answer for that. And anyway, it's a much smaller problem than the one you think is there.)

> I 
> presume the internals will actually contain the four bytes
> 
>    61 62 63 FF
> 
> and the "7-bit repr" flag will be set. Is that flag the only difference 
> between these two strings?
> 
>    b'abc\xFF'.decode('ascii-compatible')
>    'abc\xFF'

The floobl flag is the only difference between the two internal representations, but there's a big difference in the meaning.

> Presumably they will compare equal, yes?

I would hope not. One of them has the Unicode character U+FF, the other has smuggled byte 0xFF, so they'd better not compare equal.

However, the latter should compare equal to 'abc\uDCFF'. That's the entire key here: the new representation is nothing but a more compact way to represent strings that contain nothing but ASCII and surrogate escapes.
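
To make that concrete, here's a small sketch in today's Python. The helper name fits_floobl is made up for illustration; it just tests whether a string contains nothing but ASCII characters and surrogate-escaped bytes (U+DC80 through U+DCFF), which is my reading of what the new representation could hold:

```python
def fits_floobl(s: str) -> bool:
    """Hypothetical check: only ASCII chars and surrogate-escaped bytes."""
    return all(ord(ch) < 0x80 or 0xDC80 <= ord(ch) <= 0xDCFF for ch in s)

latin1_char = 'abc\xff'                                   # character U+00FF
smuggled = b'abc\xff'.decode('ascii', 'surrogateescape')  # smuggled byte 0xFF

print(latin1_char == smuggled)   # False
print(smuggled == 'abc\udcff')   # True
print(fits_floobl(smuggled))     # True
print(fits_floobl(latin1_char))  # False
```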

>> 2.  When sliced, the result needs to be checked for non-ASCII bytes.
>>    If none, the result is promoted to 8-bit.
>> 
>> 3.  When combined with a str in 8-bit representation:
>> 
>>    a.  If the 8-bit str contains any Latin-1 or C1 characters, both
>>        strs are promoted to 16-bit, and non-ASCII characters in the
>>        7-bit string are converted by the surrogateescape handler.
>> 
>>    b.  Otherwise they're combined into a 7-bit str.
> 
> 
> A concrete example:
> 
>    s = b'abcd'.decode('ascii-compatible')
>    t = 'x'  # ASCII-compatible
>    s + t
>    => returns 'abcdx', with the "7-bit repr" flag cleared.

Right. Here both s and t are normal 8-bit string reprs in the first place, so the new logic doesn't even get invoked. So yes, that's what it returns.

>    s = b'abcd'.decode('ascii-compatible')
>    t = 'ÿ'  # U+00FF, non-ASCII.
> 
>    s + t
>    => returns 'abcd\uDCFF', with the "7-bit repr" flag set

No, you've missed a few key bits here.

First, you're again adding two regular 8-bit-repr strings, not a non-ASCII-smuggling string plus an 8-bit, so the new logic doesn't get invoked at all.

Plus, even if s were a 7-bit-flagged string like b'ab\xfe'.decode('ascii-compatible'), that wouldn't turn t into \uDCFF. Only bytes in the floobl-flagged string are surrogate-escaped; characters in the normal string are handled normally. So you'd have 'ab\uDCFE\xFF'.

Also, both strings are promoted to 16-bit, and the floobl flag is never set with 16-bit or 32-bit representations.
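
You can check the resulting string value in current Python, again assuming 'ascii' plus surrogateescape as a stand-in for the proposed codec. The last line only shows that the result has a code point above U+00FF, which is why it can't live in an 8-bit representation:

```python
s = b'ab\xfe'.decode('ascii', 'surrogateescape')  # stand-in for a floobl string
t = '\xff'                                        # ordinary U+00FF character

combined = s + t
print(ascii(combined))                 # 'ab\udcfe\xff'
print(max(map(ord, combined)) > 0xFF)  # True
```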

> The \uDCFF at the end is the ÿ encoded with the surrogateescape error 
> handler.
> 
> There's a problem with this: two strings, visually indistinguishable, 
> but differing only in the internal representation, give completely 
> different results:
> 
>    b'abcd'.decode('ascii') + 'ÿ'
>    => 'abcd\u00FF'
> 
>    b'abcd'.decode('ascii-compatible') + 'ÿ'
>    => 'abcd\uDCFF'

Nope, again, these both give the first result.

>> 4.  When combined with a str in 16-bit or 32-bit representation, the
>>    7-bit string is "decoded" to the same representation, as if using
>>    the 'ascii' codec with the 'surrogateescape' handler.
> 
> Another example:
> 
>    s = b'abcd'.decode('ascii-compatible')
>    assert s == 'abcd'
>    s + 'π'
>    => returns what?

'abcdπ'. Since the first one is a plain 8-bit string, and the second a plain 16-bit string, the new logic never even gets involved. 

And again, if you change this so s is b'abc\xFE'.decode('ascii-compatible'), then you're adding a floobl string and a 16-bit string, so the FE byte gets encoded as DCFE, while the pi character is left unchanged, so you get 'abc\uDCFEπ'.

> Your description confuses me. The "7-bit string" is already text, how do 
> you decode it to the 16-bit internal representation? 

By decoding its representation as if it were bytes, using surrogate-escape.
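
Here's a toy model of how I read rules 3 and 4. The Floobl class is entirely hypothetical (the real thing would be a flag inside the str object, not a separate type); it just holds the raw bytes and shows when they stay as-is and when they get widened:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Floobl:
    """Toy stand-in for a str in the proposed representation (hypothetical)."""
    raw: bytes  # ASCII bytes plus at least one smuggled byte >= 0x80

    def promote(self) -> str:
        # Rule 4: decode the raw buffer as ASCII with surrogateescape.
        return self.raw.decode('ascii', 'surrogateescape')

    def __add__(self, other: str):
        if other.isascii():
            # Rule 3b: the result is still only ASCII plus smuggled bytes.
            return Floobl(self.raw + other.encode('ascii'))
        # Rules 3a and 4: widen, escaping only the smuggled bytes.
        return self.promote() + other

s = Floobl(b'abc\xfe')
print(ascii(s + 'π'))  # 'abc\udcfe\u03c0'
print(s + 'x')         # Floobl(raw=b'abc\xfex')
```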

>> 5.  String methods that would raise or produce undefined results if
>>    used on str containing surrogate-encoded bytes need to be taught
>>    to do the same on non-ASCII bytes in 7-bit str objects.
> 
> Do you have an example of such string methods?
> 
> 
>> 6.  On output the 'ascii-compatible' codec simply memcpy's 7-bit str
>>    and pure ASCII 8-bit str, and raises on anything else.  (Sorry,
>>    no, ISO 8859-1 does *not* get passed through without exception.)
>> 
>> 7.  On output other codecs raise on a 7-bit str, unless the
>>    surrogateescape handler is in use.
> 
> What do you mean by "on output"? Do you mean when encoding?

Presumably "output" means something like writing to a TextIOWrapper whose encoding is ascii-compatible. In which case you're right, it would be clearer to just say "when encoding".

However, I think there's a mistake in the design of 6 here. Surely encoding 'abc\uDCFF' should give you the bytes 61 62 63 FF, not an exception, right? (Unless the idea is that such a string is guaranteed to have a floobl-flagged 8-bit representation, not a 16-bit one, no matter how you try to create it in Python or in C, and I don't think the other rules make that guarantee.)
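
That's how the existing surrogateescape handler already behaves on the encoding side, which is the behavior I'd expect the new codec to match. A quick check in current Python:

```python
s = 'abc\udcff'

# With surrogateescape, the smuggled byte comes back out unchanged.
print(s.encode('ascii', 'surrogateescape'))  # b'abc\xff'

# Without the handler, the lone surrogate cannot be encoded.
try:
    s.encode('ascii')
except UnicodeEncodeError:
    print('strict encode raises UnicodeEncodeError')
```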

> 
> This concerns me:
> 
>    b'abcd'.decode('ascii').encode('latin-1')
>    => returns b'abcd'
> 
>    b'abcd'.decode('ascii-compatible').encode('latin-1')
>    => raises

Nope. The decoding returns the string 'abcd', in normal 8-bit representation, in both cases. There are no non-ASCII bytes, so the floobl flag isn't set. So you get the same result either way.
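
A rough model of the decoding side, as I understand it. The function decode_ascii_compatible is a made-up name, and returning the flag alongside the string is only a way to make the hidden state visible in a sketch:

```python
def decode_ascii_compatible(data: bytes):
    """Hypothetical model of the proposed codec: returns (text, floobl_flag)."""
    flag = any(b >= 0x80 for b in data)  # set only if a non-ASCII byte is seen
    return data.decode('ascii', 'surrogateescape'), flag

text, flag = decode_ascii_compatible(b'abcd')
print(text, flag)              # abcd False
print(text.encode('latin-1'))  # b'abcd'

text2, flag2 = decode_ascii_compatible(b'abc\xff')
print(flag2)                   # True
try:
    text2.encode('latin-1')    # rule 7: other codecs raise on smuggled bytes
except UnicodeEncodeError:
    print('latin-1 raises on the smuggled byte')
```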

> And yet, the two 'abcd' strings you get are visually indistinguishable, 
> and only differ by a hidden, internal flag.
> 
> I've probably misunderstood something about your proposal, so please 
> explain where I've gone wrong. Please give examples!
> 
> 
> -- 
> Steven