[Python-ideas] RFC: bytestring as a str representation [was: a new bytestring type?]

Tue Jan 7 16:44:03 CET 2014

On Tue, Jan 07, 2014 at 03:37:36AM +0900, Stephen J. Turnbull wrote:

> So ... now that we have the flexible string representation (PEP 393),
> let's add a 7-bit representation!  (Don't take that too seriously,
> there are interesting more general variants I'm not going to talk
> about tonight.)
> 
> The 7-bit representation satisfies the following requirements:
> 
> 1.  It is only produced on input by a new 'ascii-compatible' codec,
>     which sets the "7-bit representation" flag in the str object on
>     input if it encounters any non-ASCII bytes (if pure ASCII, it
>     produces an 8-bit str object).  This will be slower than just
>     reading in the bytes in many cases, but I hope not unacceptably so.

I'm confused by your suggestion here. It seems to me that you've got the 
conditions backwards. (Or I don't understand them.) Perhaps a couple of 
examples will make it clear.

Suppose we take a pure-ASCII byte-string and decode it:

    b'abcd'.decode('ascii-compatible')

According to the above, this will produce a regular str object, 'abcd',
using the regular 8-bit internal representation, with the "7-bit repr"
flag cleared. Correct? (So the flag is *cleared* when all the chars in
the string are 7-bit, and *set* when at least one is not. Yes?)

Suppose we take a byte-string with a non-ASCII byte:

    b'abc\xFF'.decode('ascii-compatible')

This will return... what? I think it returns a so-called 7-bit 
representation, but I'm not sure what it is a representation of. I 
presume the internals will actually contain the four bytes

    61 62 63 FF

and the "7-bit repr" flag will be set. Is that flag the only difference 
between these two strings?

    b'abc\xFF'.decode('ascii-compatible')
    'abc\xFF'

Presumably they will compare equal, yes?
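
For comparison, here is a minimal sketch of what happens today with the
stock 'ascii' codec and the existing surrogateescape error handler
(PEP 383). It is not the proposed 'ascii-compatible' codec, which does
not exist; the names `escaped` and `literal` are only for illustration.

```python
# Existing behaviour only: stock 'ascii' codec plus the
# 'surrogateescape' error handler. Not the proposed codec.
escaped = b'abc\xFF'.decode('ascii', 'surrogateescape')
literal = 'abc\xFF'  # ends in U+00FF, a genuine Latin-1 character

print(ascii(escaped))      # 'abc\udcff'
print(escaped == literal)  # False

# The escaped byte round-trips back to the original bytes.
print(escaped.encode('ascii', 'surrogateescape'))  # b'abc\xff'
```

So with the current handler the two strings do not compare equal,
because the undecodable byte is stored as the lone surrogate U+DCFF
rather than as U+00FF.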

> 2.  When sliced, the result needs to be checked for non-ASCII bytes.
>     If none, the result is promoted to 8-bit.
> 
> 3.  When combined with a str in 8-bit representation:
> 
>     a.  If the 8-bit str contains any Latin-1 or C1 characters, both
>         strs are promoted to 16-bit, and non-ASCII characters in the
>         7-bit string are converted by the surrogateescape handler.
> 
>     b.  Otherwise they're combined into a 7-bit str.

A concrete example:

    s = b'abcd'.decode('ascii-compatible')
    t = 'x'  # ASCII-compatible
    s + t
    => returns 'abcdx', with the "7-bit repr" flag cleared.

    s = b'abcd'.decode('ascii-compatible')
    t = 'ÿ'  # U+00FF, non-ASCII.

    s + t
    => returns 'abcd\uDCFF', with the "7-bit repr" flag set

The \uDCFF at the end is the ÿ, treated as the byte 0xFF and converted
by the surrogateescape error handler.

There's a problem with this: two strings that are visually
indistinguishable, and differ only in their internal representation,
give completely different results:

    b'abcd'.decode('ascii') + 'ÿ'
    => 'abcd\u00FF'

    b'abcd'.decode('ascii-compatible') + 'ÿ'
    => 'abcd\uDCFF'

> 4.  When combined with a str in 16-bit or 32-bit representation, the
>     7-bit string is "decoded" to the same representation, as if using
>     the 'ascii' codec with the 'surrogateescape' handler.

Another example:

    s = b'abcd'.decode('ascii-compatible')
    assert s == 'abcd'
    s + 'π'
    => returns what?

Your description confuses me. The "7-bit string" is already text, so how
do you decode it to the 16-bit internal representation?
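
To show what I would expect, here is a sketch of the existing
behaviour, assuming the plain 'ascii' codec with surrogateescape stands
in for the proposed one. The decoding happens once, when the bytes
become a str; the later concatenation involves no further decoding.
The names `s` and `combined` are only for illustration.

```python
# Existing behaviour only, not the proposed 'ascii-compatible' codec.
s = b'abc\xFF'.decode('ascii', 'surrogateescape')

# Concatenating with a non-Latin-1 character is ordinary str
# concatenation; any widening of the internal representation
# (PEP 393) is invisible at the Python level.
combined = s + 'π'

print(ascii(combined))  # 'abc\udcff\u03c0'
print(len(combined))    # 5
```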

> 5.  String methods that would raise or produce undefined results if
>     used on str containing surrogate-encoded bytes need to be taught
>     to do the same on non-ASCII bytes in 7-bit str objects.

Do you have an example of such string methods?

> 6.  On output the 'ascii-compatible' codec simply memcpy's 7-bit str
>     and pure ASCII 8-bit str, and raises on anything else.  (Sorry,
>     no, ISO 8859-1 does *not* get passed through without exception.)
> 
> 7.  On output other codecs raise on a 7-bit str, unless the
>     surrogateescape handler is in use.

What do you mean by "on output"? Do you mean when encoding?

This concerns me:

    b'abcd'.decode('ascii').encode('latin-1')
    => returns b'abcd'

    b'abcd'.decode('ascii-compatible').encode('latin-1')
    => raises

And yet, the two 'abcd' strings you get are visually indistinguishable, 
and only differ by a hidden, internal flag.
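
By contrast, here is a sketch of how encoding behaves today, again
using only the stock codecs and the existing surrogateescape handler
rather than the proposed codec. The difference in behaviour is visible
in the string's contents (a lone surrogate), not hidden in a flag. The
names `plain` and `escaped` are only for illustration.

```python
# Existing behaviour only, not the proposed 'ascii-compatible' codec.
plain = b'abcd'.decode('ascii')
print(plain.encode('latin-1'))  # b'abcd'

escaped = b'abc\xFF'.decode('ascii', 'surrogateescape')
print(ascii(escaped))           # 'abc\udcff'

# A lone surrogate cannot be encoded by a strict codec...
try:
    escaped.encode('latin-1')
    raised = False
except UnicodeEncodeError:
    raised = True
print(raised)                   # True

# ...unless the surrogateescape handler is used when encoding too.
print(escaped.encode('latin-1', 'surrogateescape'))  # b'abc\xff'
```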

I've probably misunderstood something about your proposal, so please 
explain where I've gone wrong. Please give examples!

-- 
Steven