<p dir="ltr"><br>

On 7 Jan 2014 23:45, "Steven D'Aprano" <<a href="mailto:steve@pearwood.info">steve@pearwood.info</a>> wrote:<br>

><br>

> On Tue, Jan 07, 2014 at 03:37:36AM +0900, Stephen J. Turnbull wrote:<br>

><br>

> > So ... now that we have the flexible string representation (PEP 393),<br>

> > let's add a 7-bit representation!  (Don't take that too seriously,<br>

> > there are interesting more general variants I'm not going to talk<br>

> > about tonight.)<br>

> ><br>

> > The 7-bit representation satisfies the following requirements:<br>

> ><br>

> > 1.  It is only produced on input by a new 'ascii-compatible' codec,<br>

> >     which sets the "7-bit representation" flag in the str object on<br>

> >     input if it encounters any non-ASCII bytes (if pure ASCII, it<br>

> >     produces an 8-bit str object).  This will be slower than just<br>

> >     reading in the bytes in many cases, but I hope not unacceptably so.<br>

><br>

> I'm confused by your suggestion here. It seems to me that you've got the<br>

> conditions backwards. (Or I don't understand them.) Perhaps a couple of<br>

> examples will make it clear.<br>

><br>

> Suppose we take a pure-ASCII byte-string and decode it:<br>

><br>

>     b'abcd'.decode('ascii-compatible')<br>

><br>

> According to the above, this will produce a regular str object, 'abcd',<br>

> using the regular 8-bit internal representation, and the "7-bit repr"<br>

> flag cleared. Correct? (So the flag is *cleared* when all the chars in<br>

> the string are 7-bit, and *set* when at least one is not. Yes?)<br>

><br>

> Suppose we take a byte-string with a non-ASCII byte:<br>

><br>

>     b'abc\xFF'.decode('ascii-compatible')<br>

><br>

> This will return... what? I think it returns a so-called 7-bit<br>

> representation, but I'm not sure what it is a representation of. I<br>

> presume the internals will actually contain the four bytes<br>

><br>

>     61 62 63 FF<br>

><br>

> and the "7-bit repr" flag will be set. Is that flag the only difference<br>

> between these two strings?<br>

><br>

>     b'abc\xFF'.decode('ascii-compatible')<br>

>     'abc\xFF'<br>

><br>

> Presumably they will compare equal, yes?<br>

><br>

><br>

> > 2.  When sliced, the result needs to be checked for non-ASCII bytes.<br>

> >     If none, the result is promoted to 8-bit.<br>

> ><br>

> > 3.  When combined with a str in 8-bit representation:<br>

> ><br>

> >     a.  If the 8-bit str contains any Latin-1 or C1 characters, both<br>

> >         strs are promoted to 16-bit, and non-ASCII characters in the<br>

> >         7-bit string are converted by the surrogateescape handler.<br>

> ><br>

> >     b.  Otherwise they're combined into a 7-bit str.<br>

><br>

><br>

> A concrete example:<br>

><br>

>     s = b'abcd'.decode('ascii-compatible')<br>

>     t = 'x'  # ASCII-compatible<br>

>     s + t<br>

>     => returns 'abcdx', with the "7-bit repr" flag cleared.<br>

><br>

><br>

>     s = b'abcd'.decode('ascii-compatible')<br>

>     t = 'ÿ'  # U+00FF, non-ASCII.<br>

><br>

>     s + t<br>

>     => returns 'abcd\uDCFF', with the "7-bit repr" flag set<br>

><br>

> The \uDCFF at the end is the ÿ encoded with the surrogateescape error<br>

> handler.<br>

><br>

> There's a problem with this: two strings, visually indistinguishable,<br>

> but differing only in the internal representation, give completely<br>

> different results:<br>

><br>

>     b'abcd'.decode('ascii') + 'ÿ'<br>

>     => 'abcd\u00FF'<br>

><br>

>     b'abcd'.decode('ascii-compatible') + 'ÿ'<br>

>     => 'abcd\uDCFF'<br>

><br>

><br>

> > 4.  When combined with a str in 16-bit or 32-bit representation, the<br>

> >     7-bit string is "decoded" to the same representation, as if using<br>

> >     the 'ascii' codec with the 'surrogateescape' handler.<br>

><br>

> Another example:<br>

><br>

>     s = b'abcd'.decode('ascii-compatible')<br>

>     assert s == 'abcd'<br>

>     s + 'π'<br>

>     => returns what?<br>

><br>

> Your description confuses me. The "7-bit string" is already text, how do<br>

> you decode it to the 16-bit internal representation?<br>

><br>

><br>

> > 5.  String methods that would raise or produce undefined results if<br>

> >     used on str containing surrogate-encoded bytes need to be taught<br>

> >     to do the same on non-ASCII bytes in 7-bit str objects.<br>

><br>

> Do you have an example of such string methods?<br>

><br>

><br>

> > 6.  On output the 'ascii-compatible' codec simply memcpy's 7-bit str<br>

> >     and pure ASCII 8-bit str, and raises on anything else.  (Sorry,<br>

> >     no, ISO 8859-1 does *not* get passed through without exception.)<br>

> ><br>

> > 7.  On output other codecs raise on a 7-bit str, unless the<br>

> >     surrogateescape handler is in use.<br>

><br>

> What do you mean by "on output"? Do you mean when encoding?<br>

><br>

> This concerns me:<br>

><br>

>     b'abcd'.decode('ascii').encode('latin-1')<br>

>     => returns b'abcd'<br>

><br>

>     b'abcd'.decode('ascii-compatible').encode('latin-1')<br>

>     => raises<br>

><br>

> And yet, the two 'abcd' strings you get are visually indistinguishable,<br>

> and only differ by a hidden, internal flag.<br>

><br>

> I've probably misunderstood something about your proposal, so please<br>

> explain where I've gone wrong. Please give examples!</p>

<p dir="ltr">I haven't been following the discussion in detail (<a href="http://linux.conf.au">linux.conf.au</a> and the Py3 discussions have most of my attention this week), but I'm definitely not clear on how this 7-bit proposal differs meaningfully from just using ascii with the surrogateescape error handler.</p>
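<p dir="ltr">For comparison, here is a minimal sketch of what the existing ascii codec with the surrogateescape error handler already does today. It only uses current stdlib behaviour; the proposed 'ascii-compatible' codec does not exist, so it is not shown, and the variable names are just for illustration.</p>

```python
# Existing behaviour only: the stdlib 'ascii' codec with 'surrogateescape'.
raw = b'abc\xff'

# The non-ASCII byte 0xFF is smuggled into the str as the lone surrogate U+DCFF.
text = raw.decode('ascii', errors='surrogateescape')
assert text == 'abc\udcff'

# Encoding with the same handler restores the original bytes exactly.
assert text.encode('ascii', errors='surrogateescape') == raw

# Pure-ASCII input produces an ordinary str with no escaped bytes in it.
plain = b'abcd'.decode('ascii', errors='surrogateescape')
assert plain == 'abcd'

# A strict encode of text containing an escaped byte fails,
# because the lone surrogate is not representable in latin-1.
try:
    text.encode('latin-1')
except UnicodeEncodeError:
    strict_encode_failed = True
else:
    strict_encode_failed = False
```

<p dir="ltr">Note that the pure-ASCII result is an ordinary str, so it still encodes to latin-1 without complaint; only text that actually carries escaped bytes raises on a strict encode.</p>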


<p dir="ltr">Cheers,<br>

Nick.</p>

<p dir="ltr">><br>

><br>

> --<br>

> Steven</p>