[Python-ideas] RFC: bytestring as a str representation [was: a new bytestring type?]
Nick Coghlan
ncoghlan at gmail.com
Tue Jan 7 17:19:09 CET 2014
On 7 Jan 2014 23:45, "Steven D'Aprano" <steve at pearwood.info> wrote:
>
> On Tue, Jan 07, 2014 at 03:37:36AM +0900, Stephen J. Turnbull wrote:
>
> > So ... now that we have the flexible string representation (PEP 393),
> > let's add a 7-bit representation! (Don't take that too seriously,
> > there are interesting more general variants I'm not going to talk
> > about tonight.)
> >
> > The 7-bit representation satisfies the following requirements:
> >
> > 1. It is only produced on input by a new 'ascii-compatible' codec,
> > which sets the "7-bit representation" flag in the str object on
> > input if it encounters any non-ASCII bytes (if pure ASCII, it
> > produces an 8-bit str object). This will be slower than just
> > reading in the bytes in many cases, but I hope not unacceptably so.
>
> I'm confused by your suggestion here. It seems to me that you've got the
> conditions backwards. (Or I don't understand them.) Perhaps a couple of
> examples will make it clear.
>
> Suppose we take a pure-ASCII byte-string and decode it:
>
> b'abcd'.decode('ascii-compatible')
>
> According to the above, this will produce a regular str object, 'abcd',
> using the regular 8-bit internal representation, and the "7-bit repr"
> flag cleared. Correct? (So the flag is *cleared* when all the chars in
> the string are 7-bit, and *set* when at least one is not. Yes?)
>
> Suppose we take a byte-string with a non-ASCII byte:
>
> b'abc\xFF'.decode('ascii-compatible')
>
> This will return... what? I think it returns a so-called 7-bit
> representation, but I'm not sure what it is a representation of. I
> presume the internals will actually contain the four bytes
>
> 61 62 63 FF
>
> and the "7-bit repr" flag will be set. Is that flag the only difference
> between these two strings?
>
> b'abc\xFF'.decode('ascii-compatible')
> 'abc\xFF'
>
> Presumably they will compare equal, yes?
>
>
> > 2. When sliced, the result needs to be checked for non-ASCII bytes.
> > If none, the result is promoted to 8-bit.
> >
> > 3. When combined with a str in 8-bit representation:
> >
> > a. If the 8-bit str contains any Latin-1 or C1 characters, both
> > strs are promoted to 16-bit, and non-ASCII characters in the
> > 7-bit string are converted by the surrogateescape handler.
> >
> > b. Otherwise they're combined into a 7-bit str.
>
>
> A concrete example:
>
> s = b'abcd'.decode('ascii-compatible')
> t = 'x' # ASCII-compatible
> s + t
> => returns 'abcdx', with the "7-bit repr" flag cleared.
>
>
> s = b'abcd'.decode('ascii-compatible')
> t = 'ÿ' # U+00FF, non-ASCII.
>
> s + t
> => returns 'abcd\uDCFF', with the "7-bit repr" flag set
>
> The \uDCFF at the end is the ÿ encoded with the surrogateescape error
> handler.
>
> There's a problem with this: two strings, visually indistinguishable,
> but differing only in the internal representation, give completely
> different results:
>
> b'abcd'.decode('ascii') + 'ÿ'
> => 'abcd\u00FF'
>
> b'abcd'.decode('ascii-compatible') + 'ÿ'
> => 'abcd\uDCFF'
>
>
> > 4. When combined with a str in 16-bit or 32-bit representation, the
> > 7-bit string is "decoded" to the same representation, as if using
> > the 'ascii' codec with the 'surrogateescape' handler.
>
> Another example:
>
> s = b'abcd'.decode('ascii-compatible')
> assert s == 'abcd'
> s + 'π'
> => returns what?
>
> Your description confuses me. The "7-bit string" is already text, how do
> you decode it to the 16-bit internal representation?
>
>
> > 5. String methods that would raise or produce undefined results if
> > used on str containing surrogate-encoded bytes need to be taught
> > to do the same on non-ASCII bytes in 7-bit str objects.
>
> Do you have an example of such string methods?
>
>
> > 6. On output the 'ascii-compatible' codec simply memcpy's 7-bit str
> > and pure ASCII 8-bit str, and raises on anything else. (Sorry,
> > no, ISO 8859-1 does *not* get passed through without exception.)
> >
> > 7. On output other codecs raise on a 7-bit str, unless the
> > surrogateescape handler is in use.
>
> What do you mean by "on output"? Do you mean when encoding?
>
> This concerns me:
>
> b'abcd'.decode('ascii').encode('latin-1')
> => returns b'abcd'
>
> b'abcd'.decode('ascii-compatible').encode('latin-1')
> => raises
>
> And yet, the two 'abcd' strings you get are visually indistinguishable,
> and only differ by a hidden, internal flag.
>
> I've probably misunderstood something about your proposal, so please
> explain where I've gone wrong. Please give examples!
I haven't been following the discussion in detail (linux.conf.au and the
Py3 discussions have most of my attention this week), but I'm definitely
not clear on how this 7-bit proposal differs meaningfully from just using
ascii with the surrogateescape error handler.
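
For reference, here is a minimal sketch of what the existing 'ascii' codec with the surrogateescape error handler (PEP 383) already does in current Python 3. The variable names are only illustrative, and the sketch makes no claim about how the proposed 7-bit representation would behave:

```python
# Behaviour of the existing 'ascii' codec with the 'surrogateescape'
# error handler, for comparison with the proposed 7-bit form.

raw = b'abc\xff'

# Non-ASCII bytes are carried through as lone surrogates U+DC80..U+DCFF.
text = raw.decode('ascii', 'surrogateescape')
assert text == 'abc\udcff'

# The escaped byte is distinct from the Latin-1 character U+00FF.
assert text != 'abc\xff'

# Encoding with the same handler restores the original bytes exactly.
assert text.encode('ascii', 'surrogateescape') == raw

# Pure ASCII input gives an ordinary str with no hidden state, so
# concatenation with U+00FF and encoding to Latin-1 behave as usual.
plain = b'abcd'.decode('ascii', 'surrogateescape')
assert plain + '\xff' == 'abcd\xff'
assert plain.encode('latin-1') == b'abcd'

# Without the handler, other codecs refuse the escaped byte.
try:
    text.encode('utf-8')
except UnicodeEncodeError:
    strict_utf8_raises = True
else:
    strict_utf8_raises = False
assert strict_utf8_raises
```

With this handler the "is it escaped?" information lives in the code points themselves (U+DCFF versus U+00FF) rather than in a per-object flag, so two strings that compare equal always encode the same way.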
Cheers,
Nick.
>
>
> --
> Steven