Re: [Python-ideas] Smoothing transition to Python 3

4 Jun 2016

On Fri, Jun 03, 2016 at 08:41:14PM -0400, Random832 wrote:
...
> What about moving forward to unify the types? For example, we could go
> with the Emacs way: A single string abstract type*, which is a sequence
> whose elements can be Unicode characters, or raw non-ASCII bytes, all of
> which are distinct from each other (Emacs' representation is to assign
> the high bytes "code points" between "U+3FFF80" and "U+3FFFFF").

I am fascinated by this concept. I think it might help solve the problem 
of "mixed text and bytes" files, but at the cost of throwing memory at 
it. Reading a file in binary mode would return a sequence of 32-bit 
"Unicode-plus-bytes" code points, rather than 8-bit bytes. Working in 
bytes would be more expensive, unless you happened to be lucky enough to 
only be dealing with bytes with the high bit cleared.

Basically, instead of having two types:

bytes: valid values are \x00 through \xFF;
text:  valid values are U+0000 through U+10FFFF

we'd have one:

text+bytes

interpreted as:

U+0000 through U+007F: bytes, or Unicode, depending on context;
U+0080 through U+10FFFF: only Unicode;
U+3FFF80 through U+3FFFFF: bytes \x80 through \xFF.

Values outside of those ranges are presumably impossible.
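
The interpretation above can be sketched in a few lines. This is only an 
illustration, assuming the mixed type is modelled as plain integers 
(Python's chr() cannot represent values above U+10FFFF); the names 
RAW_BYTE_BASE, byte_to_code, classify and code_to_byte are hypothetical 
and not part of any existing API.

```python
# Hypothetical helpers for the "text+bytes" ranges listed above.
RAW_BYTE_BASE = 0x3FFF00   # chosen so that byte 0x80 lands on 0x3FFF80
MAX_UNICODE = 0x10FFFF


def byte_to_code(b):
    """Map one byte value (0..255) to an element of the mixed type."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte: %r" % (b,))
    # ASCII bytes keep their value; high bytes move to the raw-byte range.
    return b if b < 0x80 else RAW_BYTE_BASE + b


def classify(code):
    """Say which of the three ranges an element falls in."""
    if 0x00 <= code <= 0x7F:
        return "ascii"      # bytes or Unicode, depending on context
    if 0x80 <= code <= MAX_UNICODE:
        return "unicode"    # only Unicode
    if 0x3FFF80 <= code <= 0x3FFFFF:
        return "byte"       # raw bytes \x80 through \xFF
    raise ValueError("outside the mixed type: %r" % (code,))


def code_to_byte(code):
    """Inverse of byte_to_code; fails for non-ASCII Unicode elements."""
    kind = classify(code)
    if kind == "ascii":
        return code
    if kind == "byte":
        return code - RAW_BYTE_BASE
    raise ValueError("element is text, not a byte: %r" % (code,))
```

Under these assumptions every byte value round-trips through 
byte_to_code and code_to_byte, while U+00E9 and the raw byte \xE9 stay 
distinct elements.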

I'm not sure what implications there are for codecs.

The downside is that reading from a binary file would give a sequence of 
code points that require four bytes each rather than one (except in the 
unusual case that *no* byte has its high bit set). Nor could you 
tell the difference between a run of ASCII bytes and ASCII text.
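
Both points can be seen in a rough model. This sketch assumes storage 
picks 1, 2 or 4 bytes per element from the largest element, in the style 
of the current flexible string representation; from_bytes, from_text and 
element_width are hypothetical names used only for illustration.

```python
# Hypothetical model of the storage cost of a mixed text+bytes sequence.
RAW_BYTE_BASE = 0x3FFF00   # byte 0x80 maps to 0x3FFF80


def from_bytes(data):
    """Elements produced by reading raw bytes into the mixed type."""
    return [b if b < 0x80 else RAW_BYTE_BASE + b for b in data]


def from_text(text):
    """Elements produced by ordinary Unicode text."""
    return [ord(ch) for ch in text]


def element_width(codes):
    """Bytes per element, assuming FSR-style 1/2/4 byte storage."""
    widest = max(codes, default=0)
    if widest < 0x100:
        return 1
    if widest < 0x10000:
        return 2
    return 4


ascii_only = from_bytes(b"hello")
with_high_byte = from_bytes(b"caf\xe9")
```

Here element_width(ascii_only) is 1, but a single high byte pushes 
element_width(with_high_byte) to 4, and from_bytes(b"abc") compares equal 
to from_text("abc"), so nothing in the data records which of the two it 
was.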

But maybe we could live with that?
...
> *Emacs has two concrete types: "byte strings" which can contain no
> non-ASCII characters, and "unicode strings" which use UTF-8 (plus those
> extra code points) underlying representation [various indexing
> operations are O(N)]. Python would use the FSR as it is now, along with
> perhaps a "byte string" type which likewise can contain only ASCII
> characters and high bytes.

Presumably for backwards compatibility we would keep bytes as they are 
now, and either add a new Mixed string type, or modify str to be mixed.

I'm still not entirely sure what the implications for encodings would 
be, but this is a promising idea with respect to mixed text/bytes.

-- 
Steve

Steven D'Aprano