[Python-ideas] Smoothing transition to Python 3

Sat Jun 4 05:05:22 EDT 2016

On Fri, Jun 03, 2016 at 08:41:14PM -0400, Random832 wrote:

> What about moving forward to unify the types? For example, we could go
> with the Emacs way: A single string abstract type*, which is a sequence
> whose elements can be Unicode characters, or raw non-ASCII bytes, all of
> which are distinct from each other (Emacs' representation is to assign
> the high bytes "code points" between "U+3FFF80" and "U+3FFFFF").

I am fascinated by this concept. I think it might help solve the problem 
of "mixed text and bytes" files, but at the cost of throwing memory at 
it. Reading a file in binary mode would return a sequence of 32-bit 
"Unicode-plus-bytes" code points, rather than 8-bit bytes. Working in 
bytes would be more expensive, unless you happened to be lucky enough to 
only be dealing with bytes with the high bit cleared.

Basically, instead of having two types:

bytes: valid values are \x00 through \xFF;
text:  valid values are U+0000 through U+10FFFF;

we'd have one:

text+bytes

interpreted as:

U+0000 through U+007F: bytes, or Unicode, depending on context;
U+0080 through U+10FFFF: only Unicode;
U+3FFF80 through U+3FFFFF: bytes \x80 through \xFF.

Values outside of those ranges (U+110000 through U+3FFF7F, and anything
above U+3FFFFF) are presumably impossible.
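
To make the ranges above concrete, here is a rough sketch of the mapping
using plain ints, since a Python str cannot hold anything above U+10FFFF.
The names RAW_BYTE_BASE, byte_to_code, code_to_byte and classify are made
up for illustration; nothing like them exists in Python or in Emacs.

```python
# Hypothetical helpers for the combined "text+bytes" code space.
RAW_BYTE_BASE = 0x3FFF00   # chosen so that byte 0x80 lands on 0x3FFF80
MAX_UNICODE = 0x10FFFF


def byte_to_code(b):
    """Map one byte (0..255) to its code in the combined space."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte: %r" % (b,))
    # Bytes with the high bit clear share their code with ASCII text.
    return b if b < 0x80 else RAW_BYTE_BASE + b


def classify(code):
    """Say which of the three ranges a code falls in."""
    if 0 <= code <= 0x7F:
        return "ascii"       # byte or Unicode, depending on context
    if 0x80 <= code <= MAX_UNICODE:
        return "unicode"     # only Unicode
    if 0x3FFF80 <= code <= 0x3FFFFF:
        return "raw-byte"    # bytes \x80 through \xFF
    raise ValueError("code outside the combined space: %#x" % code)


def code_to_byte(code):
    """Inverse of byte_to_code; fails for non-ASCII Unicode codes."""
    kind = classify(code)
    if kind == "ascii":
        return code
    if kind == "raw-byte":
        return code - RAW_BYTE_BASE
    raise ValueError("U+%04X is text, not a byte" % code)
```

Note that code_to_byte(byte_to_code(b)) == b for every byte, but a code
in the ASCII range carries no record of whether it started life as a
byte or as a character.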

I'm not sure what implications there are for codecs.

The downside is that reading from a binary file would give a sequence of 
code points that require four bytes each rather than one (except in the 
unusual case that *no* byte had its high bit set). Nor could you tell 
the difference between a bunch of ASCII bytes and ASCII text.
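
Both costs can be illustrated with a small sketch. It assumes the
combined type would pick its storage width the way the FSR does today
(1 byte per element up to U+00FF, 2 up to U+FFFF, 4 otherwise); that is
my assumption, not something the proposal specifies. byte_to_code and
bytes_per_code are hypothetical names.

```python
def byte_to_code(b):
    """Hypothetical mapping: high bytes go to 0x3FFF80..0x3FFFFF."""
    return b if b < 0x80 else 0x3FFF00 + b


def bytes_per_code(codes):
    """Storage width per element, assuming FSR-style width selection."""
    top = max(codes, default=0)
    if top <= 0xFF:
        return 1
    if top <= 0xFFFF:
        return 2
    return 4


ascii_data = [byte_to_code(b) for b in b"spam"]
binary_data = [byte_to_code(b) for b in b"spam\xff"]
text_codes = [ord(c) for c in "spam"]

print(bytes_per_code(ascii_data))    # 1: no byte has its high bit set
print(bytes_per_code(binary_data))   # 4: a single high byte widens everything
print(ascii_data == text_codes)      # True: ASCII bytes and ASCII text coincide
```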

But maybe we could live with that?

> *Emacs has two concrete types: "byte strings" which can contain no
> non-ASCII characters, and "unicode strings" which use UTF-8 (plus those
> extra code points) underlying representation [various indexing
> operations are O(N)]. Python would use the FSR as it is now, along with
> perhaps a "byte string" type which likewise can contain only ASCII
> characters and high bytes.

Presumably for backwards compatibility we would keep bytes as they are 
now, and either add a new Mixed string type, or modify str to be mixed.
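
As a very rough sketch of what a separate mixed type could look like,
consider the toy class below. Everything about it is hypothetical: the
name Mixed, its methods, and in particular the to_bytes behaviour
(encode the text elements, pass the raw bytes through untouched) are
just one guess at how such a type might behave, not a worked-out design.

```python
class Mixed:
    """Toy mixed text+bytes sequence; purely illustrative."""

    RAW_BYTE_BASE = 0x3FFF00   # byte 0x80 maps to 0x3FFF80

    def __init__(self, codes=()):
        self._codes = tuple(codes)

    @classmethod
    def from_bytes(cls, data):
        # High bytes move to the reserved range; ASCII bytes stay put.
        return cls(b if b < 0x80 else cls.RAW_BYTE_BASE + b for b in data)

    @classmethod
    def from_text(cls, text):
        return cls(ord(c) for c in text)

    def __add__(self, other):
        if not isinstance(other, Mixed):
            return NotImplemented
        return Mixed(self._codes + other._codes)

    def __len__(self):
        return len(self._codes)

    def to_bytes(self, encoding="utf-8"):
        # One possible rule: raw bytes are written as-is, everything
        # else is treated as text and run through the codec.
        out = bytearray()
        for code in self._codes:
            if code >= 0x3FFF80:
                out.append(code - self.RAW_BYTE_BASE)
            else:
                out += chr(code).encode(encoding)
        return bytes(out)


m = Mixed.from_bytes(b"ab\xff") + Mixed.from_text("\u00e9")
print(len(m))          # 4
print(m.to_bytes())    # b'ab\xff\xc3\xa9'
```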

I'm still not entirely sure what the implications for encodings would 
be, but this is a promising idea with respect to mixed text/bytes.

-- 
Steve