[Python-ideas] Smoothing transition to Python 3
Steven D'Aprano
steve at pearwood.info
Sat Jun 4 05:05:22 EDT 2016
On Fri, Jun 03, 2016 at 08:41:14PM -0400, Random832 wrote:
> What about moving forward to unify the types? For example, we could go
> with the Emacs way: A single string abstract type*, which is a sequence
> whose elements can be Unicode characters, or raw non-ASCII bytes, all of
> which are distinct from each other (Emacs' representation is to assign
> the high bytes "code points" between "U+3FFF80" and "U+3FFFFF").
I am fascinated by this concept. I think it might help solve the problem
of "mixed text and bytes" files, but at the cost of throwing memory at
it. Reading a file in binary mode would return a sequence of 32-bit
"Unicode-plus-bytes" code points, rather than 8-bit bytes. Working in
bytes would be more expensive, unless you happened to be lucky enough to
only be dealing with bytes with the high bit cleared.
Basically, instead of having two types:

    bytes: valid values are \x00 through \xFF;
    text: valid values are U+0000 through U+10FFFF

we'd have one:

    text+bytes

interpreted as:

    U+0000 through U+007F: bytes, or Unicode, depending on context;
    U+0080 through U+10FFFF: only Unicode;
    U+3FFF80 through U+3FFFFF: bytes \x80 through \xFF.

Values outside of those ranges are presumably impossible.
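
To make the ranges above concrete, here is a rough sketch of the mapping in Python. It is only an illustration, not a proposed implementation: the names (RAW_BYTE_BASE, byte_to_mixed, classify, mixed_to_byte, from_bytes) are made up for this example, and the mixed string is modelled as a plain list of ints because chr() does not accept values above 0x10FFFF.

```python
# Hypothetical offset: byte 0x80 lands on 0x3FFF80, byte 0xFF on 0x3FFFFF.
RAW_BYTE_BASE = 0x3FFF00


def byte_to_mixed(b):
    """Map one byte value (0..255) to a 'text+bytes' code point."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte: %r" % (b,))
    # Bytes with the high bit clear share the ASCII range.
    return b if b < 0x80 else RAW_BYTE_BASE + b


def classify(cp):
    """Say which of the three ranges a mixed code point falls in."""
    if 0x00 <= cp <= 0x7F:
        return "ascii"      # bytes or Unicode, depending on context
    if 0x80 <= cp <= 0x10FFFF:
        return "unicode"    # only Unicode
    if 0x3FFF80 <= cp <= 0x3FFFFF:
        return "byte"       # raw bytes \x80 through \xFF
    raise ValueError("outside the text+bytes ranges: %#x" % cp)


def mixed_to_byte(cp):
    """Recover the byte value, if this code point can stand for a byte."""
    kind = classify(cp)
    if kind == "ascii":
        return cp
    if kind == "byte":
        return cp - RAW_BYTE_BASE
    raise ValueError("code point %#x is text, not a byte" % cp)


def from_bytes(data):
    """Model 'reading in binary mode': bytes in, mixed code points out."""
    return [byte_to_mixed(b) for b in data]


if __name__ == "__main__":
    mixed = from_bytes(b"a\x80\xff")
    print([hex(cp) for cp in mixed])
    print(bytes(mixed_to_byte(cp) for cp in mixed))
```

The round trip through from_bytes and mixed_to_byte is lossless for any bytes object, which is the property that would matter for mixed text/bytes files.
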
I'm not sure what implications there are for codecs.
The downside is that reading from a binary file would give a sequence of
code points that require four bytes each rather than one (except in the
unusual case that *no* byte had its high bit set). Nor could you
tell the difference between a bunch of ASCII bytes and ASCII text.
But maybe we could live with that?
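
Both costs can be shown with a small, self-contained sketch. It assumes a storage scheme in the spirit of the FSR, where the width of every element is chosen from the largest code point in the string (1, 2 or 4 bytes); to_mixed and storage_width are hypothetical helpers invented for this example, and the mixed string is again modelled as a list of ints.

```python
def to_mixed(data):
    """Map bytes to mixed code points: high bytes go to 0x3FFF80..0x3FFFFF."""
    return [b if b < 0x80 else 0x3FFF00 + b for b in data]


def storage_width(code_points):
    """Bytes per element under an assumed FSR-like scheme (1, 2 or 4)."""
    highest = max(code_points, default=0)
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


if __name__ == "__main__":
    ascii_only = to_mixed(b"hello")
    with_high_byte = to_mixed(b"hello\xff")

    # A single byte with the high bit set widens the whole sequence.
    print(storage_width(ascii_only))
    print(storage_width(with_high_byte))

    # ASCII bytes and ASCII text become indistinguishable.
    print(ascii_only == [ord(c) for c in "hello"])
```

Under that assumption, one high byte anywhere in the data is enough to quadruple the storage, and nothing in the code points records whether "hello" started life as bytes or as text.
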
> *Emacs has two concrete types: "byte strings" which can contain no
> non-ASCII characters, and "unicode strings" which use UTF-8 (plus those
> extra code points) underlying representation [various indexing
> operations are O(N)]. Python would use the FSR as it is now, along with
> perhaps a "byte string" type which likewise can contain only ASCII
> characters and high bytes.
Presumably for backwards compatibility we would keep bytes as they are
now, and either add a new Mixed string type, or modify str to be mixed.
I'm still not entirely sure what the implications for encodings would
be, but this is a promising idea with respect to mixed text/bytes.
--
Steve