On Fri, Jun 03, 2016 at 08:41:14PM -0400, Random832 wrote:
> What about moving forward to unify the types? For example, we could go with the Emacs way: A single string abstract type*, which is a sequence whose elements can be Unicode characters, or raw non-ASCII bytes, all of which are distinct from each other (Emacs' representation is to assign the high bytes "code points" between "U+3FFF80" and "U+3FFFFF").
I am fascinated by this concept. I think it might help solve the problem of "mixed text and bytes" files, but at the cost of throwing memory at it. Reading a file in binary mode would return a sequence of 32-bit "Unicode-plus-bytes" code points, rather than 8-bit bytes. Working in bytes would be more expensive, unless you happened to be lucky enough to only be dealing with bytes with the high bit cleared.

Basically, instead of having two types:

    bytes: valid values are \x00 through \xFF;
    text: valid values are U+0000 through U+10FFFF

we'd have one:

    text+bytes

interpreted as:

    U+0000 through U+007F: bytes, or Unicode, depending on context;
    U+0080 through U+10FFFF: only Unicode;
    U+3FFF80 through U+3FFFFF: bytes \x80 through \xFF.

Values outside of those ranges are presumably impossible. I'm not sure what implications there are for codecs.

The downside is that reading from a binary file would give a sequence of code points that require four bytes each rather than one (except in the unusual case that *no* byte had its high bit set). Nor could you tell the difference between a bunch of ASCII bytes and ASCII text. But maybe we could live with that?
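To make the mapping above concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the function names are made up, a plain list of ints stands in for the hypothetical mixed type (a real str cannot hold code points above U+10FFFF), and storage_width() only imitates the 1/2/4-byte widths that the FSR picks from the largest code point in a string.

```python
# Hypothetical sketch of the "text+bytes" code point scheme.
# A list of ints stands in for the mixed string type.

HIGH_BYTE_BASE = 0x3FFF00  # chosen so that byte 0x80 maps to 0x3FFF80


def byte_to_mixed(b):
    """Map one byte (0..255) to its mixed code point."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte: %r" % (b,))
    return b if b < 0x80 else HIGH_BYTE_BASE + b


def mixed_to_byte(cp):
    """Map a mixed code point back to a byte; fail for non-ASCII text."""
    if 0 <= cp <= 0x7F:
        return cp
    if 0x3FFF80 <= cp <= 0x3FFFFF:
        return cp - HIGH_BYTE_BASE
    raise ValueError("code point %#x is not a byte" % cp)


def classify(cp):
    """Say which of the three ranges a code point falls in."""
    if 0 <= cp <= 0x7F:
        return "ascii"      # bytes or Unicode, depending on context
    if 0x80 <= cp <= 0x10FFFF:
        return "unicode"    # only Unicode
    if 0x3FFF80 <= cp <= 0x3FFFFF:
        return "raw byte"   # bytes \x80 through \xFF
    raise ValueError("code point %#x is outside the scheme" % cp)


def decode_binary(data):
    """What reading a file in binary mode would hand back."""
    return [byte_to_mixed(b) for b in data]


def encode_binary(code_points):
    """Turn a sequence of byte-valued code points back into bytes."""
    return bytes(mixed_to_byte(cp) for cp in code_points)


def storage_width(code_points):
    """Bytes per element an FSR-like representation would need."""
    largest = max(code_points, default=0)
    if largest <= 0xFF:
        return 1
    if largest <= 0xFFFF:
        return 2
    return 4


mixed = decode_binary(b"caf\xe9")
print([hex(cp) for cp in mixed])
print(storage_width(mixed), storage_width(decode_binary(b"cafe")))
```

Note that the byte \xE9 and the character U+00E9 end up as different code points (0x3FFFE9 versus 0xE9), which is the whole point of the scheme, and that a single high byte is enough to push the whole sequence to four bytes per element.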
> *Emacs has two concrete types: "byte strings" which can contain no non-ASCII characters, and "unicode strings" which use UTF-8 (plus those extra code points) as the underlying representation [various indexing operations are O(N)]. Python would use the FSR as it is now, along with perhaps a "byte string" type which likewise can contain only ASCII characters and high bytes.
Presumably for backwards compatibility we would keep bytes as they are now, and either add a new Mixed string type, or modify str to be mixed. I'm still not entirely sure what the implications for encodings would be, but this is a promising idea with respect to mixed text/bytes.

-- 
Steve