[Python-ideas] Python 3000 TIOBE -3%
Ethan Furman
ethan at stoneleaf.us
Thu Feb 16 16:21:23 CET 2012
Greg Ewing wrote:
> It seems to me that what surrogateescape is effectively doing is
> creating a new data type that consists of a mixture of ASCII
> characters and raw bytes, and enables you to tell which is which.
How so? It sounds like this new data type assumes everything over 127 is a
raw byte, but there are plenty of applications where values in the range
0-127 should be interpreted as raw bytes even when the majority are indeed
just plain ASCII.
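
To make that concrete, here is a minimal sketch using the standard
'surrogateescape' error handler with the ASCII codec. The sample bytes are
made up for illustration. Only bytes with the high bit set are escaped, so
a byte in the 0-127 range that the application meant as raw data looks
like an ordinary character after decoding.

```python
# Sample data (hypothetical): ASCII text, one high byte, and a low byte
# (0x41) that the application intends as raw data rather than the letter "A".
raw = b"abc\xff\x41"

text = raw.decode("ascii", errors="surrogateescape")

# Bytes >= 0x80 are mapped to lone surrogates in U+DC80..U+DCFF.
escaped = [ch for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# The low byte 0x41 simply decodes as "A"; nothing marks it as raw.
last_char = text[-1]

# Encoding with the same handler restores the original bytes.
round_tripped = text.encode("ascii", errors="surrogateescape")
```

The round trip is lossless, but the decoded string alone cannot say whether
that final "A" was text or data.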
> Maybe there should be a real data type like this, or a flag on
> the unicode type. The data would be stored in the same way as a
> latin1-decoded string, but anything with the high bit set would
> be regarded as a byte instead of a character. This might make it
> easier to interoperate with external libraries that expect
> well-formed unicode.
I can see a data type that is easier to work with than bytes
(ascii-string, anybody? ;) but I don't think we want to make it any kind
of unicode -- once the text has been extracted from this ascii-string it
should be converted to unicode for further processing, while any other
non-convertible bytes should stay as bytes (or ascii-string, or whatever
we call it).
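
A rough sketch of the workflow I have in mind, assuming a made-up
fixed-width record layout (8 bytes of space-padded ASCII followed by a
4-byte little-endian integer). The layout and the names are hypothetical
and only there to illustrate the split between text and raw bytes.

```python
import struct

# Hypothetical record: 8 bytes of space-padded ASCII text followed by a
# 4-byte little-endian integer whose bytes may all be below 128.
record = b"Ethan   " + struct.pack("<i", 65)

name_bytes, number_bytes = record[:8], record[8:]

# The text portion is decoded to str for further processing...
name = name_bytes.decode("ascii").rstrip()

# ...while the binary portion stays as bytes until it is interpreted.
(number,) = struct.unpack("<i", number_bytes)
```

Here number_bytes is b"A\x00\x00\x00": every byte is under 128, yet none
of it is text, which is why a high-bit test alone would not be enough.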
The above is not an argument against either the 'latin-1' or the
'surrogateescape' technique; it is only a comment on a different data
type that probably has different uses.
~Ethan~