[Python-ideas] Python 3000 TIOBE -3%

Ethan Furman ethan at stoneleaf.us
Thu Feb 16 16:21:23 CET 2012


Greg Ewing wrote:
> It seems to me that what surrogateescape is effectively doing is
> creating a new data type that consists of a mixture of ASCII
> characters and raw bytes, and enables you to tell which is which.

How so?  It sounds like this new data type assumes everything over 127
is a raw byte, but there are plenty of applications where values in the
range 0-127 should be interpreted as raw bytes even when the majority
are indeed just plain ASCII.
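
A minimal sketch of what I mean, assuming Python 3.1 or later (where the
surrogateescape error handler is available) and a made-up byte string.
Only the bytes with the high bit set can be told apart after decoding;
the low control bytes come out as ordinary characters:

```python
# Made-up sample: four ASCII letters, two low "raw" bytes, two high bytes.
raw = b"name\x00\x07\xff\x80"

text = raw.decode("ascii", errors="surrogateescape")

# Bytes >= 0x80 become lone surrogates in U+DC80..U+DCFF, so they can
# still be identified as raw bytes after decoding.
escaped = [ch for ch in text if "\udc80" <= ch <= "\udcff"]

# Bytes < 0x80 (here 0x00 and 0x07) decode as ordinary code points;
# nothing marks them as raw data rather than text.
low = [ch for ch in text if ord(ch) < 0x80]

# Encoding with the same error handler restores the original bytes.
roundtrip = text.encode("ascii", errors="surrogateescape")
```

Here `escaped` holds only the two high bytes, while `\x00` and `\x07`
sit in `low` alongside the letters of "name".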


> Maybe there should be a real data type like this, or a flag on
> the unicode type. The data would be stored in the same way as a
> latin1-decoded string, but anything with the high bit set would
> be regarded as a byte instead of a character. This might make it
> easier to interoperate with external libraries that expect
> well-formed unicode.

I can see the use of a data type that is easier to work with than bytes
(ascii-string, anybody? ;) but I don't think we want to make it any kind
of unicode.  Once the text has been extracted from this ascii-string it
should be converted to unicode for further processing, while the
remaining non-convertible bytes should stay as bytes (or ascii-string,
or whatever we call it).
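
As a rough illustration of that split, here is a sketch using plain
bytes and the struct module.  The record layout (an 8-byte space-padded
ASCII name followed by a 4-byte little-endian integer) is purely
hypothetical; it is only meant to show text being pulled out and
decoded while the rest stays as bytes:

```python
import struct

# Hypothetical 12-byte record: 8-byte space-padded ASCII name, then a
# 4-byte little-endian unsigned integer.
record = b"ETHAN   " + struct.pack("<I", 0x01020304)

name_bytes, payload = record[:8], record[8:]

# Only the text field is decoded to str (unicode) for further processing.
name = name_bytes.decode("ascii").rstrip()

# The payload stays as bytes.  Every byte in it is below 128, yet none
# of it is text.
value = struct.unpack("<I", payload)[0]
```

The payload bytes here are all in the 0-127 range, which is the case a
"high bit means raw byte" rule would misread as characters.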

The above is not an argument against either the 'latin-1' or the
'surrogateescape' technique; it is only a comment on a different data
type, probably with different uses.

~Ethan~


