Greg Ewing wrote:
It seems to me that what surrogateescape is effectively doing is creating a new data type that consists of a mixture of ASCII characters and raw bytes, and enables you to tell which is which.
How so? It sounds like this new data type assumes everything over 127 is a raw byte, but there are plenty of applications where values between 0 and 127 should be interpreted as raw bytes, even when the majority are indeed plain ASCII.
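For reference, here is roughly how the handler behaves today; which bytes end up escaped depends entirely on the codec you decode with (a minimal sketch):

    # Bytes the codec cannot decode are smuggled through as lone
    # surrogates in the range U+DC80..U+DCFF.
    raw = b'caf\xc3\xa9 \xff'                    # valid UTF-8 plus one stray byte
    text = raw.decode('utf-8', errors='surrogateescape')
    print(ascii(text))                           # 'caf\xe9 \udcff' -- only \xff was escaped
    # Re-encoding with the same handler restores the original bytes exactly.
    assert text.encode('utf-8', errors='surrogateescape') == raw
    # With the ASCII codec, every byte >= 0x80 gets escaped:
    print(ascii(b'\x80abc\xff'.decode('ascii', errors='surrogateescape')))
    # '\udc80abc\udcff'

Note that the handler only maps bytes 0x80-0xFF onto surrogates, so values in the 0-127 range can never be marked as raw -- which is the limitation being pointed out above.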
Maybe there should be a real data type like this, or a flag on the unicode type. The data would be stored in the same way as a latin1-decoded string, but anything with the high bit set would be regarded as a byte instead of a character. This might make it easier to interoperate with external libraries that expect well-formed unicode.
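(Purely for illustration, something like the following toy wrapper seems to capture that idea; the class name and methods are invented here, not an existing or proposed API:)

    class MixedString:
        """str-like wrapper over a latin-1 decode; code points >= 0x80
        are regarded as raw bytes rather than characters."""

        def __init__(self, data):
            self._s = data.decode('latin-1')     # lossless 1:1 byte-to-code-point mapping

        def is_byte(self, i):
            """True if position i holds a raw byte (high bit set)."""
            return ord(self._s[i]) >= 0x80

        def __str__(self):
            return self._s                       # well-formed unicode, safe to hand to libraries

        def __bytes__(self):
            return self._s.encode('latin-1')     # round-trips the original bytes exactly

    m = MixedString(b'abc\xffdef')
    print([i for i in range(7) if m.is_byte(i)])   # [3] -- only the \xff position
    print(bytes(m) == b'abc\xffdef')               # True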
I can see a data type that is easier to work with than bytes (ascii-string, anybody? ;), but I don't think we want to make it any kind of unicode -- once the text has been extracted from this ascii-string it should be converted to unicode for further processing, while any other non-convertible bytes should stay as bytes (or ascii-string, or whatever we call it). The above is not arguing against either the 'latin-1' or the 'surrogateescape' technique, only commenting on a different data type with probably different uses.

~Ethan~
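(Concretely, the split Ethan describes could look something like this; the helper name and the list-of-pieces representation are invented for illustration only:)

    def split_ascii(data):
        """Split bytes into a list of str (decodable ASCII runs) and bytes
        (everything else), preserving order."""
        pieces, run, run_is_ascii = [], bytearray(), None
        for b in data:
            is_ascii = b < 0x80
            if run and is_ascii != run_is_ascii:
                pieces.append(run.decode('ascii') if run_is_ascii else bytes(run))
                run = bytearray()
            run.append(b)
            run_is_ascii = is_ascii
        if run:
            pieces.append(run.decode('ascii') if run_is_ascii else bytes(run))
        return pieces

    print(split_ascii(b'GET /caf\xc3\xa9 HTTP/1.1'))
    # ['GET /caf', b'\xc3\xa9', ' HTTP/1.1']

The text runs come out as real str objects ready for further processing, while the non-convertible bytes stay as bytes.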