
Guido van Rossum wrote:
> ...
> I've heard a few people claim that strings should always be considered to contain "characters" and that there should be one character per string element. I've also heard a clamoring that there should only be one string type. You folks have never used Asian encodings. In countries like Japan, China and Korea, encodings are a fact of life, and the most popular encodings are ASCII supersets that use a variable number of bytes per character, just like UTF-8. Each country or language uses different encodings, even though their characters look mostly the same to western eyes. UTF-8 and Unicode are having a hard time getting adopted in these countries because most software that people use deals only with the local encodings. (Sounds familiar?)
I think that maybe an important point is getting lost here. I could be wrong, but it seems that all of this emphasis on encodings is misplaced. The physical and logical makeup of character strings are entirely separate issues. Unicode is a character set. It works in the logical domain. Dozens of different physical encodings can be used for Unicode characters. There are XML users who work with XML (and thus Unicode) every day and never see UTF-8, UTF-16 or any other Unicode-consortium "sponsored" encoding. If you invent an encoding tomorrow, it can still be XML-compatible. There are many encodings older than Unicode that are XML (and Unicode) compatible. I have not heard complaints about the XML way of looking at the world, and in fact it was explicitly endorsed by many of the world's leading experts on internationalization. I haven't followed the Java situation as closely, but I have also not heard screams about its support for i18n.
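To make the logical/physical split concrete, here is a minimal sketch in present-day Python (purely illustrative; the string and the list of codecs are my own choices, not anything from this thread). One sequence of characters lives in the logical domain, while each codec gives it a different physical byte form, and every form round-trips back to the same characters:

    # One logical string of Unicode characters ("Japanese", three characters).
    text = "日本語"

    # Several physical encodings of the same logical string; the byte
    # sequences and their lengths differ, the characters do not.
    for codec in ("utf-8", "utf-16-le", "euc-jp", "iso-2022-jp"):
        raw = text.encode(codec)
        print(codec, len(raw), raw)
        assert raw.decode(codec) == text   # every codec round-trips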
> The truth of the matter is: the encoding of string objects is in the mind of the programmer. When I read a GIF file into a string object, the encoding is "binary goop".
IMHO, it's a mistake of history that you would even think it makes sense to read a GIF file into a "string" object, and we should be trying to erase that mistake as quickly as possible (which is admittedly not very quickly), not building more and more infrastructure around it. How can we make the transition to a "binary goops are not strings" world easiest?
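For what it's worth, here is a hypothetical sketch (in Python-like form; the file names and the exact open() signature are assumptions of mine, not a description of the current interpreter) of how a world that keeps binary goop and character strings apart might look:

    # Binary goop: raw bytes read from disk, no characters, no encoding implied.
    with open("picture.gif", "rb") as f:            # hypothetical GIF file
        goop = f.read()                              # a byte sequence

    # A character string: bytes decoded through an explicitly named encoding.
    with open("notes.txt", encoding="utf-8") as f:   # hypothetical text file
        notes = f.read()                             # a sequence of characters

    # The two kinds never mix silently; crossing the boundary is an explicit
    # step, e.g. notes.encode("utf-8") to get bytes, or goop.decode(...) only
    # when the programmer really does know the encoding.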
> The moral of all this? 8-bit strings are not going away.
If that is a statement of your long-term vision, then I think that it is very unfortunate. Treating string literals as if they were isomorphic with byte arrays was probably the right thing in 1991, but it won't be in 2005. It doesn't meet the definition of string used in the Unicode spec, nor in XML, nor in Java, nor at the W3C, nor in most other up-and-coming specifications.
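To spell the difference out: the definitions cited above treat a string as a sequence of characters, so its length and indexing are independent of whatever byte encoding happens to carry it. A tiny illustrative sketch (again in modern, hypothetical syntax rather than today's interpreter):

    s = "naïve"               # five characters
    b = s.encode("utf-8")     # six bytes, because "ï" needs two bytes in UTF-8

    print(len(s))             # 5 -- length counted in characters
    print(len(b))             # 6 -- length counted in bytes; not the same thing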