I'd like to reset this discussion. I don't think we need to involve c.l.py yet -- I haven't seen anyone with Asian language experience chime in there, and that's where this matters most. I am directing this to the Python i18n-sig mailing list, because that's where the debate belongs, and there interested parties can join the discussion without having to be vetted as "fit for python-dev" first.
I apologize for having been less than responsive in the matter; unfortunately there's lots of other stuff on my mind right now that has recently had a tendency to distract me with higher priority crises.
I've heard a few people claim that strings should always be considered to contain "characters" and that there should be one character per string element. I've also heard a clamoring that there should only be one string type. You folks have never used Asian encodings. In countries like Japan, China and Korea, encodings are a fact of life, and the most popular encodings are ASCII supersets that use a variable number of bytes per character, just like UTF-8. Each country or language uses different encodings, even though their characters look mostly the same to western eyes. UTF-8 and Unicode are having a hard time getting adopted in these countries because most software that people use deals only with the local encodings. (Sound familiar?)
These encodings are much less "pure" than UTF-8, because they only encode the local characters (and ASCII), and because of various problems with slicing: if you look "in the middle" of an encoded string or file, you may not know how to interpret the bytes you see. There are overlaps (in most of these encodings anyway) between the byte values used for single-byte characters and those used inside double-byte characters, and you may have to look back one or more characters to know what to make of the particular byte you see. To get an idea of the nightmares that non-UTF-8 multibyte encodings give C/C++ programmers, see the Multibyte Character Set (MBCS) Survival Guide (http://msdn.microsoft.com/library/backgrnd/html/msdn_mbcssg.htm). See also the home page of the i18n-sig for more background information on encoding (and other i18n) issues (http://www.python.org/sigs/i18n-sig/).
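To make the overlap concrete, here is a minimal sketch, assuming a Python whose strings are Unicode and which ships a "shift_jis" codec. The character chosen (katakana SO) and the helper name are only illustrative assumptions; the point is that a byte-level search finds an ASCII backslash that is really the second half of a double-byte character.

```python
# "\u30bd" is the katakana letter SO. It is picked here only because its
# Shift-JIS encoding ends in 0x5C, the same value as an ASCII backslash.
text = "\u30bd"
encoded = text.encode("shift_jis")


def naive_contains_backslash(data: bytes) -> bool:
    """Byte-level search that knows nothing about character boundaries."""
    return b"\\" in data


# The text contains no backslash, yet the naive byte search reports one.
found_in_bytes = naive_contains_backslash(encoded)
found_in_text = "\\" in text
print(encoded, found_in_bytes, found_in_text)
```

Code that splits paths on backslashes or scans for escape characters at the byte level runs into exactly this kind of false match.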
UTF-8 attempts to solve some of these problems: the multi-byte encodings are chosen such that you can tell by the high bits of each byte whether it is (1) a single-byte (ASCII) character (top bit off), (2) the start of a multi-byte character (at least two top bits on; how many indicates the total number of bytes comprising the character), or (3) a continuation byte in a multi-byte character (top bit on, next bit off).
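The three cases can be sketched as a pair of small functions. This is only an illustration of the bit tests described above, not a validator: it does not reject byte values that can never appear in well-formed UTF-8, and the function names are made up for the example.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte value (0-255) by its high bits."""
    if b & 0x80 == 0:
        return "single"        # top bit off: a single-byte (ASCII) character
    if b & 0xC0 == 0x80:
        return "continuation"  # top bit on, next bit off
    return "lead"              # at least two top bits on


def sequence_length(lead: int) -> int:
    """Count the leading 1 bits of a lead byte: the total bytes in the character."""
    n = 0
    mask = 0x80
    while lead & mask:
        n += 1
        mask >>= 1
    return n


# "a", c with cedilla, and the euro sign take 1, 2 and 3 bytes respectively.
data = "a\u00e7\u20ac".encode("utf-8")
kinds = [classify_utf8_byte(b) for b in data]
print(kinds)
```

Because every byte announces its own role, a reader dropped into the middle of a UTF-8 stream only has to skip continuation bytes to find the next character boundary.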
Many of the problems with non-UTF-8 multibyte encodings are the same as for UTF-8 though: #bytes != #characters, a byte may not be a valid character, regular expression patterns using "." may give the wrong results, and so on.
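A small sketch of these three problems, assuming a Python with separate Unicode and byte strings. The sample text and the helper name are arbitrary choices for illustration.

```python
import re

# "naive" with a diaeresis on the i, a space, and two kanji: 8 characters.
text = "na\u00efve \u65e5\u672c"
data = text.encode("utf-8")


def decodes_cleanly(chunk: bytes) -> bool:
    """Return True if chunk is a complete, valid UTF-8 byte sequence."""
    try:
        chunk.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


char_count = len(text)   # counts characters
byte_count = len(data)   # counts bytes, which is a larger number here

# Slicing after three bytes cuts the two-byte "i with diaeresis" in half.
cut_ok = decodes_cleanly(data[:3])

# "." matches one byte in a byte pattern but one character in a text pattern.
dot_matches_bytes = len(re.findall(rb".", data, re.DOTALL))
dot_matches_chars = len(re.findall(r".", text, re.DOTALL))
print(char_count, byte_count, cut_ok, dot_matches_bytes, dot_matches_chars)
```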
The truth of the matter is: the encoding of string objects is in the mind of the programmer. When I read a GIF file into a string object, the encoding is "binary goop". When I read a line of Japanese text from a file, the encoding may be JIS, shift-JIS, or EUC -- this has to be an assumption built into my program, or perhaps information supplied separately (there's no easy way to guess based on the actual data). When I type a string literal using Latin-1 characters, the encoding is Latin-1. When I use octal escapes in a string literal, e.g. '\303\247', the encoding could be UTF-8 (this is a c with a cedilla). When I type a 7-bit string literal, the encoding is ASCII.
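As a minimal sketch of that point, assuming a Python with distinct byte and Unicode strings: the same two bytes yield different text depending on which encoding the program decides to assume. Nothing in the bytes themselves says which reading is right.

```python
# The two bytes written with octal escapes in the example above.
raw = b"\303\247"

# Read as UTF-8, they form one character: c with a cedilla (U+00E7).
as_utf8 = raw.decode("utf-8")

# Read as Latin-1, every byte is its own character, so there are two:
# A with tilde (U+00C3) followed by the section sign (U+00A7).
as_latin1 = raw.decode("latin-1")

print(len(as_utf8), len(as_latin1))
```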
The moral of all this? 8-bit strings are not going away. They are not encoded in UTF-8 henceforth. Like before, and like 8-bit text files, they are encoded in whatever encoding you want. All you get is an extra mechanism to convert them to Unicode, and the Unicode conversion defaults to UTF-8 because it is the only conversion that is reversible. And, as Tim Peters quoted Andy Robinson (paraphrasing Tim's paraphrase), UTF-8 annoys everyone equally.
Where does the current approach require work?
- We need a way to indicate the encoding of Python source code. (Probably a "magic comment".)
- We need a way to indicate the encoding of input and output data files, and we need shortcuts to set the encoding of stdin, stdout and stderr (and maybe all files opened without an explicit encoding). Marc-Andre showed some sample code, but I believe it is still cumbersome. (I have to play with it more to see how it could be improved.)
- We need to discuss whether there should be a way to change the default conversion between Unicode and 8-bit strings (currently hardcoded to UTF-8), in order to make life easier for people who want to continue to use their favorite 8-bit encoding (e.g. Latin-1, or shift-JIS) but who also want to make use of the new Unicode datatype.
We're still in alpha, so we can still fix things.
--Guido van Rossum (home page: http://www.python.org/%7Eguido/)