Hi,

I'm not sure how much value I can add, as I know little about the charsets and a bit more about Python. As a user of these, running a consultancy firm in Hong Kong, I can at least pass on some points and perhaps help you with testing later on. My first touch on international PCs was fixing a Japanese 8086 back in 1989; it didn't even have colour!

Hong Kong is quite an experience, as there are two formats in common use, plus occasionally another gets thrown in. In HK they use Traditional Chinese, whereas the mainland uses Simplified; as Guido says, there are a number of different encodings for each. Occasionally we see the Taiwanese charsets used as well.

It seems to me that encoding each individual string variable might be too atomic, creating a cumbersome overhead in the system. For most applications I can settle for the entire app using a single charset, though experience says there are exceptions. We normally work with prior knowledge of the charset in use, rather than having to handle any charset that may come along (at the application level), and therefore generally work in a context, just as a European programmer would work in, say, English or German.

As you know, storage/retrieval is not a problem, but manipulation and comparison is. A nice way to handle this would be something like operator overloading, so that string operations are performed in the context of the current charset; I could then change context as needed, removing the need for metadata surrounding the actual data. This should speed things up, since each overloaded library could be optimised for the quirks of its charset, and new ones could be added easily. My code could then be reused on different charsets simply by changing the context externally, rather than passing lots of stuff in and expecting Python to deal with it. (I've put a rough sketch in a P.S. below.)

Also, I'd very much like to compile/load only the international charsets that I need. I wouldn't want to see Java-style bloat come to Python; adding internationalisation for everything is huge.

I think what I am suggesting is a different approach, one which obviously places more onus on the programmer rather than on Python. Perhaps this is not acceptable; I don't know, as I've never developed a programming language. I hope this is a helpful point of view to get you thinking further; otherwise ... please ignore me and I'll keep quiet : )

Regards,
Paul
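P.S. Something along these lines is what I have in mind. Every name here is made up, purely to illustrate the programming model, not to propose an actual API; the encoded bytes shown are real Big5 and GB2312:

    # Entirely hypothetical sketch of the "charset context" idea -- the
    # names charset_context and char_count do not exist in Python; they
    # only illustrate switching context externally to the code.
    from contextlib import contextmanager

    _current_charset = "ascii"   # module-level context, switched externally

    @contextmanager
    def charset_context(name):
        """Temporarily switch the charset that string operations assume."""
        global _current_charset
        saved, _current_charset = _current_charset, name
        try:
            yield
        finally:
            _current_charset = saved

    def char_count(raw):
        """Length in characters, interpreting raw bytes under the context."""
        return len(raw.decode(_current_charset))

    with charset_context("big5"):      # Traditional Chinese (HK, Taiwan)
        print(char_count(b"\xa4\xa4\xa4\xe5"))  # "中文" in Big5: 4 bytes, 2 chars
    with charset_context("gb2312"):    # Simplified Chinese (mainland)
        print(char_count(b"\xd6\xd0\xce\xc4"))  # the same text in GB2312

----- Original Message -----
From: "Guido van Rossum" <guido@python.org>
To: <python-dev@python.org>; <i18n-sig@python.org>
Cc: "Just van Rossum" <just@letterror.com>
Sent: Thursday, April 27, 2000 11:01 PM
Subject: [I18n-sig] Unicode debate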
I'd like to reset this discussion. I don't think we need to involve c.l.py yet -- I haven't seen anyone with Asian language experience chime in there, and that's where this matters most. I am directing this to the Python i18n-sig mailing list, because that's where the debate belongs, and where interested parties can join the discussion without having to be vetted as "fit for python-dev" first.
I apologize for having been less than responsive in the matter; unfortunately there's lots of other stuff on my mind right now that has recently had a tendency to distract me with higher priority crises.
I've heard a few people claim that strings should always be considered to contain "characters" and that there should be one character per string element. I've also heard a clamoring that there should be only one string type. You folks have never used Asian encodings. In countries like Japan, China and Korea, encodings are a fact of life, and the most popular encodings are ASCII supersets that use a variable number of bytes per character, just like UTF-8. Each country or language uses different encodings, even though their characters look mostly the same to western eyes. UTF-8 and Unicode are having a hard time getting adopted in these countries because most software that people use deals only with the local encodings. (Sound familiar?)
These encodings are much less "pure" than UTF-8, because they only encode the local characters (and ASCII), and because of various problems with slicing: if you look "in the middle" of an encoded string or file, you may not know how to interpret the bytes you see. In most of these encodings there are overlaps between the codes used for single-byte and double-byte characters, so you may have to look back one or more characters to know what to make of the particular byte you see. To get an idea of the nightmares that non-UTF-8 multibyte encodings give C/C++ programmers, see the Multibyte Character Set (MBCS) Survival Guide (http://msdn.microsoft.com/library/backgrnd/html/msdn_mbcssg.htm). See also the home page of the i18n-sig for more background information on encoding (and other i18n) issues (http://www.python.org/sigs/i18n-sig/).
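A small illustration of both problems, using real Shift-JIS bytes (the sketch assumes a notation where encoded data is a bytes object with a .decode() method; that notation is illustrative, not a committed API):

    # "日本語" ("Japanese") encoded as Shift-JIS: 6 bytes, 3 characters.
    data = "日本語".encode("shift_jis")   # b'\x93\xfa\x96{\x8c\xea'

    # Overlap: the byte 0x7b is ASCII '{', but here it is the *second*
    # byte of the character 本 -- you cannot classify it without looking
    # back at the byte before it.
    print(data[3:4])                      # b'{'

    # Slicing mid-character leaves bytes you cannot interpret:
    try:
        data[:3].decode("shift_jis")      # splits a character in half
    except UnicodeDecodeError as exc:
        print("undecodable slice:", exc)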
UTF-8 attempts to solve some of these problems: the multi-byte encodings are chosen such that you can tell by the high bits of each byte whether it is (1) a single-byte (ASCII) character (top bit off), (2) the start of a multi-byte character (at least two top bits on; how many indicates the total number of bytes comprising the character), or (3) a continuation byte in a multi-byte character (top bit on, next bit off).
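That rule is mechanical enough to write down in a few lines (a sketch; the helper name is made up):

    # Classifying UTF-8 bytes by their top bits, as described above.
    def utf8_byte_kind(b):
        if b & 0b10000000 == 0:            # top bit off: single-byte ASCII
            return "ascii"
        if b & 0b11000000 == 0b10000000:   # 10xxxxxx: continuation byte
            return "continuation"
        return "start"                     # 11xxxxxx: starts a multi-byte
                                           # char; the count of leading 1s
                                           # gives the total byte count

    for b in "aç中".encode("utf-8"):
        print(hex(b), utf8_byte_kind(b))
    # 0x61 ascii / 0xc3 start / 0xa7 continuation / 0xe4 start / ...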
Many of the problems with non-UTF-8 multibyte encodings are the same as for UTF-8 though: #bytes != #characters, a byte may not be a valid character, regular expression patterns using "." may give the wrong results, and so on.
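For example (the counts are exact; the notation follows the earlier sketches):

    import re

    text = "naïve"                 # 5 characters
    raw = text.encode("utf-8")     # 6 bytes: the ï becomes \xc3\xaf
    print(len(text), len(raw))     # 5 6 -- #bytes != #characters

    # A byte-oriented "." matches single bytes, so it happily splits the
    # two bytes of ï; a character-oriented "." does not:
    print(re.findall(b".", raw))   # 6 one-byte matches, two of them goop
    print(re.findall(".", text))   # 5 character matches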
The truth of the matter is: the encoding of string objects is in the mind of the programmer. When I read a GIF file into a string object, the encoding is "binary goop". When I read a line of Japanese text from a file, the encoding may be JIS, shift-JIS, or EUC -- this has to be an assumption built into my program, or perhaps information supplied separately (there's no easy way to guess based on the actual data). When I type a string literal using Latin-1 characters, the encoding is Latin-1. When I use octal escapes in a string literal, e.g. '\303\247', the encoding could be UTF-8 (this is ç, a c with cedilla). When I type a 7-bit string literal, the encoding is ASCII.
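The same two bytes, decoded under two different assumptions:

    raw = b"\303\247"              # the octal escapes from above
    print(raw.decode("utf-8"))     # 'ç'  -- if you assume UTF-8
    print(raw.decode("latin-1"))   # 'Ã§' -- same bytes, Latin-1 assumed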
The moral of all this? 8-bit strings are not going away. They are not encoded in UTF-8 henceforth. Like before, and like 8-bit text files, they are encoded in whatever encoding you want. All you get is an extra mechanism to convert them to Unicode, and the Unicode conversion defaults to UTF-8 because it is the only conversion that is reversible. And, as Tim Peters quoted Andy Robinson (paraphrasing Tim's paraphrase), UTF-8 annoys everyone equally.
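In the notation of the sketches above, the extra mechanism is just a pair of explicit conversions, and the UTF-8 default round-trips:

    raw = b"\303\247"                  # 8-bit data; its encoding is in your head
    u = raw.decode("utf-8")            # explicit conversion to Unicode
    assert u.encode("utf-8") == raw    # and back again, losslessly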
Where does the current approach require work?
- We need a way to indicate the encoding of Python source code. (Probably a "magic comment".)
- We need a way to indicate the encoding of input and output data files, and we need shortcuts to set the encoding of stdin, stdout and stderr (and maybe all files opened without an explicit encoding). Marc-Andre showed some sample code, but I believe it is still cumbersome. (I have to play with it more to see how it could be improved.)
- We need to discuss whether there should be a way to change the default conversion between Unicode and 8-bit strings (currently hardcoded to UTF-8), in order to make life easier for people who want to continue to use their favorite 8-bit encoding (e.g. Latin-1, or shift-JIS) but who also want to make use of the new Unicode datatype. (A rough sketch of all three hooks follows this list.)
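To make the three bullets concrete, here is a rough sketch. Nothing in it is settled API: the magic-comment syntax, the filename, and the setdefaultencoding name are all made up for illustration; codecs.open is from Marc-Andre's codec work.

    # 1. Source encoding via a "magic comment" at the top of a module,
    #    e.g. along the lines of the Emacs convention:
    #
    #    # -*- coding: shift_jis -*-

    # 2. An encoding attached to a file object at open time, so data is
    #    decoded to Unicode on the way in (hypothetical filename):
    import codecs
    f = codecs.open("nihongo.txt", "r", encoding="shift_jis")
    text = f.read()
    f.close()

    # 3. A process-wide switch for the default 8-bit <-> Unicode
    #    conversion (hypothetical name; the default is currently
    #    hardcoded to UTF-8):
    # sys.setdefaultencoding("latin-1")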
We're still in alpha, so we can still fix things.
--Guido van Rossum (home page: http://www.python.org/~guido/)