[I18n-sig] Unicode debate

Paul Gresham gresham@mediavisual.com
Fri, 28 Apr 2000 00:41:04 +0800

Hi, I'm not sure how much value I can add, as I know little about the
charsets etc. and a bit more about Python. As a user of these, and running a
consultancy firm in Hong Kong, I can at least pass on some points and
perhaps help you with testing later on. My first touch on international PCs
was fixing a Japanese 8086 back in 1989; it didn't even have colour! Hong
Kong is quite an experience, as there are two formats in common use, plus
occasionally another gets thrown in. In HK they use Traditional Chinese,
whereas the mainland uses Simplified. As Guido says, there are a number of
different encodings for each of these. Occasionally we see the Taiwanese
charsets used.

It seems to me that having each individual string variable encoded might
just be too atomic, perhaps creating a cumbersome overhead in the system.
For most applications I can settle for the entire app using a single
charset; however, from experience there are exceptions. We are normally
working with prior knowledge of the charset being used, rather than having
to deal with any charset which may come along (at an application level), and
therefore generally work in a context, just as a European programmer would
be working in say English or German.

As you know, storage/retrieval is not a problem, but manipulation and
comparison are. A nice way to handle this would be something like operator
overloading, such that string operations would be performed in the context
of the current charset. I could then change context as needed, removing the
need for metadata surrounding the actual data. This should speed things up,
as each overloaded library could be optimised for the quirks of its charset,
and new ones could be added easily. My code could be easily re-used on
different charsets by simply changing context externally to the code, rather
than passing in lots of stuff and expecting Python to deal with it.
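
To make the idea concrete, here is a rough sketch of what "changing context
externally" might look like. It is written against present-day Python 3, and
every name in it (charset, current_charset, char_len, char_slice) is made up
for illustration; none of this is an existing or proposed Python API.

```python
import contextlib

# Hypothetical module-level stack holding the "current charset" context.
_charset_stack = ["ascii"]

@contextlib.contextmanager
def charset(name):
    """Temporarily make `name` the charset used by the helpers below."""
    _charset_stack.append(name)
    try:
        yield
    finally:
        _charset_stack.pop()

def current_charset():
    """Return the charset that is in effect right now."""
    return _charset_stack[-1]

def char_len(data):
    """Count characters, not bytes, according to the current charset."""
    return len(data.decode(current_charset()))

def char_slice(data, start, stop):
    """Slice by character position; return bytes in the same charset."""
    cs = current_charset()
    return data.decode(cs)[start:stop].encode(cs)

# Three Traditional Chinese characters (U+4E2D U+6587 U+5B57) stored as Big5.
raw = "\u4e2d\u6587\u5b57".encode("big5")

with charset("big5"):
    n = char_len(raw)             # counts characters under Big5
    head = char_slice(raw, 0, 1)  # bytes of the first character only
```

The calling code never mentions Big5 inside char_len or char_slice, so the
same functions could be reused under a different charset by changing only the
"with" line.
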

Also I'd like very much to compile/load in only the international charsets
that I need. I wouldn't want to see Java-type bloat occurring in Python, and
adding internationalisation for everything is huge.

I think what I am suggesting is a different approach which obviously places
more onus on the programmer rather than Python. Perhaps this is not
acceptable; I don't know, as I've never developed a programming language.

I hope this is a helpful point of view to get you thinking further,
otherwise ... please ignore me and I'll keep quiet : )


----- Original Message -----
From: "Guido van Rossum" <guido@python.org>
To: <python-dev@python.org>; <i18n-sig@python.org>
Cc: "Just van Rossum" <just@letterror.com>
Sent: Thursday, April 27, 2000 11:01 PM
Subject: [I18n-sig] Unicode debate

> I'd like to reset this discussion.  I don't think we need to involve
> c.l.py yet -- I haven't seen anyone with Asian language experience
> chime in there, and that's where this matters most.  I am directing
> this to the Python i18n-sig mailing list, because that's where the
> debate belongs, and there interested parties can join the discussion
> without having to be vetted as "fit for python-dev" first.
>
> I apologize for having been less than responsive in the matter;
> unfortunately there's lots of other stuff on my mind right now that
> has recently had a tendency to distract me with higher priority
> crises.
>
> I've heard a few people claim that strings should always be considered
> to contain "characters" and that there should be one character per
> string element.  I've also heard a clamoring that there should only be
> one string type.  You folks have never used Asian encodings.  In
> countries like Japan, China and Korea, encodings are a fact of life,
> and the most popular encodings are ASCII supersets that use a variable
> number of bytes per character, just like UTF-8.  Each country or
> language uses different encodings, even though their characters look
> mostly the same to western eyes.  UTF-8 and Unicode are having a hard
> time getting adopted in these countries because most software that
> people use deals only with the local encodings.  (Sounds familiar?)
>
> These encodings are much less "pure" than UTF-8, because they only
> encode the local characters (and ASCII), and because of various
> problems with slicing: if you look "in the middle" of an encoded
> string or file, you may not know how to interpret the bytes you see.
> There are overlaps (in most of these encodings anyway) between the
> codes used for single-byte and double-byte encodings, and you may have
> to look back one or more characters to know what to make of the
> particular byte you see.  To get an idea of the nightmares that
> non-UTF-8 multibyte encodings give C/C++ programmers, see the
> Multibyte Character Set (MBCS) Survival Guide
> (http://msdn.microsoft.com/library/backgrnd/html/msdn_mbcssg.htm).
> See also the home page of the i18n-sig for more background information
> on encoding (and other i18n) issues
> (http://www.python.org/sigs/i18n-sig/).
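
The overlap between single-byte and double-byte codes described above can be
seen in a small sketch. It assumes present-day Python 3 and its bundled
"shift_jis" codec; the variable names are only illustrative.

```python
# U+30BD is KATAKANA LETTER SO; in Shift-JIS it is the two bytes 0x83 0x5C.
so = "\u30bd".encode("shift_jis")

# The trail byte 0x5C is also the ASCII code for backslash, so a byte-level
# search "finds" a backslash that is not actually in the text.
false_hit = b"\\" in so

# Slicing at an arbitrary byte offset lands in the middle of the character:
# what is left looks exactly like a lone ASCII backslash.
tail = so[1:]
```

Only by looking back at the preceding byte (0x83) can a program tell that
the 0x5C is the second half of a double-byte character.
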
> UTF-8 attempts to solve some of these problems: the multi-byte
> encodings are chosen such that you can tell by the high bits of each
> byte whether it is (1) a single-byte (ASCII) character (top bit off),
> (2) the start of a multi-byte character (at least two top bits on; how
> many indicates the total number of bytes comprising the character), or
> (3) a continuation byte in a multi-byte character (top bit on, next
> bit off).
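
The three cases above can be sketched as simple bit tests. This is a minimal
illustration in present-day Python 3, not a validating decoder; the function
names classify and sequence_length are made up for the example.

```python
def classify(byte):
    """Classify one UTF-8 byte (an int 0..255) by its high bits."""
    if (byte & 0x80) == 0:
        return "single"          # 0xxxxxxx: plain ASCII
    if (byte & 0xC0) == 0x80:
        return "continuation"    # 10xxxxxx: inside a multi-byte character
    return "lead"                # 11xxxxxx: starts a multi-byte character

def sequence_length(lead):
    """Count the leading 1 bits of a lead byte: the total bytes in the character."""
    n = 0
    while n < 8 and lead & (0x80 >> n):
        n += 1
    return n

# One 1-byte, one 2-byte and one 3-byte character: "a", U+00E7, U+4E2D.
data = "a\u00e7\u4e2d".encode("utf-8")
kinds = [classify(b) for b in data]
```

Because every byte announces its own role, a program that lands in the
middle of a UTF-8 string can skip continuation bytes until it reaches the
next "single" or "lead" byte.
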
> Many of the problems with non-UTF-8 multibyte encodings are the same
> as for UTF-8 though: #bytes != #characters, a byte may not be a valid
> character, regular expression patterns using "." may give the wrong
> results, and so on.
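
A short sketch of the byte/character mismatch, assuming present-day Python 3
where byte strings and Unicode strings are distinct types. The sample text is
three Japanese characters written as escapes.

```python
import re

text = "\u65e5\u672c\u8a9e"     # three characters
data = text.encode("utf-8")     # the same text as UTF-8 bytes

char_count = len(text)          # number of characters
byte_count = len(data)          # number of bytes, which is larger

# "." matches one element of whatever it is given: a character in the
# Unicode string, but a single byte in the encoded string.
char_matches = re.findall(r".", text)
byte_matches = re.findall(rb".", data)
```

Each item in byte_matches is one third of a character, which is the kind of
wrong result the paragraph above warns about.
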
> The truth of the matter is: the encoding of string objects is in the
> mind of the programmer.  When I read a GIF file into a string object,
> the encoding is "binary goop".  When I read a line of Japanese text
> from a file, the encoding may be JIS, shift-JIS, or EUC -- this has to
> be an assumption built-in to my program, or perhaps information
> supplied separately (there's no easy way to guess based on the actual
> data).  When I type a string literal using Latin-1 characters, the
> encoding is Latin-1.  When I use octal escapes in a string literal,
> e.g. '\303\247', the encoding could be UTF-8 (this is a c-cedilla).
> When I type a 7-bit string literal, the encoding is ASCII.
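
The point that the encoding lives "in the mind of the programmer" can be
shown with the octal-escape example above. This sketch assumes present-day
Python 3, where the conversion has to be requested explicitly.

```python
raw = b"\303\247"                  # the same two bytes as b"\xc3\xa7"

# Read as UTF-8, the two bytes form one character: c with cedilla (U+00E7).
as_utf8 = raw.decode("utf-8")

# Read as Latin-1, the same bytes are two unrelated characters:
# A with tilde (U+00C3) followed by the section sign (U+00A7).
as_latin1 = raw.decode("latin-1")
```

Nothing in the bytes themselves says which reading is right; the program has
to know.
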
> The moral of all this?  8-bit strings are not going away.  They are
> not encoded in UTF-8 henceforth.  Like before, and like 8-bit text
> files, they are encoded in whatever encoding you want.  All you get is
> an extra mechanism to convert them to Unicode, and the Unicode
> conversion defaults to UTF-8 because it is the only conversion that is
> reversible.  And, as Tim Peters quoted Andy Robinson (paraphrasing
> Tim's paraphrase), UTF-8 annoys everyone equally.
>
> Where does the current approach require work?
> - We need a way to indicate the encoding of Python source code.
> (Probably a "magic comment".)
> - We need a way to indicate the encoding of input and output data
> files, and we need shortcuts to set the encoding of stdin, stdout and
> stderr (and maybe all files opened without an explicit encoding).
> Marc-Andre showed some sample code, but I believe it is still
> cumbersome.  (I have to play with it more to see how it could be
> improved.)
> - We need to discuss whether there should be a way to change the
> default conversion between Unicode and 8-bit strings (currently
> hardcoded to UTF-8), in order to make life easier for people who want
> to continue to use their favorite 8-bit encoding (e.g. Latin-1, or
> shift-JIS) but who also want to make use of the new Unicode datatype.
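
As one possible shape for "indicating the encoding of input and output data",
here is a sketch using the stream wrappers from the standard codecs module.
It is written for present-day Python 3 and uses an in-memory buffer in place
of a real file; it is only an illustration, not the sample code referred to
above.

```python
import codecs
import io

text = "\u65e5\u672c\u8a9e"     # three Japanese characters

# Wrap a byte stream so that everything written to it is stored as Shift-JIS.
buf = io.BytesIO()
writer = codecs.getwriter("shift_jis")(buf)
writer.write(text)
raw = buf.getvalue()            # the encoded bytes as they would sit on disk

# Wrap a byte stream for reading; the encoding is stated once, up front.
reader = codecs.getreader("shift_jis")(io.BytesIO(raw))
round_trip = reader.read()
```

The encoding is named once when the stream is wrapped, and the rest of the
program deals only in Unicode strings.
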
> We're still in alpha, so we can still fix things.
> --Guido van Rossum (home page: http://www.python.org/~guido/)