[I18n-sig] Unicode debate

Guido van Rossum guido@python.org
Thu, 27 Apr 2000 11:01:48 -0400


I'd like to reset this discussion.  I don't think we need to involve
c.l.py yet -- I haven't seen anyone with Asian language experience
chime in there, and that's where this matters most.  I am directing
this to the Python i18n-sig mailing list, because that's where the
debate belongs, and there interested parties can join the discussion
without having to be vetted as "fit for python-dev" first.

I apologize for having been less than responsive in this matter;
unfortunately there's lots of other stuff on my mind right now that
keeps distracting me with higher-priority crises.

I've heard a few people claim that strings should always be considered
to contain "characters" and that there should be one character per
string element.  I've also heard a clamoring that there should only be
one string type.  You folks have never used Asian encodings.  In
countries like Japan, China and Korea, encodings are a fact of life,
and the most popular encodings are ASCII supersets that use a variable
number of bytes per character, just like UTF-8.  Each country or
language uses different encodings, even though their characters look
mostly the same to western eyes.  UTF-8 and Unicode are having a hard
time getting adopted in these countries because most software that
people use deals only with the local encodings.  (Sound familiar?)
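
For instance (sketched in Python 3 notation, just to make the byte
values concrete; the codec names are as spelled in today's codec
registry):

    # One Japanese character, three different byte sequences:
    ch = "\u3042"                     # HIRAGANA LETTER A
    print(ch.encode("shift_jis"))     # b'\x82\xa0'      (2 bytes)
    print(ch.encode("euc_jp"))        # b'\xa4\xa2'      (2 different bytes)
    print(ch.encode("utf-8"))         # b'\xe3\x81\x82'  (3 bytes)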

These encodings are much less "pure" than UTF-8, because they only
encode the local characters (and ASCII), and because of various
problems with slicing: if you look "in the middle" of an encoded
string or file, you may not know how to interpret the bytes you see.
There are overlaps (in most of these encodings anyway) between the
byte values used for single-byte and double-byte characters, and you may have
to look back one or more characters to know what to make of the
particular byte you see.  To get an idea of the nightmares that
non-UTF-8 multibyte encodings give C/C++ programmers, see the
Multibyte Character Set (MBCS) Survival Guide
(http://msdn.microsoft.com/library/backgrnd/html/msdn_mbcssg.htm).
See also the home page of the i18n-sig for more background information
on encoding (and other i18n) issues
(http://www.python.org/sigs/i18n-sig/).
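
Here is the classic example of that overlap, sketched in Python 3
notation (the particular kanji and codec are my choice of
illustration):

    # In shift-JIS, the second byte of a double-byte character can
    # land in the ASCII range.  U+8868 (a common kanji) encodes as
    # the two bytes 95 5C -- and 5C by itself is the ASCII backslash.
    b = "\u8868".encode("shift_jis")
    print(b)                          # b'\x95\\'
    print(bytes([b[1]]))              # b'\\' -- a lone backslash?
    # A scanner that hits the 5C mid-stream cannot tell a real
    # backslash from the tail of a kanji without backing up.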

UTF-8 attempts to solve some of these problems: the byte values are
chosen such that you can tell by the high bits of each
byte whether it is (1) a single-byte (ASCII) character (top bit off),
(2) the start of a multi-byte character (at least two top bits on; how
many indicates the total number of bytes comprising the character), or
(3) a continuation byte in a multi-byte character (top bit on, next
bit off).
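
Spelled out as a little classifier (a sketch, Python 3 notation):

    def utf8_byte_kind(b):
        """Classify one byte of UTF-8 data by its high bits."""
        if b & 0x80 == 0x00:
            return "single-byte (ASCII) character"
        if b & 0xC0 == 0x80:
            return "continuation byte"
        # Leading byte: the run of top bits that are on gives the
        # total number of bytes in the character (2, 3 or 4).
        n = 2
        while b & (0x80 >> n):
            n += 1
        return "start of a %d-byte character" % n

    for byte in "\u00e7".encode("utf-8"):     # c-cedilla: 0xC3 0xA7
        print(hex(byte), utf8_byte_kind(byte))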

Many of the problems with non-UTF-8 multibyte encodings are the same
as for UTF-8 though: #bytes != #characters, a byte may not be a valid
character, regular expression patterns using "." may give the wrong
results, and so on.
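
For example (Python 3 notation):

    import re
    s = "na\u00efve"              # 'naive' with a dieresis: 5 characters
    b = s.encode("utf-8")         # 6 bytes: the i-dieresis takes two
    print(len(s), len(b))         # 5 6
    print(re.findall(b".", b))    # byte-wise "." splits the character
    print(re.findall(".", s))     # character-wise "." does not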

The truth of the matter is: the encoding of string objects is in the
mind of the programmer.  When I read a GIF file into a string object,
the encoding is "binary goop".  When I read a line of Japanese text
from a file, the encoding may be JIS, shift-JIS, or EUC -- this has to
be an assumption built-in to my program, or perhaps information
supplied separately (there's no easy way to guess based on the actual
data).  When I type a string literal using Latin-1 characters, the
encoding is Latin-1.  When I use octal escapes in a string literal,
e.g. '\303\247', the encoding could be UTF-8 (this is a c-cedilla, ç).
When I type a 7-bit string literal, the encoding is ASCII.
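
To spell out the octal-escape example (Python 3 notation, where the
bytes/str split makes the two readings explicit):

    b = b"\303\247"               # the two bytes 0xC3 0xA7
    print(b.decode("utf-8"))      # 'ç'  -- one character, read as UTF-8
    print(b.decode("latin-1"))    # 'Ã§' -- two characters, read as Latin-1
    # Same bytes, two interpretations: the encoding really is in the
    # mind of the programmer (or in the decode() call).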

The moral of all this?  8-bit strings are not going away.  They are
not encoded in UTF-8 henceforth.  Like before, and like 8-bit text
files, they are encoded in whatever encoding you want.  All you get is
an extra mechanism to convert them to Unicode, and the Unicode
conversion defaults to UTF-8 because it is the only conversion that is
reversible.  And, as Tim Peters quoted Andy Robinson (paraphrasing
Tim's paraphrase), UTF-8 annoys everyone equally.
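
The reversibility claim, as a sketch (Python 3 notation):

    s = "abc \u00e7 \u3042 \u8868"    # ASCII, Latin-1 and CJK mixed
    # UTF-8 can encode every Unicode string, so this always holds:
    assert s.encode("utf-8").decode("utf-8") == s
    # Most 8-bit encodings cannot represent all of Unicode:
    try:
        s.encode("latin-1")
    except UnicodeEncodeError as exc:
        print(exc)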

Where does the current approach require work?

- We need a way to indicate the encoding of Python source code.
(Probably a "magic comment"; see the sketch after this list.)

- We need a way to indicate the encoding of input and output data
files, and we need shortcuts to set the encoding of stdin, stdout and
stderr (and maybe all files opened without an explicit encoding).
Marc-Andre showed some sample code, but I believe it is still
cumbersome.  (I have to play with it more to see how it could be
improved; a rough sketch of the general shape follows this list.)

- We need to discuss whether there should be a way to change the
default conversion between Unicode and 8-bit strings (currently
hardcoded to UTF-8), in order to make life easier for people who want
to continue to use their favorite 8-bit encoding (e.g. Latin-1, or
shift-JIS) but who also want to make use of the new Unicode datatype.
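
What the first two items could look like is sketched below (Python 3
notation; the magic-comment spelling borrows the Emacs convention, the
file name is a placeholder, and none of these spellings is settled):

    # -*- coding: utf-8 -*-
    # One candidate shape for the source-encoding "magic comment".

    import io, sys

    # Per-file encodings, declared when the file is opened:
    with open("data.txt", encoding="shift_jis") as f:   # hypothetical file
        text = f.read()           # bytes are decoded to Unicode on read

    # A shortcut for stdout could amount to rebinding it with an
    # explicit encoding:
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")
    print(text)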

We're still in alpha, so we can still fix things.

--Guido van Rossum (home page: http://www.python.org/~guido/)