Hi,

I'm not sure how much value I can add, as I know little about the charsets and a bit more about Python. As a user of these, running a consultancy firm in Hong Kong, I can at least pass on some points and perhaps help you with testing later on. My first touch on international PCs was fixing a Japanese 8086 back in 1989; it didn't even have colour!

Hong Kong is quite an experience, as there are two formats in common use, plus occasionally another gets thrown in. In HK they use Traditional Chinese, whereas the mainland uses Simplified; as Guido says, there are a number of different encodings for each. Occasionally we see the Taiwanese charsets used as well.

It seems to me that encoding each individual string variable might be too atomic, creating a cumbersome overhead in the system. For most applications I can settle for the entire app using a single charset, though experience says there are exceptions. We normally work with prior knowledge of the charset in use, rather than having to handle any charset that may come along (at the application level), and therefore generally work in a context, just as a European programmer would work in, say, English or German.

As you know, storage/retrieval is not a problem, but manipulation and comparison is. A nice way to handle this would be something like operator overloading, so that string operations are performed in the context of the current charset; I could then change context as needed, removing the need for metadata surrounding the actual data. This should speed things up, since each overloaded library could be optimised for the quirks of its charset, and new ones could be added easily. My code could then be reused on different charsets simply by changing the context externally, rather than passing lots of stuff in and expecting Python to deal with it. (I've put a rough sketch in a P.S. below.)

Also, I'd very much like to compile/load only the international charsets that I need. I wouldn't want to see Java-style bloat come to Python; adding internationalisation for everything is huge.

I think what I am suggesting is a different approach, one which obviously places more onus on the programmer rather than on Python. Perhaps this is not acceptable; I don't know, as I've never developed a programming language. I hope this is a helpful point of view to get you thinking further; otherwise ... please ignore me and I'll keep quiet : )

Regards,
Paul
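P.S. Something along these lines is what I have in mind. Every name here is made up, purely to illustrate the programming model, not to propose an actual API; the encoded bytes shown are real Big5 and GB2312:

    # Entirely hypothetical sketch of the "charset context" idea -- the
    # names charset_context and char_count do not exist in Python; they
    # only illustrate switching context externally to the code.
    from contextlib import contextmanager

    _current_charset = "ascii"   # module-level context, switched externally

    @contextmanager
    def charset_context(name):
        """Temporarily switch the charset that string operations assume."""
        global _current_charset
        saved, _current_charset = _current_charset, name
        try:
            yield
        finally:
            _current_charset = saved

    def char_count(raw):
        """Length in characters, interpreting raw bytes under the context."""
        return len(raw.decode(_current_charset))

    with charset_context("big5"):      # Traditional Chinese (HK, Taiwan)
        print(char_count(b"\xa4\xa4\xa4\xe5"))  # "中文" in Big5: 4 bytes, 2 chars
    with charset_context("gb2312"):    # Simplified Chinese (mainland)
        print(char_count(b"\xd6\xd0\xce\xc4"))  # the same text in GB2312

----- Original Message -----
From: "Guido van Rossum" <guido@python.org>
To: <python-dev@python.org>; <i18n-sig@python.org>
Cc: "Just van Rossum" <just@letterror.com>
Sent: Thursday, April 27, 2000 11:01 PM
Subject: [I18n-sig] Unicode debate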
I'd like to reset this discussion. I don't think we need to involve c.l.py yet -- I haven't seen anyone with Asian language experience chime in there, and that's where this matters most. I am directing this to the Python i18n-sig mailing list, because that's where the debate belongs, and where interested parties can join the discussion without having to be vetted as "fit for python-dev" first.
I apologize for having been less than responsive in the matter; unfortunately there's lots of other stuff on my mind right now that has recently had a tendency to distract me with higher priority crises.
I've heard a few people claim that strings should always be considered to contain "characters" and that there should be one character per string element. I've also heard a clamoring that there should be only one string type. You folks have never used Asian encodings. In countries like Japan, China and Korea, encodings are a fact of life, and the most popular encodings are ASCII supersets that use a variable number of bytes per character, just like UTF-8. Each country or language uses different encodings, even though their characters look mostly the same to western eyes. UTF-8 and Unicode are having a hard time getting adopted in these countries because most software that people use deals only with the local encodings. (Sound familiar?)
These encodings are much less "pure" than UTF-8, because they only encode the local characters (and ASCII), and because of various problems with slicing: if you look "in the middle" of an encoded string or file, you may not know how to interpret the bytes you see. In most of these encodings there are overlaps between the codes used for single-byte and double-byte characters, so you may have to look back one or more characters to know what to make of the particular byte you see. To get an idea of the nightmares that non-UTF-8 multibyte encodings give C/C++ programmers, see the Multibyte Character Set (MBCS) Survival Guide (http://msdn.microsoft.com/library/backgrnd/html/msdn_mbcssg.htm). See also the home page of the i18n-sig for more background information on encoding (and other i18n) issues (http://www.python.org/sigs/i18n-sig/).
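A small illustration of both problems, using real Shift-JIS bytes (the sketch assumes a notation where encoded data is a bytes object with a .decode() method; that notation is illustrative, not a committed API):

    # "日本語" ("Japanese") encoded as Shift-JIS: 6 bytes, 3 characters.
    data = "日本語".encode("shift_jis")   # b'\x93\xfa\x96{\x8c\xea'

    # Overlap: the byte 0x7b is ASCII '{', but here it is the *second*
    # byte of the character 本 -- you cannot classify it without looking
    # back at the byte before it.
    print(data[3:4])                      # b'{'

    # Slicing mid-character leaves bytes you cannot interpret:
    try:
        data[:3].decode("shift_jis")      # splits a character in half
    except UnicodeDecodeError as exc:
        print("undecodable slice:", exc)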
UTF-8 attempts to solve some of these problems: the multi-byte encodings are chosen such that you can tell by the high bits of each byte whether it is (1) a single-byte (ASCII) character (top bit off), (2) the start of a multi-byte character (at least two top bits on; how many indicates the total number of bytes comprising the character), or (3) a continuation byte in a multi-byte character (top bit on, next bit off).
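That rule is mechanical enough to write down in a few lines (a sketch; the helper name is made up):

    # Classifying UTF-8 bytes by their top bits, as described above.
    def utf8_byte_kind(b):
        if b & 0b10000000 == 0:            # top bit off: single-byte ASCII
            return "ascii"
        if b & 0b11000000 == 0b10000000:   # 10xxxxxx: continuation byte
            return "continuation"
        return "start"                     # 11xxxxxx: starts a multi-byte
                                           # char; the count of leading 1s
                                           # gives the total byte count

    for b in "aç中".encode("utf-8"):
        print(hex(b), utf8_byte_kind(b))
    # 0x61 ascii / 0xc3 start / 0xa7 continuation / 0xe4 start / ...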
Many of the problems with non-UTF-8 multibyte encodings are the same as for UTF-8 though: #bytes != #characters, a byte may not be a valid character, regular expression patterns using "." may give the wrong results, and so on.
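For example (the counts are exact; the notation follows the earlier sketches):

    import re

    text = "naïve"                 # 5 characters
    raw = text.encode("utf-8")     # 6 bytes: the ï becomes \xc3\xaf
    print(len(text), len(raw))     # 5 6 -- #bytes != #characters

    # A byte-oriented "." matches single bytes, so it happily splits the
    # two bytes of ï; a character-oriented "." does not:
    print(re.findall(b".", raw))   # 6 one-byte matches, two of them goop
    print(re.findall(".", text))   # 5 character matches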
The truth of the matter is: the encoding of string objects is in the mind of the programmer. When I read a GIF file into a string object, the encoding is "binary goop". When I read a line of Japanese text from a file, the encoding may be JIS, shift-JIS, or EUC -- this has to be an assumption built into my program, or perhaps information supplied separately (there's no easy way to guess based on the actual data). When I type a string literal using Latin-1 characters, the encoding is Latin-1. When I use octal escapes in a string literal, e.g. '\303\247', the encoding could be UTF-8 (this is ç, a c with cedilla). When I type a 7-bit string literal, the encoding is ASCII.
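The same two bytes, decoded under two different assumptions:

    raw = b"\303\247"              # the octal escapes from above
    print(raw.decode("utf-8"))     # 'ç'  -- if you assume UTF-8
    print(raw.decode("latin-1"))   # 'Ã§' -- same bytes, Latin-1 assumed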
The moral of all this? 8-bit strings are not going away. They are not encoded in UTF-8 henceforth. Like before, and like 8-bit text files, they are encoded in whatever encoding you want. All you get is an extra mechanism to convert them to Unicode, and the Unicode conversion defaults to UTF-8 because it is the only conversion that is reversible. And, as Tim Peters quoted Andy Robinson (paraphrasing Tim's paraphrase), UTF-8 annoys everyone equally.
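In the notation of the sketches above, the extra mechanism is just a pair of explicit conversions, and the UTF-8 default round-trips:

    raw = b"\303\247"                  # 8-bit data; its encoding is in your head
    u = raw.decode("utf-8")            # explicit conversion to Unicode
    assert u.encode("utf-8") == raw    # and back again, losslessly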
Where does the current approach require work?
- We need a way to indicate the encoding of Python source code. (Probably a "magic comment".)
- We need a way to indicate the encoding of input and output data files, and we need shortcuts to set the encoding of stdin, stdout and stderr (and maybe all files opened without an explicit encoding). Marc-Andre showed some sample code, but I believe it is still cumbersome. (I have to play with it more to see how it could be improved.)
- We need to discuss whether there should be a way to change the default conversion between Unicode and 8-bit strings (currently hardcoded to UTF-8), in order to make life easier for people who want to continue to use their favorite 8-bit encoding (e.g. Latin-1, or shift-JIS) but who also want to make use of the new Unicode datatype. (A rough sketch of all three hooks follows this list.)
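To make the three bullets concrete, here is a rough sketch. Nothing in it is settled API: the magic-comment syntax, the filename, and the setdefaultencoding name are all made up for illustration; codecs.open is from Marc-Andre's codec work.

    # 1. Source encoding via a "magic comment" at the top of a module,
    #    e.g. along the lines of the Emacs convention:
    #
    #    # -*- coding: shift_jis -*-

    # 2. An encoding attached to a file object at open time, so data is
    #    decoded to Unicode on the way in (hypothetical filename):
    import codecs
    f = codecs.open("nihongo.txt", "r", encoding="shift_jis")
    text = f.read()
    f.close()

    # 3. A process-wide switch for the default 8-bit <-> Unicode
    #    conversion (hypothetical name; the default is currently
    #    hardcoded to UTF-8):
    # sys.setdefaultencoding("latin-1")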
We're still in alpha, so we can still fix things.
--Guido van Rossum (home page: http://www.python.org/~guido/)