
Guido van Rossum wrote:
> ...
> I've heard a few people claim that strings should always be considered to contain "characters" and that there should be one character per string element. I've also heard a clamoring that there should only be one string type. You folks have never used Asian encodings. In countries like Japan, China and Korea, encodings are a fact of life, and the most popular encodings are ASCII supersets that use a variable number of bytes per character, just like UTF-8. Each country or language uses different encodings, even though their characters look mostly the same to western eyes. UTF-8 and Unicode are having a hard time getting adopted in these countries because most software that people use deals only with the local encodings. (Sounds familiar?)
I think that maybe an important point is getting lost here. I could be wrong, but it seems that all of this emphasis on encodings is misplaced. The physical and logical makeup of character strings are entirely separate issues. Unicode is a character set. It works in the logical domain. Dozens of different physical encodings can be used for Unicode characters. There are XML users who work with XML (and thus Unicode) every day and never see UTF-8, UTF-16 or any other Unicode-consortium "sponsored" encoding. If you invent an encoding tomorrow, it can still be XML-compatible. There are many encodings older than Unicode that are XML (and Unicode) compatible. I have not heard complaints about the XML way of looking at the world, and in fact it was explicitly endorsed by many of the world's leading experts on internationalization. I haven't followed the Java situation as closely, but I have also not heard screams about its support for i18n.
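To make the logical/physical split concrete, here is a minimal sketch in present-day Python (purely illustrative; the string and the list of codecs are my own choices, not anything from this thread). One sequence of characters lives in the logical domain, while each codec gives it a different physical byte form, and every form round-trips back to the same characters:

    # One logical string of Unicode characters ("Japanese", three characters).
    text = "日本語"

    # Several physical encodings of the same logical string; the byte
    # sequences and their lengths differ, the characters do not.
    for codec in ("utf-8", "utf-16-le", "euc-jp", "iso-2022-jp"):
        raw = text.encode(codec)
        print(codec, len(raw), raw)
        assert raw.decode(codec) == text   # every codec round-trips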
> The truth of the matter is: the encoding of string objects is in the mind of the programmer. When I read a GIF file into a string object, the encoding is "binary goop".
IMHO, it's a mistake of history that you would even think it makes sense to read a GIF file into a "string" object, and we should be trying to erase that mistake as quickly as possible (which is admittedly not very quickly), not building more and more infrastructure around it. How can we make the transition to a "binary goops are not strings" world easiest?
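For what it's worth, here is a hypothetical sketch (in Python-like form; the file names and the exact open() signature are assumptions of mine, not a description of the current interpreter) of how a world that keeps binary goop and character strings apart might look:

    # Binary goop: raw bytes read from disk, no characters, no encoding implied.
    with open("picture.gif", "rb") as f:            # hypothetical GIF file
        goop = f.read()                              # a byte sequence

    # A character string: bytes decoded through an explicitly named encoding.
    with open("notes.txt", encoding="utf-8") as f:   # hypothetical text file
        notes = f.read()                             # a sequence of characters

    # The two kinds never mix silently; crossing the boundary is an explicit
    # step, e.g. notes.encode("utf-8") to get bytes, or goop.decode(...) only
    # when the programmer really does know the encoding.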
> The moral of all this? 8-bit strings are not going away.
If that is a statement of your long-term vision, then I think that it is very unfortunate. Treating string literals as if they were isomorphic with byte arrays was probably the right thing in 1991, but it won't be in 2005. It doesn't meet the definition of string used in the Unicode spec, nor in XML, nor in Java, nor at the W3C, nor in most other up-and-coming specifications.
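To spell the difference out: the definitions cited above treat a string as a sequence of characters, so its length and indexing are independent of whatever byte encoding happens to carry it. A tiny illustrative sketch (again in modern, hypothetical syntax rather than today's interpreter):

    s = "naïve"               # five characters
    b = s.encode("utf-8")     # six bytes, because "ï" needs two bytes in UTF-8

    print(len(s))             # 5 -- length counted in characters
    print(len(b))             # 6 -- length counted in bytes; not the same thing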