[I18n-sig] Random thoughts on Unicode and Python

Tom Emerson tree@basistech.com
Sat, 10 Feb 2001 16:17:47 -0500

Andy has raised some important and interesting points. I'd like to
chime in with some random thoughts.

> 2. I have been told that there are angry mumblings on the
> Python-Japan mailing list that such a change would break all
> their existing Python programs; I'm trying to set up my tools to
> ask out loud in that forum.

Both Shift-JIS and EUC-JP are 8-bit, multibyte encodings. You can use
them on systems that are 8-bit clean and things "just work". You don't
need to worry about embedded nulls or any other such noise. While you
can't use len() to get the number of *characters* in a
Shift-JIS/EUC-JP encoded string, it does tell you how many octets it
contains, so you can loop over them and compute the character count.
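
As a sketch of that loop, here is how one might count characters in an
EUC-JP byte string by hand (modern Python shown; the lead-byte rule is
simplified and ignores the SS2/SS3 escapes for half-width katakana and
JIS X 0212):

```python
def euc_jp_char_count(data):
    """Count characters in an EUC-JP byte string by stepping over
    lead bytes. Simplified: any byte >= 0x80 is treated as the lead
    of a two-byte character (SS2/SS3 sequences are ignored)."""
    count = 0
    i = 0
    while i < len(data):
        i += 2 if data[i] >= 0x80 else 1
        count += 1
    return count

raw = "日本語abc".encode("euc_jp")
print(len(raw))                # 9 octets
print(euc_jp_char_count(raw))  # 6 characters
```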

In essence the Japanese (and Chinese and Koreans) are using the
existing Python string type as a raw-byte string, and imposing the
semantics over that.

The Ruby string class is a byte string. You can specify how the bytes
are to be treated for operations such as regular expression searches:
it supports EUC-JP, Shift-JIS, UTF-8, or just plain bytes. You can set
the default when you configure the sources, on the command line when
you invoke the interpreter, or (I believe) at runtime.

Ruby also contains a library with a replacement String class for
dealing with EUC-JP and Shift-JIS encoded strings.


The internal representation used for strings is orthogonal to how raw
bytes are interpreted for string operations. This is what
Emacs 20 does: in essence it uses ISO 2022 internally to allow
characters from multiple character sets to be represented.
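
In today's Python the same separation shows up in the bytes/str split:
the external encoding is just a parameter of decode/encode, while
string operations work on the internal representation. A minimal
sketch (modern Python codecs shown):

```python
# Decode Shift-JIS bytes to Python's internal string form, then
# re-encode to EUC-JP: the internal representation is independent
# of either external encoding.
raw_sjis = "日本語".encode("shift_jis")
internal = raw_sjis.decode("shift_jis")
raw_euc = internal.encode("euc_jp")

print(len(internal))                # 3 characters, whatever the external encoding
print(len(raw_sjis), len(raw_euc))  # 6 and 6 octets
```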


The interpretation of strings and the interpretation of bytes in a
source file are different things: Dylan, for example, supports Unicode
and byte strings, but the language definition requires identifiers and
keywords to be in the US-ASCII range. Java, on the other hand,
specifies Unicode as the language's character set: source files can
even be encoded in UTF-8, allowing identifiers to be in the user's
language. IMHO either is fine. Note that if the language allows
identifiers to include 8-bit characters then users can already use
identifiers in their local language: it "just works".
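
As it happens, later Python (3.x, via PEP 3131) adopted the Java-style
answer and allows non-ASCII identifiers; a small illustration (the
English glosses in the comments are mine):

```python
# Python 3 permits identifiers outside US-ASCII (PEP 3131),
# so local-language names "just work":
値段 = 100           # "price"
def 合計(数量):      # "total(quantity)"
    return 値段 * 数量

print(合計(3))       # 300
```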


Japanese and Chinese arguments against Unicode are often ideological:
"It doesn't contain all of the characters we need." Of course they
forget to mention that the character sets in regular use in these
locales, JIS X 0201-1990, JIS X 0212-1990, GB 2312-80, and Big Five,
are all represented in Unicode. The same is true for Korean: all of
the hanja in KS C 5601 et al. are available in Unicode, as are the
precomposed han'gul.
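
One can check the coverage claim mechanically: round-tripping text
from these legacy encodings through Unicode should be lossless (modern
Python codecs shown; the sample strings are illustrative choices of
mine):

```python
# Round-trip legacy-encoded text through Unicode: if the legacy
# repertoire is fully represented in Unicode, nothing is lost.
samples = [("日本語", "shift_jis"),
           ("日本語", "euc_jp"),
           ("中文", "gb2312"),
           ("中文", "big5"),
           ("한국어", "euc_kr")]
for text, enc in samples:
    assert text.encode(enc).decode(enc) == text
print("all round trips lossless")
```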

Tom Emerson                                          Basis Technology Corp.
Stringologist                                      http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"