[I18n-sig] Re: [Python-Dev] Re: [XML-SIG] Python 1.6a2 Unicode experiences?

Tom Emerson tree@basistech.com
Fri, 28 Apr 2000 06:56:50 -0400 (EDT)

M.-A. Lemburg writes:
 > > > Unicode has many encodings: Shift-JIS, Big-5, EBCDIC ... You can use
 > > > 8-bit encodings of Unicode if you want.

This is meaningless: legacy encodings of national character
sets such as Shift-JIS, Big Five, GB2312, or TIS620 are not "encodings"
of Unicode.

TIS620 is a single-byte, 8-bit encoding: each character is
represented by a single byte. The Japanese and Chinese encodings are
multibyte, 8-bit encodings. ISO-2022 is a multibyte, 7-bit encoding
for multiple character sets.
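
To make the distinction concrete, here is a small sketch in a current
Python 3 (not the 1.6a2 under discussion). It assumes the standard
codecs named tis_620, shift_jis, big5, gb2312 and iso2022_jp are
available; the sample characters and the helper name encoded_length
are my own choices for illustration.

```python
# How many bytes does one character occupy in each legacy encoding?
samples = [
    ("tis_620", "\u0e01"),    # THAI CHARACTER KO KAI
    ("shift_jis", "\u3042"),  # HIRAGANA LETTER A
    ("big5", "\u4e2d"),       # CJK ideograph U+4E2D
    ("gb2312", "\u4e2d"),     # the same ideograph in the PRC set
]

def encoded_length(codec, char):
    """Return the number of bytes used to encode one character."""
    return len(char.encode(codec))

for codec, char in samples:
    print(codec, encoded_length(codec, char))

# ISO-2022-JP switches character sets with escape sequences and keeps
# every byte in the 7-bit range.
jis = "\u3042".encode("iso2022_jp")
all_seven_bit = all(b < 0x80 for b in jis)
print(jis, all_seven_bit)
```

ASCII characters still take one byte in Shift-JIS, Big Five and
GB2312, which is why these are multibyte rather than fixed-width.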

Unicode has several possible encodings: UTF-8, UCS-2, UCS-4,
UTF-16... You can view all of these as 8-bit encodings, if you
like. Some are multibyte (such as UTF-8, where each character in
the BMP is represented in 1 to 3 bytes) while others, such as UCS-2
and UCS-4, are fixed length, two or four bytes per character.
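
A rough way to see the difference, again as a sketch in a current
Python 3: the codec names utf-8, utf-16-be and utf-32-be are standard,
while the helper encoded_lengths and the sample characters are only
illustrative. Note that a character outside the BMP takes 4 bytes in
UTF-8 and a surrogate pair (also 4 bytes) in UTF-16.

```python
def encoded_lengths(char):
    """Map each Unicode encoding to the byte count for one character."""
    codecs = ("utf-8", "utf-16-be", "utf-32-be")
    return {codec: len(char.encode(codec)) for codec in codecs}

# ASCII, Latin-1 range, CJK ideograph in the BMP, and one supplementary
# character (U+10348) that lies outside the BMP.
for char in ("A", "\u00e9", "\u4e2d", "\U00010348"):
    print(f"U+{ord(char):04X}", encoded_lengths(char))
```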

 > > Um, if you go:
 > > 
 > >     JIS -> Unicode -> JIS
 > > 
 > > you don't get the same thing out that you put in (at least this is
 > > what I've been told by a lot of Japanese developers), and therefore
 > > it's not terribly popular because of the nature of the Japanese (and
 > > Chinese) language.

This is simply not true any more. The ability to round trip between
Unicode and legacy encodings is dependent on the software: being able
to use code points in the PUA for this is acceptable and commonly

The big advantage is in using Unicode as a pivot when transcoding
between different CJK encodings. It is very difficult to map between,
say, Shift JIS and GB2312, directly. However, Unicode provides a good

It isn't a panacea: transcoding between legacy encodings like GB2312
and Big Five is still difficult, Unicode or not.
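
Both points can be sketched with a current Python 3. The function
name transcode and the sample characters are my own; the codecs
shift_jis, gb2312 and big5 are the standard ones. The first call works
because both characters exist in each repertoire. The second fails
because the simplified form U+8BF4 is in GB2312 but has no code point
in Big Five, so the pivot alone cannot help: you need a
simplified-to-traditional mapping on top of it.

```python
def transcode(data, source, target):
    """Decode bytes from `source` into Unicode, then encode to `target`."""
    return data.decode(source).encode(target)

# Shift JIS -> Unicode -> GB2312 for characters both sets share.
sjis = "\u65e5\u672c".encode("shift_jis")
gb = transcode(sjis, "shift_jis", "gb2312")
print(gb.decode("gb2312") == "\u65e5\u672c")

# GB2312 -> Unicode -> Big Five for a simplified-only character.
simplified = "\u8bf4".encode("gb2312")
try:
    transcode(simplified, "gb2312", "big5")
    pivot_failed = False
except UnicodeEncodeError:
    # Big Five has only the traditional form, so encoding fails.
    pivot_failed = True
print(pivot_failed)
```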

 > > My experience with Unicode is that a lot of Western people think it's
 > > the answer to every problem asked, while most Asian language people
 > > disagree vehemently.  This says the problem isn't solved yet, even if
 > > people wish to deny it.

This is a shame: it is an indication that they don't understand the
technology. Unicode is a tool: nothing more.


Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"