[I18n-sig] Re: Unicode debate
Thu, 27 Apr 2000 16:50:53 -0700
> Christopher Petrilli [petrilli@amber.org] wrote:
>> Guido van Rossum [firstname.lastname@example.org] wrote:
>> I've heard a few people claim that strings should always be considered
>> to contain "characters" and that there should be one character per
>> string element. I've also heard a clamoring that there should only be
>> one string type. You folks have never used Asian encodings. In
>> countries like Japan, China and Korea, encodings are a fact of life,
>> and the most popular encodings are ASCII supersets that use a variable
>> number of bytes per character, just like UTF-8. Each country or
>> language uses different encodings, even though their characters look
>> mostly the same to western eyes. UTF-8 and Unicode are having a hard
>> time getting adopted in these countries because most software that
>> people use deals only with the local encodings. (Sounds familiar?)
> Actually a bigger concern that we hear from our customers in Japan is
> that Unicode has *serious* problems in Asian languages. They took
> the "unification" of Chinese and Japanese, rather than both, and
> therefore cannot represent lots of phrases quite right. I can have
> someone write up a better description, but I was told by several
> Japanese people that they wouldn't use Unicode come hell or high
> water, basically.
Yeah, not all of the East Asian ideographs are available in Unicode atm. :(
Currently there are two pending extensions to the unified CJK ideographs.
Extension A is slated as part of the BMP. 0x0000 - 0xAAFF in Plane 2 is
currently slated for use by Extension B.
BMP Roadmap: http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n2213.pdf
Plane 2 Roadmap: http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n2215.pdf
On top of which there is this serious problem of end user defined
characters in a number of these MBCS encodings.
Win32 OSs handle mapping these characters into Unicode in the following way:
In the Win32 registry at:
There exist several REG_SZ registry values. The names of the values are
MBCS code pages.
The values are source ranges in the codepage's code space.
These ranges get mapped into Unicode code space starting at U+E000 (the
beginning of the BMP private use area).
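To make the mapping concrete, here is a rough Python sketch of the idea: walk the source ranges in order and hand out consecutive private use code points starting at U+E000. This is only an illustration, not what Windows actually runs. The function name build_eudc_map, the Shift-JIS style trail-byte layout (0x40-0x7E and 0x80-0xFC), and the sample range 0xF040-0xF9FC are assumptions made for the example, not values read from a real registry.

```python
PUA_START = 0xE000  # beginning of the BMP private use area
PUA_END = 0xF8FF    # end of the BMP private use area

# Assumption: Shift-JIS style trail bytes, 0x40-0x7E and 0x80-0xFC.
TRAIL_BYTES = [b for b in range(0x40, 0xFD) if b != 0x7F]


def build_eudc_map(ranges, trail_bytes=TRAIL_BYTES):
    """Map every double-byte code inside `ranges` to consecutive PUA code points.

    `ranges` is a list of (start, end) pairs in the code page's code space,
    both ends inclusive. This helper is hypothetical and only illustrates
    the sequential assignment described above.
    """
    mapping = {}
    next_cp = PUA_START
    for start, end in ranges:
        for lead in range(start >> 8, (end >> 8) + 1):
            for trail in trail_bytes:
                code = (lead << 8) | trail
                if start <= code <= end:
                    if next_cp > PUA_END:
                        raise ValueError("private use area exhausted")
                    mapping[code] = next_cp
                    next_cp += 1
    return mapping


# Illustrative range only (assumed, not taken from an actual registry value).
eudc = build_eudc_map([(0xF040, 0xF9FC)])
print(hex(eudc[0xF040]))  # first user-defined code lands on the first PUA code point
print(len(eudc))          # number of user-defined slots covered by the range
```

With that sample range there are 10 lead bytes times 188 trail bytes, so the user-defined characters occupy only the first part of the private use area.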
> Basically it's JIS, Shift-JIS or nothing for most Japanese
> companies. This was my experience working with Konica a few years ago
> as well.
Don't forget the new JIS X 0213. :)