[I18n-sig] Unicode surrogates: just say no!

Tom Emerson tree@basistech.com
Tue, 26 Jun 2001 12:40:48 -0400

Guido van Rossum writes:
> > UTF-8 can be used to encode each half of a surrogate pair
> > (resulting in six bytes for the character) --- a proposal for this was
> > presented by PeopleSoft at the UTC meeting last month. UTF-8 can also
> > encode the code-point directly in four bytes.
> But isn't the direct encoding highly preferable?  When would you ever
> want your UTF-8 to be encoded UTF-16?

Amen. The proposal also gave reasons related to sort order, which I'm
not clear on, since I didn't pay much attention to non-Asian issues.
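
To make the two byte layouts discussed above concrete, here is a minimal
Python sketch. The helper names (utf8_direct, utf8_via_surrogates) and
the sample code point U+10400 are illustrative choices, not anything
from the proposal itself. The last lines show one sort-order property
of the six-byte form (it sorts bytewise like UTF-16 code units); that
this is the reasoning presented at the UTC meeting is an assumption.

```python
def utf8_direct(cp):
    """Standard UTF-8: a supplementary code point takes four bytes."""
    return chr(cp).encode("utf-8", "surrogatepass")

def surrogate_pair(cp):
    """Split a code point above U+FFFF into its UTF-16 surrogate halves."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def utf8_via_surrogates(cp):
    """Hypothetical helper: encode each surrogate half as its own
    three-byte sequence, giving six bytes per supplementary character."""
    if cp < 0x10000:
        return utf8_direct(cp)
    return b"".join(
        chr(half).encode("utf-8", "surrogatepass")
        for half in surrogate_pair(cp)
    )

cp = 0x10400
direct = utf8_direct(cp)          # b'\xf0\x90\x90\x80'         (4 bytes)
via = utf8_via_surrogates(cp)     # b'\xed\xa0\x81\xed\xb0\x80' (6 bytes)

# Bytewise ordering of U+FFFD against U+10400 under each scheme.
points = [0xFFFD, 0x10400]
order_direct = sorted(points, key=utf8_direct)
order_via = sorted(points, key=utf8_via_surrogates)
order_utf16 = sorted(points, key=lambda c: chr(c).encode("utf-16-be"))
```

With direct UTF-8, U+FFFD sorts before U+10400 (code point order); with
the six-byte form and with UTF-16 code units, U+10400 sorts first.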

> > Remember too that glibc uses UCS-4 as its internal wchar_t
> > representation. This was also discussed at the Li18nux meetings a
> > couple of years ago.
> But I don't think there are many Linux applications that use wchar_t
> extensively yet.  At least I haven't seen any.  (Does anyone know if
> Mozilla's Asian character support uses wchar_t or Unicode?)

I don't have statistics on this, but I don't think it matters much: I
doubt Linux application developers are avoiding wchar_t because it is
4 bytes wide.

I merely point to glibc as an example where a conscious decision was
made to go with a 4-byte wide character type in order to allow for
easy future growth without being constrained by alternate
transformation formats of Unicode. Ulrich Drepper made the right
choice, which was supported by the Li18nux group, which includes the
Linux vendors as well as IBM and Basis.
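
As a rough illustration of what a 4-byte wide character buys, the
Python sketch below reports the platform's wchar_t size via ctypes and
counts code units for a string containing one character outside the
BMP. The sample string and the code_units helper are illustrative
assumptions; the reported size depends on the platform (4 with glibc,
2 on some other systems).

```python
import ctypes

# Size of the C wchar_t type on this platform, in bytes.
WCHAR_BYTES = ctypes.sizeof(ctypes.c_wchar)

def code_units(text, encoding):
    """Count fixed-width code units needed to store text."""
    width = {"utf-16-le": 2, "utf-32-le": 4}[encoding]
    return len(text.encode(encoding)) // width

sample = "A\U00010400"                       # two characters, one above U+FFFF
units16 = code_units(sample, "utf-16-le")    # 3: 'A' plus a surrogate pair
units32 = code_units(sample, "utf-32-le")    # 2: one unit per character
```

With 4-byte units the count equals the number of characters, so
indexing needs no surrogate handling; with 2-byte units it does not.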

Tom Emerson                                          Basis Technology Corp.
Sr. Sinostringologist                              http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"