Python - Next Release Questions

François Pinard pinard at iro.umontreal.ca
Tue Mar 28 21:02:55 EST 2000


"Dennis E. Hamilton" <infonuovo at email.com> writes:

> The adoption of 16-bit Unicode (or any duet-string character codes or
> whatever term we use for that) is not all that universal just yet,
> and there are all those font issues and locale issues to deal with.

The worst problem to come, in my opinion, is the battle between precomposed
characters and dynamically combined characters.  I guess that not many
applications said to support Unicode actually process the 32-, 48- and 64-bit
sequences that result when a 16-bit base character needs one to three
diacritical marks.  Unicode and W3C, in particular, are trying to halt the
trend of precomposing characters, which penalizes those languages not lucky
enough to have made it in so far.  This would not be very acceptable for such
languages.  (Imagine, for example, the nightmare that Vietnamese Unicode
processing would be without precomposed characters.)
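
To make the precomposed versus combined distinction concrete, here is a
minimal sketch in present-day Python 3 (not the Python discussed in this
thread), using the standard unicodedata module.  The sample letter is chosen
only for illustration.

```python
import unicodedata

# Sample Vietnamese letter, chosen for illustration:
# U+1EC7 LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
precomposed = "\u1ec7"

# NFD splits it into a base letter followed by its combining marks.
decomposed = unicodedata.normalize("NFD", precomposed)

# NFC puts the pieces back together into the precomposed form.
recomposed = unicodedata.normalize("NFC", decomposed)

# Size of each form in 16-bit units, as an application using
# 16-bit characters would see it.
units_precomposed = len(precomposed.encode("utf-16-be")) // 2
units_decomposed = len(decomposed.encode("utf-16-be")) // 2

print([f"U+{ord(c):04X}" for c in decomposed])
print(units_precomposed, units_decomposed)
print(recomposed == precomposed)
```

The decomposed form is one base letter plus two marks, which is the 48-bit
case mentioned above when each code is stored in 16 bits.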

> Figuring out automatic down-shifting from Unicode to octet-string is going
> to be messy and I like a solution, in this transition period, that lets
> applications deal with it.

At first glance, the incoming Python is almost nice about it, as coercing
a Unicode string to a single-byte string produces its UTF-8 representation.
I have not really tried a Unicode application in Python yet, though, so from
my viewpoint a lot remains to be seen about usability.  I'm really eager to
try it on a real project at the next opportunity! :-)
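
As a rough illustration of what a UTF-8 representation looks like, here is
a sketch in present-day Python 3, where the conversion is an explicit
encode() call rather than a coercion.  The sample string is an arbitrary
choice.

```python
# Sample text with one non-ASCII letter (c with cedilla, U+00E7).
text = "François"

# Explicit conversion from a Unicode string to a byte string.
data = text.encode("utf-8")

# Decoding the bytes gives the original string back.
round_trip = data.decode("utf-8")

print(len(text))   # number of characters
print(len(data))   # number of bytes; the cedilla letter takes two
print(round_trip == text)
```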

> And this is just the beginning.  I'm told that the next move down the
> road is to quartet-string character codes, and I think that's pretty
> painful all the way around.

Moreover, keep in mind that going UCS-4 internally also means 64-, 96- and
128-bit characters (a base character plus one to three combining marks)
for many languages not yet covered by Unicode.

Just to add more spice to the picture, also consider that 16-bit Unicode
(strictly speaking UTF-16 rather than plain UCS-2) has surrogate areas, which
combine two 16-bit codes to represent one new character, and that character
might later, in theory, be combined dynamically with diacritics.  Altogether,
that's a lot of variability in the length of internal characters.  And UTF-8
adds yet another layer of variability as an external coding...
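
The variability described here can be measured directly.  The following is
a small sketch in present-day Python 3; the sample characters and the
sizes() helper are illustrative choices, not anything from the Python under
discussion.

```python
# Illustrative samples covering the cases discussed above.
samples = {
    "ASCII letter": "e",
    "precomposed letter": "\u00e9",           # e with acute accent
    "base + combining mark": "e\u0301",       # e, then combining acute
    "outside the 16-bit range": "\U0001d11e", # needs a surrogate pair
}

def sizes(s):
    """Return (code points, 16-bit units in UTF-16, bytes in UTF-8)."""
    return (
        len(s),
        len(s.encode("utf-16-be")) // 2,
        len(s.encode("utf-8")),
    )

for label, s in samples.items():
    print(label, sizes(s))
```

Each sample displays as a single character, yet the three counts differ
from one row to the next.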

> The Java world has taken a purist approach to this.

Oh, they _all_ take a purist approach, but rarely the same one.  Religious
wars ahead :-).  Fun to come, my friends, a lot of fun to come!

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard





