[Python-Dev] Support for "wide" Unicode characters

Guido van Rossum guido@digicool.com
Sun, 01 Jul 2001 09:44:29 -0400


> <PEP: 261>
> 
>    The problem I have with this PEP is that it is a compile time option
> which makes it hard to work with both 32 bit and 16 bit strings in one
> program. Can not the 32 bit string type be introduced as an additional type?

Not without an outrageous amount of additional coding (every place in
the code that currently uses PyUnicode_Check() would have to be
bifurcated into a 16-bit and a 32-bit variant).

I doubt that the desire to work with both 16- and 32-bit characters in
one program is typical for folks using Unicode -- that's mostly
limited to folks writing conversion tools.  Python will offer the
necessary codecs so you shouldn't have this need very often.

You can use the array module to manipulate 16- and 32-bit arrays, and
you can use the various Unicode codecs to perform the necessary
conversions.
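
As a rough illustration of that combination, here is a minimal sketch
in present-day Python 3 syntax (not the 2.x of this thread).  The
variable names are made up for the example, and it assumes the array
typecodes "H" and "I" are 2 and 4 bytes wide on your platform, which
is common but not guaranteed.

```python
import array
import sys

# Pick the UTF-16/UTF-32 codec variants that match this machine's byte
# order, because array.frombytes() interprets bytes in native order.
suffix = "le" if sys.byteorder == "little" else "be"
utf16_codec = "utf-16-" + suffix
utf32_codec = "utf-32-" + suffix

# One BMP character followed by one character outside the BMP.
text = "A\U00010000"

# Assumption: "H" is 2 bytes and "I" is 4 bytes on this platform.
units16 = array.array("H")
units32 = array.array("I")
assert units16.itemsize == 2 and units32.itemsize == 4

# Fill each array with the code units produced by the matching codec.
units16.frombytes(text.encode(utf16_codec))
units32.frombytes(text.encode(utf32_codec))

# Convert the 16-bit array to a 32-bit array by going through the codecs.
as_text = units16.tobytes().decode(utf16_codec)
converted = array.array("I")
converted.frombytes(as_text.encode(utf32_codec))
```

Here units16 ends up with three code units (the non-BMP character
becomes a surrogate pair) while units32 and converted hold two.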

> > u[i] is a character. If u is Unicode, then u[i] is a Python Unicode
> > character.
> 
>    This wasn't usefully true in the past for DBCS strings and is not the
> right way to think of either narrow or wide strings now. The idea that
> strings are arrays of characters gets in the way of dealing with many
> encodings and is the primary difficulty in localising software for Japanese.

Can you explain the kind of problems encountered in some more detail?

> Iteration through the code units in a string is a problem waiting to bite
> you and string APIs should encourage behaviour which is correct when faced
> with variable width characters, both DBCS and UTF style.

But this is not the Unicode philosophy.  All the variable-length
character manipulation is supposed to be taken care of by the codecs,
and then the application can deal in arrays of characters.
Alternatively, the application can deal in opaque objects representing
variable-length encodings, but then it should be very careful with
concatenation and even more so with slicing.
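
To make the slicing hazard concrete, here is a small sketch, again in
present-day Python 3 syntax, with UTF-8 standing in for any
variable-length encoding.  The helper decodes_cleanly() is a
hypothetical name invented for this example.

```python
def decodes_cleanly(data, encoding="utf-8"):
    """Return True if data is a complete, valid byte sequence in encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# "naive" with a diaeresis on the i; that character takes two bytes in UTF-8.
text = "na\u00efve"
encoded = text.encode("utf-8")

# Slicing the opaque encoded form by byte offset cuts the third
# character in half.
head_bytes = encoded[:3]

# Decoding first and then slicing the array of characters keeps every
# character whole.
head_chars = encoded.decode("utf-8")[:3]
```

The byte slice no longer decodes, while the slice taken after decoding
is the expected three-character prefix.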

> Iteration over
> variable width characters should be performed in a way that preserves the
> integrity of the characters. M.-A. Lemburg's proposed set of iterators could
> be extended to indicate encoding "for c in s.asCharacters('utf-8')" and to
> provide for the various intended string uses such as "for c in
> s.inVisualOrder()" reversing the receipt of right-to-left substrings.

I think it's a good idea to provide a set of higher-level tools as
well.  However, nobody seems to know what these higher-level tools
should do yet.  PEP 261 is specifically focused on getting the
lower-level foundations right (i.e. the objects that represent arrays
of code units), so that the authors of higher-level tools will have a
solid base.  If you want to help author a PEP for such higher-level
tools, you're welcome!

--Guido van Rossum (home page: http://www.python.org/~guido/)