[Python-Dev] Support for "wide" Unicode characters

Neil Hodgson nhodgson@bigpond.net.au
Sun, 1 Jul 2001 23:00:15 +1000


Paul Prescod:
<PEP: 261>

   The problem I have with this PEP is that it makes the character width a
compile-time option, which makes it hard to work with both 32-bit and 16-bit
strings in one program. Could the 32-bit string type not be introduced as an
additional type?

> Are we going to change chr() and unichr() to one_element_string() and
> unicode_one_element_string()
>
> u[i] is a character. If u is Unicode, then u[i] is a Python Unicode
> character.

   This wasn't usefully true in the past for DBCS strings and is not the
right way to think of either narrow or wide strings now. The idea that
strings are arrays of characters gets in the way of dealing with many
encodings and is the primary difficulty in localising software for Japanese.
Iterating through the code units of a string is a problem waiting to bite
you, and string APIs should encourage behaviour which is correct when faced
with variable width characters, both DBCS and UTF style. Iteration over
variable width characters should be performed in a way that preserves the
integrity of the characters. M.-A. Lemburg's proposed set of iterators could
be extended to indicate the encoding, as in "for c in
s.asCharacters('utf-8')", and to provide for the various intended string
uses, such as "for c in s.inVisualOrder()", which reverses the order in which
right-to-left substrings are received.
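
   To make the code unit versus character distinction concrete, here is a
minimal sketch of such an iterator. It assumes a present-day Python 3, and
as_characters() is a hypothetical stand-in for the s.asCharacters() idea
above, not part of PEP 261 or of M.-A. Lemburg's proposal. It leans on the
standard codecs incremental decoders, which hold back partial input until a
whole character is available.

```python
import codecs


def as_characters(data, encoding):
    """Yield whole characters from encoded bytes, never a partial one.

    Hypothetical helper for illustration only; the name and signature are
    assumptions, not an existing or proposed Python API.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    for i in range(len(data)):
        # Feed one byte at a time. The decoder returns '' until it has
        # assembled a complete character.
        for ch in decoder.decode(data[i:i + 1]):
            yield ch
    # Flush the decoder; this raises UnicodeDecodeError if the input
    # stopped in the middle of a character.
    for ch in decoder.decode(b"", final=True):
        yield ch


text = "a\u65e5\u672c"            # 'a' followed by two Japanese characters
utf8 = text.encode("utf-8")       # 7 bytes: 1 + 3 + 3
sjis = text.encode("shift_jis")   # 5 bytes: 1 + 2 + 2 (a DBCS encoding)

# Walking the raw code units visits 7 and 5 items respectively,
# most of which are fragments of a character.
print(len(utf8), len(sjis))

# Walking by character gives the same three items for both encodings.
print(list(as_characters(utf8, "utf-8")) == list(text))
print(list(as_characters(sjis, "shift_jis")) == list(text))

# The same applies to 16-bit strings: one character outside the BMP
# occupies two UTF-16 code units (a surrogate pair).
wide = "\U00010400".encode("utf-16-le")
print(len(wide) // 2, len(list(as_characters(wide, "utf-16-le"))))
```

   The caller never sees half of a multi-byte character or half of a
surrogate pair, whichever encoding sits underneath.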

   Neil