[I18n-sig] Support for "wide" Unicode characters

M.-A. Lemburg mal@lemburg.com
Thu, 28 Jun 2001 11:27:35 +0200

Paul Prescod wrote:
> Round 2: I can't check in right now but I'll collect another round of
> suggestions and then post this to other lists tomorrow.

Here you go...

> ----
> PEP: 261
> Title: Support for "wide" Unicode characters
> Version: $Revision: 1.2 $
> Author: paulp@activestate.com (Paul Prescod)
> Status: Draft
> Type: Standards Track
> Created: 27-Jun-2001
> Python-Version: 2.2
> Post-History: 27-Jun-2001
> Abstract
>     Python 2.1 unicode characters can have ordinals only up to 65535.
>     These characters are known as Basic Multilingual Plane characters.
>     There are now characters in Unicode that live on other "planes".
>     The largest addressable character in Unicode has the ordinal
>     17 * 2**16 - 1. For readability, we will call this TOPCHAR.

I would add hex notation for those who are more familiar with
hex; the Unicode standard itself uses hex to pinpoint code points.
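For instance, a quick check in today's Python spelling TOPCHAR and the
BMP ceiling in hex:

```python
# TOPCHAR as the PEP defines it: the largest Unicode code point.
TOPCHAR = 17 * 2**16 - 1
print(hex(TOPCHAR))    # 0x10ffff, i.e. the familiar U+10FFFF
print(hex(2**16 - 1))  # 0xffff, the top of the Basic Multilingual Plane
```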

Also, a suggestion: I think to avoid all the problems of understanding
the different terms in this PEP, I'd do two things:

1. add a Glossary (copying from the Unicode glossary)
2. use the standard Unicode terms throughout the PEP (code points,
   code units, etc.)

The reason is that otherwise you'll get confusion about what
you mean by noncharacter characters ;-)
> Proposed Solution
>     One solution would be to merely increase the maximum ordinal to a
>     larger value.  Unfortunately the only straightforward
>     implementation of this idea is to increase the character code unit
>     to 4 bytes.  This has the effect of doubling the size of most
>     Unicode strings.  In order to avoid imposing this cost on every
>     user, Python 2.2 will allow 4-byte Unicode characters as a
>     build-time option.
>     The 4-byte option is called "wide Py_UNICODE".  The 2-byte option
>     is called "narrow Py_UNICODE".
>     Most things will behave identically in the wide and narrow worlds.
>     * the \u and \U literal syntaxes will always generate the same
>       data that the unichr function would.  They are just different
>       syntaxes for the same thing.
>     * unichr(i) for 0 <= i < 2**16 always returns a size-one string.
>     * unichr(i) for 2**16 <= i <= TOPCHAR will always return a
>       string representing the character.


If the platform does not support the character in question,
then this should raise a ValueError instead of returning anything
with len() > 1.

Reasoning: u[i] in Python should always refer to a code
point *and* code unit in the Unicode sense. If this is not
possible, raise an exception.
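A sketch of the behaviour I'm suggesting, written with today's Python 3
names (chr and sys.maxunicode) purely for illustration; proposed_unichr
is a hypothetical helper, not the real implementation:

```python
import sys

def proposed_unichr(i):
    # Suggested behaviour: never return a surrogate pair; raise
    # ValueError when the build cannot represent the code point
    # as a single code unit.
    if not 0 <= i <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %#x" % i)
    if i > sys.maxunicode:
        raise ValueError("code point not representable on this build")
    return chr(i)  # always a length-one string
```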
>     * BUT on narrow builds of Python, the string will actually be
>       composed of two characters (in the Python, not Unicode sense)
>       called a "surrogate pair". These two Python characters are
>       logically one Unicode character.
>         ISSUE: Should Python return surrogate pairs on narrow builds
>                or should it just disallow them?
>         ISSUE: Should the upper bound of the domain of unichr and
>                range of ord() be TOPCHAR or 2**32-1 or even 2**31?

-1. See above.
>     * ord() will now accept surrogate pairs and return the ordinal of
>       the "wide" character.
>         ISSUE: Should Python accept surrogate pairs on wide
>                Python builds?

-1. Have the codecs do the business of dealing with surrogates and
ord() return the code point ordinal (isolated surrogates are 
code points as well; they are not Unicode characters though).
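For reference, the standard UTF-16 arithmetic that maps a surrogate pair
back to a single code point (a sketch in today's Python, not tied to any
particular build):

```python
def decode_surrogate_pair(high, low):
    # A high surrogate (U+D800..U+DBFF) and a low surrogate
    # (U+DC00..U+DFFF) together encode one code point in the
    # range 0x10000..0x10FFFF.
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(hex(decode_surrogate_pair(0xD800, 0xDC00)))  # 0x10000
print(hex(decode_surrogate_pair(0xDBFF, 0xDFFF)))  # 0x10ffff
```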
>     * There is an integer value in the sys module that describes the
>       largest ordinal for a Unicode character on the current
>       interpreter. sys.maxunicode is 2**16-1 on narrow builds of
>       Python.
>         ISSUE: Should sys.maxunicode be TOPCHAR or 2**32-1 or even
>                2**31 on wide builds?
>         ISSUE: Should there be distinct constants for accessing
>                TOPCHAR and the real upper bound for the domain of
>                unichr?

Hmm, not sure. 

Wouldn't it be better to simply add an attribute
sys.unicodewidth == 'narrow' | 'wide' ? This would leave out all
the complicated issues and redirect people to this PEP.
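Something along these lines (a sketch; sys.unicodewidth is only the
proposed name, here derived from sys.maxunicode for illustration):

```python
import sys

# Proposed: one attribute that tells you which kind of build this is.
unicodewidth = 'wide' if sys.maxunicode > 0xFFFF else 'narrow'
print(unicodewidth)
```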

>     * Note that ord() can in some cases return ordinals higher than
>       sys.maxunicode because it accepts surrogate pairs on narrow
>       Python builds.

>     * codecs will be upgraded to support "wide characters"
>       (represented directly in UCS-4, as surrogate pairs in UTF-16 and
>       as multi-byte sequences in UTF-8). On narrow Python builds, the
>       codecs will generate surrogate pairs, on wide Python builds they
>       will generate a single character. This is the main part of the
>       implementation left to be done.

+1. This is how surrogates should be treated: in the codecs !
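Today's Python shows what the three representations look like for the
first character outside the BMP (spelled with modern codec names; the
PEP-era codecs would behave analogously):

```python
ch = '\U00010000'  # first code point beyond the BMP
print(ch.encode('utf-8').hex())      # f0908080: a 4-byte UTF-8 sequence
print(ch.encode('utf-16-be').hex())  # d800dc00: a surrogate pair
print(ch.encode('utf-32-be').hex())  # 00010000: the raw UCS-4 value
```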
>     * there are no restrictions on constructing strings that use
>       code points "reserved for surrogates" improperly. These are
>       called "lone surrogates".

Better call them "isolated surrogates"; that's the term Mark
Davis used and he should know.

>       The codecs should disallow reading
>       these but you could construct them using string literals or
>       unichr(). unichr() is not restricted to values less than either
>       TOPCHAR or sys.maxunicode.
>         ISSUE: Should lone surrogates be allowed as input to ord even
>                on wide platforms where they "should" not occur?

Yes, see above. Isolated surrogates are true code points.
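For what it's worth, today's Python ended up behaving this way: chr()
happily builds an isolated surrogate, but the codecs refuse to write it:

```python
lone = chr(0xD800)  # an isolated surrogate: a code point, not a character
try:
    lone.encode('utf-8')
except UnicodeEncodeError as exc:
    print("codec refuses the isolated surrogate:", exc.reason)
```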
> Implementation
>     There is a new (experimental) define:
>         #define PY_UNICODE_SIZE 2

Doesn't sizeof(Py_UNICODE) do the same ?
>     There are new configure options:
>         --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses
>                               wchar_t if it fits
>         --enable-unicode=ucs4 configures a wide Py_UNICODE likewise

With "likewise" meaning: "and uses wchar_t if it fits" !

>         --enable-unicode      same as "=ucs2"
>     The intention is that --disable-unicode, or --enable-unicode=no
>     removes the Unicode type altogether; this is not yet implemented.

Let's add the UCS-2/UCS-4 stuff first and only then think
about adding the removal #ifdefs.
> Notes
>     Note that len(unichr(i))==2 for i>=2**16 on narrow machines
>     because of the returned surrogates.

-1. See above.
>     This means (for example) that the following code is not portable:
>     x = 2**16
>     if unichr(x) in somestring:
>         ...
>     In general, you should be careful using "in" if the character that
>     is searched for could have been generated from unichr applied to a
>     number greater than 2**16 or from a string literal greater than
>     2**16.
>     This PEP does NOT imply that people using Unicode need to use a
>     4-byte encoding.  It only allows them to do so.  For example,
>     ASCII is still a legitimate (7-bit) Unicode-encoding.
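The portability caveat above is easy to see by counting UTF-16 code
units, which is what len() would report on a narrow build (a sketch in
today's Python, which behaves like a wide build):

```python
ch = '\U00010000'
# Number of UTF-16 code units -- what a narrow build would store.
narrow_len = len(ch.encode('utf-16-be')) // 2
print(narrow_len)  # 2: the character occupies a surrogate pair
print(len(ch))     # 1 on a wide build
```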
> Rationale for Surrogate Creation Behaviour
>     Python currently supports the construction of a surrogate pair
>     for a large unicode literal character escape sequence. This is
>     basically designed as a simple way to construct "wide characters"
>     even in a narrow Python build.
>         ISSUE: surrogates can be created this way but the user still
>                needs to be careful about slicing, indexing, printing
>                etc. Another option is to remove knowledge of
>                surrogates from everything other than the codecs.

Side note: 

Python uses the unicode-escape codec for interpreting
the Unicode literals. This means that narrow builds will also
support the full range of UCS-4 -- using surrogates if needed.

This introduces an incompatibility between narrow and wide
builds at run-time. PYC files should not be harmed by this since
they store Unicode strings using UTF-8.
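The unicode-escape point is easy to check in today's Python -- the \U
escape always yields the full code point, whatever the build:

```python
# The unicode-escape codec interprets \U escape sequences itself,
# so even source code for a narrow build can spell all of UCS-4.
s = b'\\U00010000'.decode('unicode-escape')
print(len(s), hex(ord(s)))  # length 1 on a wide build
```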
> Rejected Suggestions
>     There were two primary solutions that were rejected. The first was
>     more or less the status-quo. We could officially say that UTF-16
>     is the Python character encoding and require programmers to
>     implement wide characters in their application logic. This is a
>     heavy burden because emulating 32-bit characters is likely to be
>     very inefficient if it is coded entirely in Python.
>     The other solution is to use UTF-16 (or even UTF-8) internally
>     (for efficiency) but present an abstraction of 32-bit characters
>     to the programmer. This would require a much more complex
>     implementation than the accepted solution. In theory, we could
>     move to this implementation in the future without breaking Python
>     code. It would just emulate a wide Python build on narrow
>     Pythons.
> Copyright
>     This document has been placed in the public domain.
> Local Variables:
> mode: indented-text
> indent-tabs-mode: nil
> End:

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/