[I18n-sig] Support for "wide" Unicode characters

Paul Prescod paulp@ActiveState.com
Wed, 27 Jun 2001 22:25:12 -0700


Round 2: I can't check in right now but I'll collect another round of
suggestions and then post this to other lists tomorrow.
----
PEP: 261
Title: Support for "wide" Unicode characters
Version: $Revision: 1.2 $
Author: paulp@activestate.com (Paul Prescod)
Status: Draft
Type: Standards Track
Created: 27-Jun-2001
Python-Version: 2.2
Post-History: 27-Jun-2001


Abstract

    Python 2.1 Unicode characters can have ordinals only up to 65535.
    These characters are known as Basic Multilingual Plane characters.
    There are now characters in Unicode that live on other "planes".
    The largest addressable character in Unicode has the ordinal
    17 * 2**16 - 1. For readability, we will call this TOPCHAR.
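
    As a quick sanity check (not part of the proposal itself), the
    arithmetic above works out to the familiar hexadecimal value for
    the top of the Unicode code space:

    ```python
    # TOPCHAR: the largest Unicode code point. Unicode spans 17
    # "planes" of 2**16 code points each, so the largest ordinal is
    # 17 * 2**16 - 1.
    TOPCHAR = 17 * 2**16 - 1
    print(hex(TOPCHAR))  # 0x10ffff
    ```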


Proposed Solution

    One solution would be to merely increase the maximum ordinal to a
    larger value.  Unfortunately the only straightforward
    implementation of this idea is to increase the character code unit
    to 4 bytes.  This has the effect of doubling the size of most
    Unicode strings.  In order to avoid imposing this cost on every
    user, Python 2.2 will allow 4-byte Unicode characters as a
    build-time option.

    The 4-byte option is called "wide Py_UNICODE".  The 2-byte option
    is called "narrow Py_UNICODE".

    Most things will behave identically in the wide and narrow worlds.

    * the \u and \U literal syntaxes will always generate the same
      data that the unichr function would.  They are just different
      syntaxes for the same thing.

    * unichr(i) for 0 <= i < 2**16 always returns a size-one string.

    * unichr(i) for 2**16 <= i <= TOPCHAR will always return a
      string representing the character.

    * BUT on narrow builds of Python, the string will actually be
      composed of two characters (in the Python, not Unicode sense) 
      called a "surrogate pair". These two Python characters are
      logically one Unicode character. 

        ISSUE: Should Python return surrogate pairs on narrow builds
               or should it just disallow them?

        ISSUE: Should the upper bound of the domain of unichr and
               range of ord() be TOPCHAR or 2**32-1 or even 2**31?

    * ord() will now accept surrogate pairs and return the ordinal of
      the "wide" character.  

        ISSUE: Should Python accept surrogate pairs on wide 
               Python builds?

    * There is an integer value in the sys module that describes the
      largest ordinal for a Unicode character on the current
      interpreter. sys.maxunicode is 2**16-1 on narrow builds of
      Python.  

        ISSUE: Should sys.maxunicode be TOPCHAR or 2**32-1 or even
               2**31 on wide builds?

        ISSUE: Should there be distinct constants for accessing
               TOPCHAR and the real upper bound for the domain of 
               unichr?

    * Note that ord() can in some cases return ordinals higher than
      sys.maxunicode because it accepts surrogate pairs on narrow
      Python builds. 

    * codecs will be upgraded to support "wide characters"
      (represented directly in UCS-4, as surrogate pairs in UTF-16 and
      as multi-byte sequences in UTF-8). On narrow Python builds, the
      codecs will generate surrogate pairs, on wide Python builds they
      will generate a single character. This is the main part of the 
      implementation left to be done.

    * there are no restrictions on constructing strings that use 
      code points "reserved for surrogates" improperly. These are
      called "lone surrogates". The codecs should disallow reading
      these but you could construct them using string literals or
      unichr(). unichr() is not restricted to values less than
      TOPCHAR or sys.maxunicode.

        ISSUE: Should lone surrogates be allowed as input to ord even
               on wide platforms where they "should" not occur?
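
    The relationship between a wide ordinal and the surrogate pair a
    narrow build would store can be sketched as follows. These helper
    names are hypothetical, for illustration only; they mirror what
    unichr() and ord() would do with above-BMP values on a narrow
    build:

    ```python
    def make_surrogate_pair(ordinal):
        # Split an ordinal above the BMP (>= 0x10000) into the high
        # and low surrogate code units a narrow build would store.
        assert 0x10000 <= ordinal <= 0x10FFFF
        v = ordinal - 0x10000
        high = 0xD800 | (v >> 10)    # high (leading) surrogate
        low = 0xDC00 | (v & 0x3FF)   # low (trailing) surrogate
        return high, low

    def join_surrogate_pair(high, low):
        # Inverse: the ordinal that ord() on a two-character
        # surrogate string would compute on a narrow build.
        assert 0xD800 <= high <= 0xDBFF
        assert 0xDC00 <= low <= 0xDFFF
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    ```

    For example, make_surrogate_pair(0x10000) gives (0xD800, 0xDC00),
    the smallest legal pair, and the two functions are exact inverses
    over the above-BMP range.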


Implementation


    There is a new (experimental) define:

        #define PY_UNICODE_SIZE 2

    There are new configure options:

        --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses
                              wchar_t if it fits
        --enable-unicode=ucs4 configures a wide Py_UNICODE likewise
        --enable-unicode      same as "=ucs2"

    The intention is that --disable-unicode, or --enable-unicode=no
    removes the Unicode type altogether; this is not yet implemented.
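
    A program can check which option its interpreter was built with
    at runtime via sys.maxunicode, as described above (a minimal
    sketch; the exact value on wide builds is still an open issue in
    this PEP):

    ```python
    import sys

    # sys.maxunicode reflects the build-time option: 2**16 - 1
    # (0xFFFF) on a narrow (ucs2) build, and on a wide (ucs4) build
    # a larger value such as TOPCHAR (0x10FFFF).
    build = "narrow" if sys.maxunicode == 0xFFFF else "wide"
    ```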


Notes

    Note that len(unichr(i))==2 for i>=2**16 on narrow builds
    because of the returned surrogates.

    This means (for example) that the following code is not portable:

        x = 2**16
        if unichr(x) in somestring:
            ...

    In general, you should be careful using "in" if the character
    that is searched for could have been generated from unichr
    applied to a number greater than or equal to 2**16, or from a
    string literal containing an escape sequence for such a
    character.
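
    One way to make such a membership test independent of the build
    is to encode both sides to UTF-8 and search at the byte level;
    UTF-8 never embeds one character's encoding inside another's, so
    a byte-substring hit is a real character hit. This helper is a
    hypothetical sketch, not part of the proposal (the try/except
    shim makes it run on modern Pythons where unichr became chr):

    ```python
    try:
        unichr
    except NameError:   # Python 3: unichr was merged into chr
        unichr = chr

    def contains_char(haystack, ordinal):
        # Build-independent membership test: compare UTF-8 bytes so
        # the result does not depend on whether the interpreter
        # stores the character as one code unit or a surrogate pair.
        needle = unichr(ordinal).encode("utf-8")
        return needle in haystack.encode("utf-8")
    ```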

    This PEP does NOT imply that people using Unicode need to use a
    4-byte encoding.  It only allows them to do so.  For example,
    ASCII is still a legitimate (7-bit) Unicode encoding.

Rationale for Surrogate Creation Behaviour

    Python currently supports the construction of a surrogate pair
    for a large Unicode literal character escape sequence. This is
    basically designed as a simple way to construct "wide characters"
    even in a narrow Python build.

        ISSUE: surrogates can be created this way but the user still 
               needs to be careful about slicing, indexing, printing 
               etc. Another option is to remove knowledge of
               surrogates from everything other than the codecs.

Rejected Suggestions

    There were two primary solutions that were rejected. The first was
    more or less the status-quo. We could officially say that UTF-16
    is the Python character encoding and require programmers to
    implement wide characters in their application logic. This is a
    heavy burden because emulating 32-bit characters is likely to be
    very inefficient if it is coded entirely in Python.

    The other solution is to use UTF-16 (or even UTF-8) internally
    (for efficiency) but present an abstraction of 32-bit characters
    to the programmer. This would require a much more complex
    implementation than the accepted solution. In theory, we could
    move to this implementation in the future without breaking Python
    code. It would just emulate a wide Python build on narrow
    Pythons.


Copyright

    This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: