[I18n-sig] Python Support for "Wide" Unicode characters

Paul Prescod paulp@ActiveState.com
Wed, 27 Jun 2001 15:54:48 -0700


PEP: 261
Title: Python Support for "Wide" Unicode characters
Version: 1.0
Author: paulp@activestate.com (Paul Prescod)
Status: Draft
Type: Standards Track
Python-Version: 2.2
Created: 27-Jun-2001
Post-History: 27-Jun-2001

Abstract

    Python 2.1 unicode characters can have ordinals only up to
    2**16 - 1. Characters in this range are known as Basic
    Multilingual Plane characters.
    There are now characters in Unicode that live on other "planes".
    The largest addressable character in Unicode has the ordinal
    2**20 + 2**16 - 1. For readability, we will call this TOPCHAR.
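
    As a quick check of the arithmetic above, the following sketch
    computes TOPCHAR and the number of 2**16-sized planes it implies.
    PLANE_SIZE and plane_count are names chosen here for
    illustration; they are not part of this proposal.

```python
# TOPCHAR as defined in the abstract: 2**16 ordinals for the Basic
# Multilingual Plane plus 2**20 ordinals beyond it, minus one because
# ordinals start at 0.
TOPCHAR = 2**20 + 2**16 - 1

# Illustrative names, not part of the proposal.
PLANE_SIZE = 2**16
plane_count = (TOPCHAR + 1) // PLANE_SIZE

print(hex(TOPCHAR))   # 0x10ffff
print(plane_count)    # 17
```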

Proposed Solution

    One solution would be to merely increase the maximum ordinal to a
    larger value. Unfortunately the only straightforward implementation
    of this idea is to increase the character code unit to 4 bytes. This
    has the effect of doubling the size of most Unicode strings. In
    order to avoid imposing this cost on every user, Python 2.2 will
    allow 4-byte Unicode characters as a build-time option.


    The 4-byte option is called "wide Py_UNICODE". The 2-byte option
    is called "narrow Py_UNICODE".

    Most things will behave identically in the wide and narrow worlds.

    * the \u  and \U literal syntaxes will always generate the same
      data that the unichr function would. They are just different
      syntaxes for the same thing.

    * unichr(i) for 0 <= i < 2**16 always returns a size-one string.

    * unichr(i) for 2**16 <= i <= TOPCHAR will always
      return a string representing the character.

    * BUT on narrow builds of Python, the string will actually be
      composed of two characters called a "surrogate pair".

    * ord() will now accept surrogate pairs and return the ordinal of
      the "wide" character. Open question: should it accept surrogate
      pairs on wide Python builds?

    * There is an integer value in the sys module that describes the
      largest ordinal for a Unicode character on the current
      interpreter. sys.maxunicode is 2**16-1 on narrow builds of
      Python. On wide builds it could be either TOPCHAR
      or 2**32-1. That's an open question.

    * Note that ord() can in some cases return ordinals
      higher than sys.maxunicode because it accepts surrogate pairs
      on narrow Python builds.

    * codecs will be upgraded to support "wide characters". On narrow
      Python builds, the codecs will generate surrogate pairs; on
      wide Python builds they will generate a single character.

    * new codecs will be written for 4-byte Unicode and older codecs
      will be updated to recognize surrogates and map them to wide
      characters on wide Pythons.

    * there are no restrictions on constructing strings that use 
      code points "reserved for surrogates" improperly. These are
      called "lone surrogates". The codecs should disallow reading
      these but you could construct them using string literals or
      unichr().
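
    The surrogate pair arithmetic that the points above rely on can
    be sketched with plain integers. This is only an illustration
    of the standard UTF-16 mapping, not the implementation proposed
    here; to_surrogate_pair and from_surrogate_pair are hypothetical
    helper names.

```python
HIGH_START, HIGH_END = 0xD800, 0xDBFF   # high (leading) surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF     # low (trailing) surrogates
TOPCHAR = 2**20 + 2**16 - 1


def to_surrogate_pair(ordinal):
    """Split an ordinal in 2**16..TOPCHAR into (high, low) code units.

    Hypothetical helper: roughly what unichr() would have to produce
    on a narrow build.
    """
    if not 0x10000 <= ordinal <= TOPCHAR:
        raise ValueError("ordinal outside surrogate pair range: %#x" % ordinal)
    offset = ordinal - 0x10000            # a 20-bit value
    high = HIGH_START + (offset >> 10)    # top 10 bits
    low = LOW_START + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) pair back into one ordinal.

    Hypothetical helper: roughly what ord() would have to compute
    when handed a surrogate pair.
    """
    if not (HIGH_START <= high <= HIGH_END and LOW_START <= low <= LOW_END):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


print([hex(u) for u in to_surrogate_pair(0x10000)])   # ['0xd800', '0xdc00']
print(hex(from_surrogate_pair(0xDBFF, 0xDFFF)))       # 0x10ffff
```

    The largest possible pair decodes to exactly TOPCHAR, which is
    why larger ordinals cannot be represented on a narrow build.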

Implementation

    There is a new (experimental) define in Include/unicodeobject.h:

        #undef USE_UCS4_STORAGE

    If it is defined, Py_UNICODE is set to the same type as Py_UCS4.

        USE_UCS4_STORAGE

    There are new configure options:

        --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses
                              wchar_t if it fits
        --enable-unicode=ucs4 configures a wide Py_UNICODE likewise
        --enable-unicode      configures Py_UNICODE to wchar_t if
                              available, and to UCS-4 if not; this is
                              the default

    The intention is that --disable-unicode, or --enable-unicode=no
    removes the Unicode type altogether; this is not yet implemented.
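
    Whichever option the interpreter was built with, a program can
    tell the two worlds apart at run time through sys.maxunicode. A
    minimal sketch, assuming only that narrow builds report
    2**16-1 as described above; is_wide_build is a hypothetical
    helper name, not a proposed API.

```python
import sys


def is_wide_build():
    """Return True if this interpreter stores wide Unicode characters.

    Hypothetical helper. Narrow builds report 2**16-1; a wide build
    reports something larger whichever way the open question about
    its exact value is settled.
    """
    return sys.maxunicode > 0xFFFF


print(sys.maxunicode, is_wide_build())
```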

Notes

    Note that len(unichr(i))==2 for i>=0x10000 on narrow builds.

    This means (for example) that the following code is not portable:

        x = 0x10000
        if unichr(x) in somestring:
            ...

    In general, you should be careful using "in" if the character
    that is searched for could have been generated from unichr applied
    to a number of 0x10000 or greater, or from a string literal that
    contains an escape for an ordinal of 0x10000 or greater.
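
    The length difference can be made visible even without a narrow
    build. The sketch below is an illustration only and assumes
    Python 3.3 or later (not the Python 2.2 this PEP targets), where
    every build behaves like a wide build and string literals are
    Unicode. utf16_units and contains are hypothetical helper names.
    Searching with find(), which takes a substring of any length, is
    one possible way to avoid depending on the length of the needle;
    it is a suggestion, not part of this proposal.

```python
def utf16_units(text):
    """Return the 16-bit code units a narrow build would store for text."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


def contains(haystack, needle):
    """Substring test that does not care how long the needle is."""
    return haystack.find(needle) != -1


wide = "\U00010000"        # one character outside the BMP
units = utf16_units(wide)

print(len(wide))                     # 1: wide representation
print(len(units))                    # 2: surrogate pair, as on a narrow build
print([hex(u) for u in units])       # ['0xd800', '0xdc00']
print(contains("a" + wide + "b", wide))   # True
```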

    This PEP does NOT imply that people using Unicode need to use a
    4-byte encoding. It only allows them to do so. For example, ASCII
    is still a legitimate (7-bit) Unicode-encoding.

Open Questions

    "Code points" above TOPCHAR cannot be expressed in two 16-bit
    characters. These are not assigned to Unicode characters and 
    supposedly will never be. Should we allow them to be passed as 
    arguments to unichr() anyhow? We could allow knowledgeable
    programmers to use these "unused" characters for whatever
    they want, though Unicode does not address them.

    "Lone surrogates" "should not" occur on wide platforms. Should
    ord() still accept them?
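
    To make the term concrete, here is a minimal sketch of how lone
    surrogates could be located in a sequence of 16-bit code units.
    It works on plain integers and is not tied to any codec or to
    the behaviour of ord(); lone_surrogate_indexes is a hypothetical
    helper name.

```python
def lone_surrogate_indexes(units):
    """Return positions of surrogates that are not part of a valid pair."""
    lone = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2          # valid pair: skip both halves
                continue
            lone.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no high surrogate before it.
            lone.append(i)
        i += 1
    return lone


print(lone_surrogate_indexes([0x41, 0xD800, 0xDC00, 0x42]))  # []
print(lone_surrogate_indexes([0xD800, 0x41, 0xDC00]))        # [0, 2]
```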