PEP: Support for "wide" Unicode characters

Thu Jun 28 18:33:00 EDT 2001

PEP: 261
Title: Support for "wide" Unicode characters
Version: $Revision: 1.3 $
Author: paulp at activestate.com (Paul Prescod)
Status: Draft
Type: Standards Track
Created: 27-Jun-2001
Python-Version: 2.2
Post-History: 27-Jun-2001, 28-Jun-2001


Abstract

    Python 2.1 unicode characters can have ordinals only up to
    2**16 - 1.  These characters are known as Basic Multilingual Plane
    characters.
    There are now characters in Unicode that live on other "planes".
    The largest addressable character in Unicode has the ordinal 17 *
    2**16 - 1 (0x10ffff). For readability, we will call this TOPCHAR
    and call characters with ordinals from 2**16 through TOPCHAR
    "wide characters".
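
    The arithmetic above can be checked directly. The following is a
    minimal sketch for a current Python 3 interpreter; the names
    PLANES, BMP_MAX and is_wide are illustrative only and are not
    defined by Python, and is_wide assumes that "wide characters"
    means ordinals above the Basic Multilingual Plane.

```python
# Unicode has 17 planes of 2**16 code points each.
PLANES = 17
PLANE_SIZE = 2 ** 16

BMP_MAX = PLANE_SIZE - 1            # largest Basic Multilingual Plane ordinal
TOPCHAR = PLANES * PLANE_SIZE - 1   # largest addressable Unicode ordinal


def is_wide(ordinal):
    """Return True if ordinal lies above the BMP but not above TOPCHAR."""
    return BMP_MAX < ordinal <= TOPCHAR


print(hex(BMP_MAX), hex(TOPCHAR))
```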

Glossary

    Character 
        
        Used by itself, means the addressable units of a Python 
        Unicode string.

    Code point

        If you imagine Unicode as a mapping from integers to
        characters, each integer represents a code point. Some are
        really used for characters. Some will someday be used for
        characters. Some are guaranteed never to be used for
        characters.

    Unicode character 

        A code point defined in the Unicode standard whether it is
        already assigned or not. Identified by an integer.

    Code unit

        An integer representing a character in some encoding.

    Surrogate pair

        Two code units that represent a single Unicode character.
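
        As an illustration, the UTF-16 surrogate pair for a code
        point above 0xffff can be computed as sketched below. The
        function names are hypothetical and the code is written for
        a current Python 3 interpreter; it is not part of this
        proposal.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # carries the top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # carries the bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print([hex(unit) for unit in to_surrogate_pair(0x10000)])
```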

Proposed Solution

    One solution would be to merely increase the maximum ordinal to a
    larger value.  Unfortunately the only straightforward
    implementation of this idea is to increase the character code unit
    to 4 bytes.  This has the effect of doubling the size of most
    Unicode strings.  In order to avoid imposing this cost on every
    user, Python 2.2 will allow 4-byte Unicode characters as a
    build-time option. Users can choose whether they care about
    wide characters or prefer to preserve memory.

    The 4-byte option is called "wide Py_UNICODE".  The 2-byte option
    is called "narrow Py_UNICODE".
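
    The storage cost can be approximated by comparing encoded sizes.
    This sketch, written for a current Python 3 interpreter, uses the
    UTF-16 and UTF-32 codecs as stand-ins for narrow and wide
    internal storage; that substitution is an assumption made for
    illustration, and it ignores per-object overhead.

```python
text = "Unicode strings"

# UTF-16-LE uses one 2-byte code unit per BMP character, as a narrow
# build would; UTF-32-LE uses one 4-byte code unit per character, as a
# wide build would.
narrow_bytes = len(text.encode("utf-16-le"))
wide_bytes = len(text.encode("utf-32-le"))

print(narrow_bytes, wide_bytes)
```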

    Most things will behave identically in the wide and narrow worlds.

    * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
      length-one string.

    * unichr(i) for 2**16 <= i <= TOPCHAR will return a
      length-one string representing the character on wide Python
      builds. On narrow builds it will raise ValueError.

        ISSUE: Python currently allows \U literals that cannot be
               represented as a single character. It generates two
               characters known as a "surrogate pair". Should this be
               disallowed on future narrow Python builds?

        ISSUE: Should Python allow the construction of characters
               that do not correspond to Unicode characters?
               Unassigned Unicode characters should obviously be legal
               (because they could be assigned at any time). But
               code points above TOPCHAR are guaranteed never to 
               be used by Unicode. Should we allow access to them 
               anyhow?

    * ord() is always the inverse of unichr()

    * There is an integer value in the sys module that describes the
      largest ordinal for a Unicode character on the current
      interpreter. sys.maxunicode is 2**16-1 (0xffff) on narrow builds
      of Python and TOPCHAR on wide builds.

        ISSUE: Should there be distinct constants for accessing
               TOPCHAR and the real upper bound for the domain of 
               unichr (if they differ)? There has also been a
               suggestion of sys.unicodewidth which can take the
               values 'wide' and 'narrow'.

    * codecs will be upgraded to support "wide characters"
      (represented directly in UCS-4, as surrogate pairs in UTF-16 and
      as multi-byte sequences in UTF-8). On narrow Python builds, the
      codecs will generate surrogate pairs; on wide Python builds they
      will generate a single character. This is the main part of the 
      implementation left to be done.

    * There are no restrictions on constructing strings that use
      code points "reserved for surrogates" improperly. These are
      called "isolated surrogates". The codecs should disallow reading
      these, but you could construct them using string literals or
      unichr(). unichr() is not restricted to values less than either
      TOPCHAR or sys.maxunicode.
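
    The three codec representations of a wide character, and the
    treatment of an isolated surrogate, can be sketched as follows.
    The example assumes a current Python 3 interpreter, where chr()
    plays the role of unichr() and the codecs already handle wide
    characters; it illustrates the intended behaviour rather than the
    Python 2.2 implementation.

```python
import struct

wide_char = "\U00010400"  # a character outside the BMP

# UCS-4 (UTF-32): one 4-byte code unit holds the ordinal directly.
ucs4 = wide_char.encode("utf-32-be")
(ucs4_unit,) = struct.unpack(">I", ucs4)

# UTF-16: two 2-byte code units forming a surrogate pair.
utf16 = wide_char.encode("utf-16-be")
high, low = struct.unpack(">HH", utf16)

# UTF-8: a multi-byte sequence (four bytes for this character).
utf8 = wide_char.encode("utf-8")

# An isolated surrogate can be constructed, but the strict UTF-8 codec
# refuses to encode it.
isolated = chr(0xD800)
try:
    isolated.encode("utf-8")
    rejected = False
except UnicodeEncodeError:
    rejected = True

print(hex(ucs4_unit), hex(high), hex(low), len(utf8), rejected)
```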


Implementation


    There is a new (experimental) define:

        #define PY_UNICODE_SIZE 2

    There are new configure options:

        --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses
                              wchar_t if it fits
        --enable-unicode=ucs4 configures a wide Py_UNICODE, and uses
                              wchar_t if it fits
        --enable-unicode      same as "=ucs2"

    The intention is that --disable-unicode, or --enable-unicode=no
    removes the Unicode type altogether; this is not yet implemented.
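
    A script could check which of these configurations it is running
    under by inspecting sys.maxunicode. The helper below is a
    hypothetical sketch, not an API provided by Python. Note that
    current Python 3 interpreters always report 0x10ffff, so the
    narrow branch is exercised only by passing the value explicitly.

```python
import sys


def describe_build(maxunicode):
    """Classify a build from its sys.maxunicode value."""
    if maxunicode == 0xFFFF:
        return "narrow"
    if maxunicode == 0x10FFFF:
        return "wide"
    return "unknown"


print(describe_build(sys.maxunicode))
```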


Notes

    This PEP does NOT imply that people using Unicode need to use a
    4-byte encoding.  It only allows them to do so.  For example,
    ASCII is still a legitimate (7-bit) Unicode encoding.


Rationale for Surrogate Creation Behaviour

    Python currently supports the construction of a surrogate pair
    from a \U literal escape sequence whose ordinal exceeds 0xffff.
    This is basically designed as a simple way to construct "wide
    characters" even in a narrow Python build.

        ISSUE: surrogates can be created this way but the user still 
               needs to be careful about slicing, indexing, printing 
               etc. Another option is to remove knowledge of
               surrogates from everything other than the codecs.
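
    The slicing hazard can be illustrated by modelling a narrow
    string as a list of UTF-16 code units. This is a sketch for a
    current Python 3 interpreter; utf16_units is a hypothetical
    helper, and the list stands in for the addressable units of a
    narrow-build string.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units a narrow build would store for text."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


units = utf16_units("a\U00010400b")

# Three Unicode characters occupy four addressable units.
print(len(units))

# Slicing between the two halves of the pair leaves an isolated
# surrogate at the end of the result.
head = units[:2]
print([hex(u) for u in head])
```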


Rejected Suggestions

    There were two primary solutions that were rejected. The first was
    more or less the status quo. We could officially say that Python
    characters represent UTF-16 code units and require programmers to
    implement wide characters in their application logic. This is a
    heavy burden because emulating 32-bit characters is likely to be
    very inefficient if it is coded entirely in Python. Plus these
    abstracted pseudo-strings would not be legal as input to the
    regular expression engine.

    The other class of solution is to use some efficient storage
    internally but present an abstraction of wide characters
    to the programmer. Any of these would require a much more complex
    implementation than the accepted solution. For instance, consider
    the impact on the regular expression engine. In theory, we could
    move to this implementation in the future without breaking Python 
    code. A future Python could "emulate" wide Python semantics on 
    narrow Python.
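
    To give a sense of what emulating wide characters over UTF-16
    code units involves, here is a minimal sketch for a current
    Python 3 interpreter. The function name and the list-of-integers
    representation are assumptions made for illustration; neither
    rejected design was specified at this level of detail.

```python
def iter_code_points(units):
    """Yield Unicode ordinals from a sequence of UTF-16 code units.

    A high surrogate followed by a low surrogate is combined into one
    ordinal; any other unit, including an isolated surrogate, is
    yielded unchanged.
    """
    i = 0
    while i < len(units):
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1


narrow = [0x61, 0xD801, 0xDC00, 0x62]
print([hex(cp) for cp in iter_code_points(narrow)])
```

    Finding the n-th character this way requires scanning from the
    start of the sequence, which is one source of the inefficiency
    and complexity described above.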


Copyright

    This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: