[Python-Dev] PEP 261, Rev 1.3 - Support for "wide" Unicode characters

Mon, 02 Jul 2001 12:13:59 +0200

Paul Prescod wrote:
> 
> PEP: 261
> Title: Support for "wide" Unicode characters
> Version: $Revision: 1.3 $
> Author: paulp@activestate.com (Paul Prescod)
> Status: Draft
> Type: Standards Track
> Created: 27-Jun-2001
> Python-Version: 2.2
> Post-History: 27-Jun-2001
> 
> Abstract
> 
>     Python 2.1 unicode characters can have ordinals only up to 2**16
> -1.
>     This range corresponds to a range in Unicode known as the Basic
>     Multilingual Plane. There are now characters in Unicode that live
>     on other "planes". The largest addressable character in Unicode
>     has the ordinal 17 * 2**16 - 1 (0x10ffff). For readability, we
>     will call this TOPCHAR and call characters in this range "wide
>     characters".
> 
> Glossary
> 
>     Character
> 
>         Used by itself, means the addressable units of a Python
>         Unicode string.

Please add: also known as "code unit".

>     Code point
> 
>         A code point is an integer between 0 and TOPCHAR.
>         If you imagine Unicode as a mapping from integers to
>         characters, each integer is a code point. But the
>         integers between 0 and TOPCHAR that do not map to
>         characters are also code points. Some will someday
>         be used for characters. Some are guaranteed never
>         to be used for characters.
> 
>     Codec
> 
>         A set of functions for translating between physical
>         encodings (e.g. on disk or coming in from a network)
>         into logical Python objects.
> 
>     Encoding
> 
>         Mechanism for representing abstract characters in terms of
>         physical bits and bytes. Encodings allow us to store
>         Unicode characters on disk and transmit them over networks
>         in a manner that is compatible with other Unicode software.
> 
>     Surrogate pair
> 
>         Two physical characters that represent a single logical

Eeek... two code units (or have you ever seen a physical character
walking around ;-)

>         character. Part of a convention for representing 32-bit
>         code points in terms of two 16-bit code points.
> 
>     Unicode string
> 
>           A Python type representing a sequence of code points with
>           "string semantics" (e.g. case conversions, regular
>           expression compatibility, etc.) Constructed with the
>           unicode() function.
> 
> Proposed Solution
> 
>     One solution would be to merely increase the maximum ordinal
>     to a larger value. Unfortunately the only straightforward
>     implementation of this idea is to use 4 bytes per character.
>     This has the effect of doubling the size of most Unicode
>     strings. In order to avoid imposing this cost on every
>     user, Python 2.2 will allow the 4-byte implementation as a
>     build-time option. Users can choose whether they care about
>     wide characters or prefer to preserve memory.
> 
>     The 4-byte option is called "wide Py_UNICODE". The 2-byte option
>     is called "narrow Py_UNICODE".
> 
>     Most things will behave identically in the wide and narrow worlds.
> 
>     * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
>       length-one string.
> 
>     * unichr(i) for 2**16 <= i <= TOPCHAR will return a
>       length-one string on wide Python builds. On narrow builds it will
>       raise ValueError.
> 
>         ISSUE
> 
>             Python currently allows \U literals that cannot be
>             represented as a single Python character. It generates two
>             Python characters known as a "surrogate pair". Should this
>             be disallowed on future narrow Python builds?
> 
>         Pro:
> 
>             Python already the construction of a surrogate pair
>             for a large unicode literal character escape sequence.
>             This is basically designed as a simple way to construct
>             "wide characters" even in a narrow Python build. It is also
>             somewhat logical considering that the Unicode-literal syntax
>             is basically a short-form way of invoking the unicode-escape
>             codec.
> 
>         Con:
> 
>             Surrogates could be easily created this way but the user
>             still needs to be careful about slicing, indexing, printing
>             etc. Therefore some have suggested that Unicode
>             literals should not support surrogates.
> 
>         ISSUE
> 
>             Should Python allow the construction of characters that do
>             not correspond to Unicode code points?  Unassigned Unicode
>             code points should obviously be legal (because they could
>             be assigned at any time). But code points above TOPCHAR are
>             guaranteed never to be used by Unicode. Should we allow
> access
>             to them anyhow?
> 
>         Pro:
> 
>             If a Python user thinks they know what they're doing why
>             should we try to prevent them from violating the Unicode
>             spec? After all, we don't stop 8-bit strings from
>             containing non-ASCII characters.
> 
>         Con:
> 
>             Codecs and other Unicode-consuming code will have to be
>             careful of these characters which are disallowed by the
>             Unicode specification.
> 
>     * ord() is always the inverse of unichr()
> 
>     * There is an integer value in the sys module that describes the
>       largest ordinal for a character in a Unicode string on the current
>       interpreter. sys.maxunicode is 2**16-1 (0xffff) on narrow builds
>       of Python and TOPCHAR on wide builds.
> 
>         ISSUE: Should there be distinct constants for accessing
>                TOPCHAR and the real upper bound for the domain of
>                unichr (if they differ)? There has also been a
>                suggestion of sys.unicodewidth which can take the
>                values 'wide' and 'narrow'.
> 
>     * every Python Unicode character represents exactly one Unicode code
>       point (i.e. Python Unicode Character = Abstract Unicode
> character).
> 
>     * codecs will be upgraded to support "wide characters"
>       (represented directly in UCS-4, and as variable-length sequences
>       in UTF-8 and UTF-16). This is the main part of the implementation
>       left to be done.
> 
>     * There is a convention in the Unicode world for encoding a 32-bit
>       code point in terms of two 16-bit code points. These are known
>       as "surrogate pairs". Python's codecs will adopt this convention
>       and encode 32-bit code points as surrogate pairs on narrow Python
>       builds.
> 
>         ISSUE
> 
>             Should there be a way to tell codecs not to generate
>             surrogates and instead treat wide characters as
>             errors?
> 
>         Pro:
> 
>             I might want to write code that works only with
>             fixed-width characters and does not have to worry about
>             surrogates.
> 
>         Con:
> 
>             No clear proposal of how to communicate this to codecs.

No need to pass this information to the codec: simply write
a new one and give it a clear name, e.g. "ucs-2" will generate
errors while "utf-16-le" converts them to surrogates.

>     * there are no restrictions on constructing strings that use
>       code points "reserved for surrogates" improperly. These are
>       called "isolated surrogates". The codecs should disallow reading
>       these from files, but you could construct them using string
>       literals or unichr().
> 
> Implementation
> 
>     There is a new (experimental) define:
> 
>         #define PY_UNICODE_SIZE 2
> 
>     There is a new configure option:
> 
>         --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses
>                               wchar_t if it fits
>         --enable-unicode=ucs4 configures a wide Py_UNICODE, and uses
>                               whchar_t if it fits
>         --enable-unicode      same as "=ucs2"
> 
>     The intention is that --disable-unicode, or --enable-unicode=no
>     removes the Unicode type altogether; this is not yet implemented.
> 
>     It is also proposed that one day --enable-unicode will just
>     default to the width of your platforms wchar_t.
> 
>     Windows builds will be narrow for a while based on the fact that
>     there have been few requests for wide characters, those requests
>     are mostly from hard-core programmers with the ability to buy
>     their own Python and Windows itself is strongly biased towards
>     16-bit characters.
> 
> Notes
> 
>     This PEP does NOT imply that people using Unicode need to use a
>     4-byte encoding for their files on disk or sent over the network.
>     It only allows them to do so. For example, ASCII is still a
>     legitimate (7-bit) Unicode-encoding.
> 
>     It has been proposed that there should be a module that handles
>     surrogates in narrow Python builds for programmers. If someone
>     wants to implement that, it will be another PEP. It might also be
>     combined with features that allow other kinds of character-,
>     word- and line- based indexing.
> 
> Rejected Suggestions
> 
>     More or less the status-quo
> 
>         We could officially say that Python characters are 16-bit and
>         require programmers to implement wide characters in their
>         application logic by combining surrogate pairs. This is a heavy
>         burden because emulating 32-bit characters is likely to be
>         very inefficient if it is coded entirely in Python. Plus these
>         abstracted pseudo-strings would not be legal as input to the
>         regular expression engine.
> 
>     "Space-efficient Unicode" type
> 
>         Another class of solution is to use some efficient storage
>         internally but present an abstraction of wide characters to
>         the programmer. Any of these would require a much more complex
>         implementation than the accepted solution. For instance consider
>         the impact on the regular expression engine. In theory, we could
>         move to this implementation in the future without breaking
> Python
>         code. A future Python could "emulate" wide Python semantics on
>         narrow Python. Guido is not willing to undertake the
>         implementation right now.
> 
>     Two types
> 
>         We could introduce a 32-bit Unicode type alongside the 16-bit
>         type. There is a lot of code that expects there to be only a
>         single Unicode type.
> 
>     This PEP represents the least-effort solution. Over the next
>     several years, 32-bit Unicode characters will become more common
>     and that may either convince us that we need a more sophisticated
>     solution or (on the other hand) convince us that simply
>     mandating wide Unicode characters is an appropriate solution.
>     Right now the two options on the table are do nothing or do
>     this.
> 
> References
> 
>     Unicode Glossary: http://www.unicode.org/glossary/

Plus perhaps the Mark Davis paper at:

http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/

> Copyright
> 
>     This document has been placed in the public domain.

Good work, Paul !

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/