[Python-Dev] Support for "wide" Unicode characters

M.-A. Lemburg mal@egenix.com
Fri, 29 Jun 2001 16:51:04 +0200


Paul Prescod wrote:
> 
> Slow python-dev day...consider this exciting new proposal to allow
> dealing with important new characters like the Japanese dentistry
> symbols and ecological symbols (but not Klingon)

More comments...

> -------- Original Message --------
> Subject: PEP: Support for "wide" Unicode characters
> Date: Thu, 28 Jun 2001 15:33:00 -0700
> From: Paul Prescod <paulp@ActiveState.com>
> Organization: ActiveState
> To: "python-list@python.org" <python-list@python.org>
> 
> PEP: 261
> Title: Support for "wide" Unicode characters
> Version: $Revision: 1.3 $
> Author: paulp@activestate.com (Paul Prescod)
> Status: Draft
> Type: Standards Track
> Created: 27-Jun-2001
> Python-Version: 2.2
> Post-History: 27-Jun-2001, 28-Jun-2001
> 
> Abstract
> 
>     Python 2.1 unicode characters can have ordinals only up to 2**16-1.
>     These characters are known as Basic Multilingual Plane characters.
>     There are now characters in Unicode that live on other "planes".
>     The largest addressable character in Unicode has the ordinal 17 *
>     2**16 - 1 (0x10ffff). For readability, we will call this TOPCHAR
>     and call characters in this range "wide characters".
> 
> Glossary
> 
>     Character
> 
>         Used by itself, means the addressable units of a Python
>         Unicode string.
>
>     Code point
> 
>         If you imagine Unicode as a mapping from integers to
>         characters, each integer represents a code point. Some are
>         really used for characters. Some will someday be used for
>         characters. Some are guaranteed never to be used for
>         characters.
> 
>     Unicode character
> 
>         A code point defined in the Unicode standard, whether it is
>         already assigned or not. Identified by an integer.

You're mixing terms here: being a character in Unicode is a
property which is defined by the Unicode specs; not all code
points are characters!

I'd suggest not using the term character in this PEP at all;
this is also what Mark Davis recommends in his paper on Unicode.

That way people reading the PEP won't even start to confuse things
since they will most likely have to read this glossary to understand
what code points and code units are.

Also, a link to the Unicode glossary would be a good thing.

>     Code unit
> 
>         An integer representing a character in some encoding.

A code unit is the basic storage unit used by Unicode strings,
e.g. u[0], not necessarily a character.
 
>     Surrogate pair
> 
>         Two code units that represent a single Unicode character.

Please add

      Unicode string

          A sequence of code units.

and a note that on wide builds: code unit == code point.
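
Something like this makes the difference tangible (assuming one
interpreter configured with --enable-unicode=ucs2 and one with
ucs4; the len() results are what the PEP specifies):

    # Narrow (UCS-2) build: the non-BMP character is stored
    # as two code units, i.e. a surrogate pair:
    >>> len(u"\U00010000")
    2
    >>> [hex(ord(cu)) for cu in u"\U00010000"]
    ['0xd800', '0xdc00']

    # Wide (UCS-4) build: one code unit == one code point:
    >>> len(u"\U00010000")
    1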
 
> Proposed Solution
> 
>     One solution would be to merely increase the maximum ordinal to a
>     larger value.  Unfortunately the only straightforward
>     implementation of this idea is to increase the character code unit
>     to 4 bytes.  This has the effect of doubling the size of most
>     Unicode strings.  In order to avoid imposing this cost on every
>     user, Python 2.2 will allow 4-byte Unicode characters as a
>     build-time option. Users can choose whether they care about
>     wide characters or prefer to preserve memory.
> 
>     The 4-byte option is called "wide Py_UNICODE".  The 2-byte option
>     is called "narrow Py_UNICODE".
> 
>     Most things will behave identically in the wide and narrow worlds.
> 
>     * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
>       length-one string.
> 
>     * unichr(i) for 2**16 <= i <= TOPCHAR will return a
>       length-one string representing the character on wide Python
>       builds. On narrow builds it will raise ValueError.
> 
>         ISSUE: Python currently allows \U literals that cannot be
>                represented as a single character. It generates two
>                characters known as a "surrogate pair". Should this be
>                disallowed on future narrow Python builds?

Why not make the codec used by Python to convert Unicode
literals to Unicode strings an option just like the default
encoding?

That way we could have a version of the unicode-escape codec
which supports surrogates and one which doesn't.
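
Untested sketch of how this could look on top of the existing
codecs.register() machinery -- the codec name
'unicode-escape-strict' is made up, and a real patch would of
course have to hook this into the compiler:

    import codecs

    def strict_uescape_decode(input, errors='strict'):
        # Decode with the stock unicode-escape codec first, ...
        output, length = codecs.unicode_escape_decode(input, errors)
        # ... then refuse any code unit in the surrogate range.
        for cu in output:
            if 0xD800 <= ord(cu) <= 0xDFFF:
                raise UnicodeError("surrogates not allowed in literals")
        return output, length

    def search(name):
        if name == 'unicode-escape-strict':
            return (codecs.unicode_escape_encode, strict_uescape_decode,
                    codecs.StreamReader, codecs.StreamWriter)
        return None

    codecs.register(search)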
 
>         ISSUE: Should Python allow the construction of characters
>                that do not correspond to Unicode characters?
>                Unassigned Unicode characters should obviously be legal
>                (because they could be assigned at any time). But
>                code points above TOPCHAR are guaranteed never to
>                be used by Unicode. Should we allow access to them
>                anyhow?

I wouldn't count on that last point ;-)
 
Please note that you are mixing terms: you don't construct
characters, you construct code points. Whether the concatenation
of these code points makes a valid Unicode character string
is an issue which applications and codecs have to decide.
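
E.g. on a wide build nothing stops you from creating code points
which have no character assigned to them (a narrow build would
raise ValueError here, per the PEP):

    >>> u = unichr(0xE0000)    # unassigned plane-14 code point
    >>> ord(u)
    917504
    >>> len(u)
    1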

>     * ord() is always the inverse of unichr()
> 
>     * There is an integer value in the sys module that describes the
>       largest ordinal for a Unicode character on the current
>       interpreter. sys.maxunicode is 2**16-1 (0xffff) on narrow builds
>       of Python and TOPCHAR on wide builds.
> 
>         ISSUE: Should there be distinct constants for accessing
>                TOPCHAR and the real upper bound for the domain of
>                unichr (if they differ)? There has also been a
>                suggestion of sys.unicodewidth which can take the
>                values 'wide' and 'narrow'.
> 
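
FWIW, code that cares could then simply switch on sys.maxunicode
(sticking to maxunicode here, since sys.unicodewidth is only a
suggested spelling so far):

    import sys

    if sys.maxunicode == 0xFFFF:
        print "narrow build: code units are UTF-16 code units"
    else:
        print "wide build: code unit == code point, maxunicode=%#x" \
              % sys.maxunicode
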
>     * codecs will be upgraded to support "wide characters"
>       (represented directly in UCS-4, as surrogate pairs in UTF-16 and
>       as multi-byte sequences in UTF-8). On narrow Python builds, the
>       codecs will generate surrogate pairs, on wide Python builds they
>       will generate a single character. This is the main part of the
>       implementation left to be done.
> 
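
Just to illustrate what the upgraded codecs are expected to do
with a character outside the BMP (these results follow directly
from the UTF-8 and UTF-16 specs; as the PEP says, the codec work
itself is still to be done):

    >>> u"\U00010000".encode('utf-8')
    '\xf0\x90\x80\x80'                # one four-byte UTF-8 sequence
    >>> u"\U00010000".encode('utf-16-be')
    '\xd8\x00\xdc\x00'                # surrogate pair D800 DC00
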
>     * there are no restrictions on constructing strings that use
>       code points "reserved for surrogates" improperly. These are
>       called "isolated surrogates". The codecs should disallow reading
>       these but you could construct them using string literals or
>       unichr(). unichr() is not restricted to values less than either
>       TOPCHAR or sys.maxunicode.
> 
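
Right -- and e.g. this will always be legal at the string level,
leaving it to the codecs to complain:

    >>> lone = unichr(0xD800)    # isolated high surrogate
    >>> len(lone), ord(lone)
    (1, 55296)
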
> Implementation
> 
>     There is a new (experimental) define:
> 
>         #define PY_UNICODE_SIZE 2
> 
>     There are new configure options:
> 
>         --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses
>                               wchar_t if it fits
>         --enable-unicode=ucs4 configures a wide Py_UNICODE, and uses
>                               wchar_t if it fits
>         --enable-unicode      same as "=ucs2"
> 
>     The intention is that --disable-unicode, or --enable-unicode=no
>     removes the Unicode type altogether; this is not yet implemented.
> 
> Notes
> 
>     This PEP does NOT imply that people using Unicode need to use a
>     4-byte encoding.  It only allows them to do so.  For example,
>     ASCII is still a legitimate (7-bit) Unicode-encoding.
> 
> Rationale for Surrogate Creation Behaviour
> 
>     Python currently supports the construction of a surrogate pair
>     for a large Unicode character escape sequence in a literal. This is
>     basically designed as a simple way to construct "wide characters"
>     even in a narrow Python build.
> 
>         ISSUE: surrogates can be created this way but the user still
>                needs to be careful about slicing, indexing, printing
>                etc. Another option is to remove knowledge of
>                surrogates from everything other than the codecs.

+1 on removing knowledge about surrogates from the Unicode
implementation core (it's also the easiest: there is none :-)

We should provide a new module which provides a few handy
utilities though: functions which provide code point-, 
character-, word- and line-based indexing into Unicode
strings.
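
A rough sketch of one such helper (hypothetical; it combines
surrogate pairs into single values, so it gives the same answer
on narrow and wide builds):

    def codepoints(u):
        # Return the code points of a Unicode string as a list
        # of integers, combining surrogate pairs on narrow builds.
        result = []
        i, n = 0, len(u)
        while i < n:
            hi = ord(u[i])
            if 0xD800 <= hi <= 0xDBFF and i + 1 < n:
                lo = ord(u[i + 1])
                if 0xDC00 <= lo <= 0xDFFF:
                    result.append(0x10000 +
                                  ((hi - 0xD800) << 10) +
                                  (lo - 0xDC00))
                    i = i + 2
                    continue
            result.append(hi)
            i = i + 1
        return result

Character-, word- and line-based indexing could then be built
on top of this.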

> Rejected Suggestions
> 
>     There were two primary solutions that were rejected. The first was
>     more or less the status-quo. We could officially say that Python
>     characters represent UTF-16 code units and require programmers to
>     implement wide characters in their application logic. This is a
>     heavy burden because emulating 32-bit characters is likely to be
>     very inefficient if it is coded entirely in Python. Plus these
>     abstracted pseudo-strings would not be legal as input to the
>     regular expression engine.
> 
>     The other class of solution is to use some efficient storage
>     internally but present an abstraction of wide characters
>     to the programmer. Any of these would require a much more complex
>     implementation than the accepted solution. For instance consider
>     the impact on the regular expression engine. In theory, we could
>     move to this implementation in the future without breaking Python
>     code. A future Python could "emulate" wide Python semantics on
>     narrow Python.
> 
> Copyright
> 
>     This document has been placed in the public domain.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/