[I18n-sig] Support for "wide" Unicode characters
Paul Prescod
paulp@ActiveState.com
Wed, 27 Jun 2001 22:25:12 -0700
Round 2: I can't check in right now but I'll collect another round of
suggestions and then post this to other lists tomorrow.
----
PEP: 261
Title: Support for "wide" Unicode characters
Version: $Revision: 1.2 $
Author: paulp@activestate.com (Paul Prescod)
Status: Draft
Type: Standards Track
Created: 27-Jun-2001
Python-Version: 2.2
Post-History: 27-Jun-2001
Abstract
Python 2.1 unicode characters can have ordinals only up to
2**16 - 1 (65535). These characters are known as Basic
Multilingual Plane characters.
There are now characters in Unicode that live on other "planes".
The largest addressable character in Unicode has the ordinal
17 * 2**16 - 1. For readability, we will call this TOPCHAR.
Proposed Solution
One solution would be to merely increase the maximum ordinal to a
larger value. Unfortunately the only straightforward
implementation of this idea is to increase the character code unit
to 4 bytes. This has the effect of doubling the size of most
Unicode strings. In order to avoid imposing this cost on every
user, Python 2.2 will allow 4-byte Unicode characters as a
build-time option.
The 4-byte option is called "wide Py_UNICODE". The 2-byte option
is called "narrow Py_UNICODE".
Most things will behave identically in the wide and narrow worlds.
* the \u and \U literal syntaxes will always generate the same
data that the unichr function would. They are just different
syntaxes for the same thing.
* unichr(i) for 0 <= i < 2**16 always returns a length-one string.
* unichr(i) for 2**16 <= i <= TOPCHAR will always return a
string representing the character.
* BUT on narrow builds of Python, the string will actually be
composed of two characters (in the Python, not Unicode sense)
called a "surrogate pair". These two Python characters are
logically one Unicode character.
ISSUE: Should Python return surrogate pairs on narrow builds
or should it just disallow them?
ISSUE: Should the upper bound of the domain of unichr and
range of ord() be TOPCHAR or 2**32-1 or even 2**31?
* ord() will now accept surrogate pairs and return the ordinal of
the "wide" character.
ISSUE: Should Python accept surrogate pairs on wide
Python builds?
* There is an integer value in the sys module that describes the
largest ordinal for a Unicode character on the current
interpreter. sys.maxunicode is 2**16-1 on narrow builds of
Python.
ISSUE: Should sys.maxunicode be TOPCHAR or 2**32-1 or even
2**31 on wide builds?
ISSUE: Should there be distinct constants for accessing
TOPCHAR and the real upper bound for the domain of
unichr?
* Note that ord() can in some cases return ordinals higher than
sys.maxunicode because it accepts surrogate pairs on narrow
Python builds.
* codecs will be upgraded to support "wide characters"
(represented directly in UCS-4, as surrogate pairs in UTF-16 and
as multi-byte sequences in UTF-8). On narrow Python builds, the
codecs will generate surrogate pairs, on wide Python builds they
will generate a single character. This is the main part of the
implementation left to be done.
* there are no restrictions on constructing strings that use
code points "reserved for surrogates" improperly. These are
called "lone surrogates". The codecs should disallow reading
these but you could construct them using string literals or
unichr(). unichr() is not restricted to values less than
either TOPCHAR or sys.maxunicode.
ISSUE: Should lone surrogates be allowed as input to ord even
on wide platforms where they "should" not occur?
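The surrogate-pair mechanics the list above relies on can be sketched in ordinary Python. The pairing arithmetic is fixed by the Unicode standard; the helper names here are purely illustrative and not part of any proposed API:

```python
# Encode a supplementary code point (>= 2**16) as a UTF-16
# surrogate pair, and decode a pair back into the code point.
# This is what a narrow build does implicitly for unichr()/ord().

def to_surrogate_pair(cp):
    """Split a code point above the BMP into (high, low) surrogates."""
    assert 0x10000 <= cp <= 0x10FFFF  # TOPCHAR == 17 * 2**16 - 1
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)     # high (leading) surrogate
    low = 0xDC00 + (cp & 0x3FF)    # low (trailing) surrogate
    return high, low

def from_surrogate_pair(high, low):
    """Combine a surrogate pair back into one code point."""
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(to_surrogate_pair(0x10000))  # -> (55296, 56320), i.e. 0xD800, 0xDC00
```

Round-tripping any code point from 2**16 up to TOPCHAR through these two helpers returns the original value, which is exactly the property a narrow build needs for ord() on a surrogate pair to be well defined.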
Implementation
There is a new (experimental) define:
#define PY_UNICODE_SIZE 2
There are new configure options:
--enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses
wchar_t if it fits
--enable-unicode=ucs4 configures a wide Py_UNICODE likewise
--enable-unicode same as "=ucs2"
The intention is that --disable-unicode, or --enable-unicode=no
removes the Unicode type altogether; this is not yet implemented.
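A program can check at runtime which kind of build it is running under by inspecting sys.maxunicode, as described above. This is only a sketch: the exact constant a wide build should report is one of the open issues listed earlier.

```python
import sys

# On a narrow build sys.maxunicode is 2**16 - 1 (0xFFFF); on a wide
# build it is at least TOPCHAR (0x10FFFF).  Which exact value a wide
# build reports is an open issue in this PEP.
if sys.maxunicode == 0xFFFF:
    print("narrow Py_UNICODE build (2-byte code units)")
else:
    print("wide Py_UNICODE build (4-byte code units)")
```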
Notes
Note that len(unichr(i)) == 2 for i >= 2**16 on narrow Python
builds because of the returned surrogate pair.
This means (for example) that the following code is not portable:
    x = 2**16
    if unichr(x) in somestring:
        ...
In general, you should be careful using "in" if the character
being searched for could have been generated either from unichr
applied to a number greater than or equal to 2**16 or from a
string literal escape denoting such a character.
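One way to make such a membership test portable across narrow and wide builds is to compare sequences of code points rather than raw code units. The helper below is illustrative only, not a proposed API; it reassembles surrogate pairs before comparing:

```python
def codepoints(s):
    """Yield true code points, joining surrogate pairs where present."""
    i = 0
    while i < len(s):
        cp = ord(s[i])
        if 0xD800 <= cp <= 0xDBFF and i + 1 < len(s):
            low = ord(s[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                # Combine the high/low surrogates into one code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
                i += 1
        yield cp
        i += 1

def contains_char(haystack, cp):
    """Portable replacement for `unichr(cp) in haystack`."""
    return cp in codepoints(haystack)

# A surrogate pair in the text is found as the single code point 2**16:
s = u"ab" + u"\ud800\udc00"
print(contains_char(s, 0x10000))  # -> True
```

Because the comparison happens on reassembled code points, the result does not depend on whether the interpreter stores the character as one code unit or two.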
This PEP does NOT imply that people using Unicode need to use a
4-byte encoding. It only allows them to do so. For example,
ASCII is still a legitimate (7-bit) Unicode encoding.
Rationale for Surrogate Creation Behaviour
Python currently supports the construction of a surrogate pair
for a large Unicode literal character escape sequence. This is
basically designed as a simple way to construct "wide characters"
even in a narrow Python build.
ISSUE: surrogates can be created this way but the user still
needs to be careful about slicing, indexing, printing
etc. Another option is to remove knowledge of
surrogates from everything other than the codecs.
Rejected Suggestions
There were two primary solutions that were rejected. The first was
more or less the status quo. We could officially say that UTF-16
is the Python character encoding and require programmers to
implement wide characters in their application logic. This is a
heavy burden because emulating 32-bit characters is likely to be
very inefficient if it is coded entirely in Python.
The other solution is to use UTF-16 (or even UTF-8) internally
(for efficiency) but present an abstraction of 32-bit characters
to the programmer. This would require a much more complex
implementation than the accepted solution. In theory, we could
move to this implementation in the future without breaking Python
code. It would just emulate a wide Python build on narrow
Pythons.
Copyright
This document has been placed in the public domain.
Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: