[Python-Dev] PEP 261, Rev 1.3 - Support for "wide" Unicode characters
M.-A. Lemburg
mal at lemburg.com
Mon Jul 2 06:13:59 EDT 2001
Paul Prescod wrote:
>
> PEP: 261
> Title: Support for "wide" Unicode characters
> Version: $Revision: 1.3 $
> Author: paulp at activestate.com (Paul Prescod)
> Status: Draft
> Type: Standards Track
> Created: 27-Jun-2001
> Python-Version: 2.2
> Post-History: 27-Jun-2001
>
> Abstract
>
> Python 2.1 unicode characters can have ordinals only up to
> 2**16 - 1.
> This range corresponds to a range in Unicode known as the Basic
> Multilingual Plane. There are now characters in Unicode that live
> on other "planes". The largest addressable character in Unicode
> has the ordinal 17 * 2**16 - 1 (0x10ffff). For readability, we
> will call this TOPCHAR and call characters in this range "wide
> characters".
>
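For readers who want to check the arithmetic in the abstract, here is a minimal sketch. The constant names other than TOPCHAR are made up for illustration and are not defined by Python or by the PEP.

```python
# Illustrative constants only; PLANE_SIZE, NUM_PLANES and BMP_MAX are
# hypothetical names, TOPCHAR is the PEP's own shorthand.
PLANE_SIZE = 2**16    # number of code points in one Unicode plane
NUM_PLANES = 17       # plane 0 is the Basic Multilingual Plane

BMP_MAX = PLANE_SIZE - 1                # largest ordinal in Python 2.1
TOPCHAR = NUM_PLANES * PLANE_SIZE - 1   # largest Unicode code point

print(hex(BMP_MAX))   # 0xffff
print(hex(TOPCHAR))   # 0x10ffff
```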
> Glossary
>
> Character
>
> Used by itself, means the addressable units of a Python
> Unicode string.
Please add: also known as "code unit".
> Code point
>
> A code point is an integer between 0 and TOPCHAR.
> If you imagine Unicode as a mapping from integers to
> characters, each integer is a code point. But the
> integers between 0 and TOPCHAR that do not map to
> characters are also code points. Some will someday
> be used for characters. Some are guaranteed never
> to be used for characters.
>
> Codec
>
> A set of functions for translating between physical
> encodings (e.g. on disk or coming in from a network)
> and logical Python objects.
>
> Encoding
>
> Mechanism for representing abstract characters in terms of
> physical bits and bytes. Encodings allow us to store
> Unicode characters on disk and transmit them over networks
> in a manner that is compatible with other Unicode software.
>
> Surrogate pair
>
> Two physical characters that represent a single logical
Eeek... two code units (or have you ever seen a physical character
walking around ;-)
> character. Part of a convention for representing 32-bit
> code points in terms of two 16-bit code points.
>
> Unicode string
>
> A Python type representing a sequence of code points with
> "string semantics" (e.g. case conversions, regular
> expression compatibility, etc.) Constructed with the
> unicode() function.
>
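To make the "Surrogate pair" entry above concrete, here is a rough sketch of the UTF-16 convention: how one code point above 0xFFFF maps to two 16-bit code units and back. The function names are hypothetical helpers, not part of Python or of the PEP.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into (high, low) 16-bit code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the range 0x10000..0x10FFFF")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))   # 0xd800 0xdc00
```

The round trip from_surrogate_pair(*to_surrogate_pair(cp)) should give back cp for every cp in 0x10000..0x10FFFF.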
> Proposed Solution
>
> One solution would be to merely increase the maximum ordinal
> to a larger value. Unfortunately the only straightforward
> implementation of this idea is to use 4 bytes per character.
> This has the effect of doubling the size of most Unicode
> strings. In order to avoid imposing this cost on every
> user, Python 2.2 will allow the 4-byte implementation as a
> build-time option. Users can choose whether they care about
> wide characters or prefer to preserve memory.
>
> The 4-byte option is called "wide Py_UNICODE". The 2-byte option
> is called "narrow Py_UNICODE".
>
> Most things will behave identically in the wide and narrow worlds.
>
> * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
> length-one string.
>
> * unichr(i) for 2**16 <= i <= TOPCHAR will return a
> length-one string on wide Python builds. On narrow builds it will
> raise ValueError.
>
> ISSUE
>
> Python currently allows \U literals that cannot be
> represented as a single Python character. It generates two
> Python characters known as a "surrogate pair". Should this
> be disallowed on future narrow Python builds?
>
> Pro:
>
> Python already allows the construction of a surrogate pair
> for a large unicode literal character escape sequence.
> This is basically designed as a simple way to construct
> "wide characters" even in a narrow Python build. It is also
> somewhat logical considering that the Unicode-literal syntax
> is basically a short-form way of invoking the unicode-escape
> codec.
>
> Con:
>
> Surrogates could be easily created this way but the user
> still needs to be careful about slicing, indexing, printing
> etc. Therefore some have suggested that Unicode
> literals should not support surrogates.
>
> ISSUE
>
> Should Python allow the construction of characters that do
> not correspond to Unicode code points? Unassigned Unicode
> code points should obviously be legal (because they could
> be assigned at any time). But code points above TOPCHAR are
> guaranteed never to be used by Unicode. Should we allow access
> to them anyhow?
>
> Pro:
>
> If a Python user thinks they know what they're doing why
> should we try to prevent them from violating the Unicode
> spec? After all, we don't stop 8-bit strings from
> containing non-ASCII characters.
>
> Con:
>
> Codecs and other Unicode-consuming code will have to be
> careful of these characters which are disallowed by the
> Unicode specification.
>
> * ord() is always the inverse of unichr()
>
> * There is an integer value in the sys module that describes the
> largest ordinal for a character in a Unicode string on the current
> interpreter. sys.maxunicode is 2**16-1 (0xffff) on narrow builds
> of Python and TOPCHAR on wide builds.
>
> ISSUE: Should there be distinct constants for accessing
> TOPCHAR and the real upper bound for the domain of
> unichr (if they differ)? There has also been a
> suggestion of sys.unicodewidth which can take the
> values 'wide' and 'narrow'.
>
> * every Python Unicode character represents exactly one Unicode code
> point (i.e. Python Unicode Character = Abstract Unicode
> character).
>
> * codecs will be upgraded to support "wide characters"
> (represented directly in UCS-4, and as variable-length sequences
> in UTF-8 and UTF-16). This is the main part of the implementation
> left to be done.
>
> * There is a convention in the Unicode world for encoding a 32-bit
> code point in terms of two 16-bit code points. These are known
> as "surrogate pairs". Python's codecs will adopt this convention
> and encode 32-bit code points as surrogate pairs on narrow Python
> builds.
>
> ISSUE
>
> Should there be a way to tell codecs not to generate
> surrogates and instead treat wide characters as
> errors?
>
> Pro:
>
> I might want to write code that works only with
> fixed-width characters and does not have to worry about
> surrogates.
>
> Con:
>
> No clear proposal of how to communicate this to codecs.
No need to pass this information to the codec: simply write
a new one and give it a clear name, e.g. "ucs-2" will generate
errors while "utf-16-le" converts them to surrogates.
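A rough sketch of what I mean, written against current Python 3 (an assumption, since the PEP targets 2.2). The function ucs2_encode is a hypothetical strict encoder, not an existing codec, and it is not registered with the codec machinery here. It is compared with the built-in "utf-16-le" codec, which emits surrogates.

```python
import struct


def ucs2_encode(text):
    """Hypothetical strict "ucs-2" encoder: little-endian, no surrogates."""
    out = bytearray()
    for index, ch in enumerate(text):
        code_point = ord(ch)
        if code_point > 0xFFFF:
            # Wide characters are an error instead of becoming a surrogate pair.
            raise UnicodeEncodeError("ucs-2", text, index, index + 1,
                                     "code point outside the BMP")
        out += struct.pack("<H", code_point)   # one 16-bit unit per character
    return bytes(out)


bmp_text = "ab"
wide_text = "\U00010000"

# For BMP-only text both encoders agree.
same = ucs2_encode(bmp_text) == bmp_text.encode("utf-16-le")

# For a wide character, utf-16-le produces a surrogate pair ...
pair = wide_text.encode("utf-16-le")

# ... while the strict encoder refuses.
try:
    ucs2_encode(wide_text)
    refused = False
except UnicodeEncodeError:
    refused = True
```

The caller picks the behaviour by picking the codec name, so no extra flag has to be passed down to the codec.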
> * there are no restrictions on constructing strings that use
> code points "reserved for surrogates" improperly. These are
> called "isolated surrogates". The codecs should disallow reading
> these from files, but you could construct them using string
> literals or unichr().
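Several of the bullets above can be checked from a running interpreter. The sketch below is written for current Python 3, where chr() plays the role of unichr(); that is an assumption outside the PEP, which describes Python 2.2, so treat it as an illustration of the intended behaviour and not as the 2.2 implementation.

```python
import sys

# sys.maxunicode is 0xFFFF on a narrow build and 0x10FFFF on a wide one.
is_wide = sys.maxunicode > 0xFFFF

# ord() is the inverse of chr() (unichr() in Python 2) for representable ordinals.
round_trip_ok = all(ord(chr(i)) == i for i in (0, 0x41, 0xFFFF))

# An isolated surrogate can be constructed directly ...
lone = chr(0xDC00)

# ... but a codec reading one from bytes should reject it.
# The bytes are a lone low surrogate followed by "A", in UTF-16-LE.
try:
    b"\x00\xdcA\x00".decode("utf-16-le")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```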
>
> Implementation
>
> There is a new (experimental) define:
>
> #define PY_UNICODE_SIZE 2
>
> There is a new configure option:
>
> --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses
> wchar_t if it fits
> --enable-unicode=ucs4 configures a wide Py_UNICODE, and uses
> wchar_t if it fits
> --enable-unicode same as "=ucs2"
>
> The intention is that --disable-unicode, or --enable-unicode=no
> removes the Unicode type altogether; this is not yet implemented.
>
> It is also proposed that one day --enable-unicode will just
> default to the width of your platform's wchar_t.
>
> Windows builds will be narrow for a while based on the fact that
> there have been few requests for wide characters, those requests
> are mostly from hard-core programmers with the ability to buy
> their own Python and Windows itself is strongly biased towards
> 16-bit characters.
>
> Notes
>
> This PEP does NOT imply that people using Unicode need to use a
> 4-byte encoding for their files on disk or sent over the network.
> It only allows them to do so. For example, ASCII is still a
> legitimate (7-bit) Unicode-encoding.
>
> It has been proposed that there should be a module that handles
> surrogates in narrow Python builds for programmers. If someone
> wants to implement that, it will be another PEP. It might also be
> combined with features that allow other kinds of character-,
> word- and line- based indexing.
>
> Rejected Suggestions
>
> More or less the status-quo
>
> We could officially say that Python characters are 16-bit and
> require programmers to implement wide characters in their
> application logic by combining surrogate pairs. This is a heavy
> burden because emulating 32-bit characters is likely to be
> very inefficient if it is coded entirely in Python. Plus these
> abstracted pseudo-strings would not be legal as input to the
> regular expression engine.
>
> "Space-efficient Unicode" type
>
> Another class of solution is to use some efficient storage
> internally but present an abstraction of wide characters to
> the programmer. Any of these would require a much more complex
> implementation than the accepted solution. For instance consider
> the impact on the regular expression engine. In theory, we could
> move to this implementation in the future without breaking Python
> code. A future Python could "emulate" wide Python semantics on
> narrow Python. Guido is not willing to undertake the
> implementation right now.
>
> Two types
>
> We could introduce a 32-bit Unicode type alongside the 16-bit
> type. There is a lot of code that expects there to be only a
> single Unicode type.
>
> This PEP represents the least-effort solution. Over the next
> several years, 32-bit Unicode characters will become more common
> and that may either convince us that we need a more sophisticated
> solution or (on the other hand) convince us that simply
> mandating wide Unicode characters is an appropriate solution.
> Right now the two options on the table are do nothing or do
> this.
>
> References
>
> Unicode Glossary: http://www.unicode.org/glossary/
Plus perhaps the Mark Davis paper at:
http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/
> Copyright
>
> This document has been placed in the public domain.
Good work, Paul!
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/