PEP 261, Rev 1.3 - Support for "wide" Unicode characters
PEP: 261
Title: Support for "wide" Unicode characters
Version: $Revision: 1.3 $
Author: paulp@activestate.com (Paul Prescod)
Status: Draft
Type: Standards Track
Created: 27-Jun-2001
Python-Version: 2.2
Post-History: 27-Jun-2001
Abstract
Python 2.1 unicode characters can have ordinals only up to 2**16 - 1. This range corresponds to a range in Unicode known as the Basic Multilingual Plane. There are now characters in Unicode that live on other "planes". The largest addressable character in Unicode has the ordinal 17 * 2**16 - 1 (0x10ffff). For readability, we will call this TOPCHAR and call characters in this range "wide characters".
Glossary
Character
Used by itself, means the addressable units of a Python Unicode string.
Code point
A code point is an integer between 0 and TOPCHAR. If you imagine Unicode as a mapping from integers to characters, each integer is a code point. But the integers between 0 and TOPCHAR that do not map to characters are also code points. Some will someday be used for characters. Some are guaranteed never to be used for characters.
Codec
A set of functions for translating between physical encodings (e.g. on disk or coming in from a network) and logical Python objects.
Encoding
Mechanism for representing abstract characters in terms of physical bits and bytes. Encodings allow us to store Unicode characters on disk and transmit them over networks in a manner that is compatible with other Unicode software.
Surrogate pair
Two physical characters that represent a single logical character. Part of a convention for representing 32-bit code points in terms of two 16-bit code points.
Unicode string
A Python type representing a sequence of code points with "string semantics" (e.g. case conversions, regular expression compatibility, etc.) Constructed with the unicode() function.
Proposed Solution
One solution would be to merely increase the maximum ordinal to a larger value. Unfortunately the only straightforward implementation of this idea is to use 4 bytes per character. This has the effect of doubling the size of most Unicode strings. In order to avoid imposing this cost on every user, Python 2.2 will allow the 4-byte implementation as a build-time option. Users can choose whether they care about wide characters or prefer to preserve memory.
The 4-byte option is called "wide Py_UNICODE". The 2-byte option is called "narrow Py_UNICODE".
Most things will behave identically in the wide and narrow worlds.
* unichr(i) for 0 <= i < 2**16 (0x10000) always returns a length-one string.
* unichr(i) for 2**16 <= i <= TOPCHAR will return a length-one string on wide Python builds. On narrow builds it will raise ValueError.
ISSUE
Python currently allows \U literals that cannot be represented as a single Python character. It generates two Python characters known as a "surrogate pair". Should this be disallowed on future narrow Python builds?
Pro:
Python already allows the construction of a surrogate pair for a large unicode literal character escape sequence. This is basically designed as a simple way to construct "wide characters" even in a narrow Python build. It is also somewhat logical considering that the Unicode-literal syntax is basically a short-form way of invoking the unicode-escape codec.
Con:
Surrogates could be easily created this way but the user still needs to be careful about slicing, indexing, printing etc. Therefore some have suggested that Unicode literals should not support surrogates.
ISSUE
Should Python allow the construction of characters that do not correspond to Unicode code points? Unassigned Unicode code points should obviously be legal (because they could be assigned at any time). But code points above TOPCHAR are guaranteed never to be used by Unicode. Should we allow access to them anyhow?
Pro:
If a Python user thinks they know what they're doing, why should we try to prevent them from violating the Unicode spec? After all, we don't stop 8-bit strings from containing non-ASCII characters.
Con:
Codecs and other Unicode-consuming code will have to be careful of these characters which are disallowed by the Unicode specification.
* ord() is always the inverse of unichr()
* There is an integer value in the sys module that describes the largest ordinal for a character in a Unicode string on the current interpreter. sys.maxunicode is 2**16-1 (0xffff) on narrow builds of Python and TOPCHAR on wide builds.
ISSUE: Should there be distinct constants for accessing TOPCHAR and the real upper bound for the domain of unichr (if they differ)? There has also been a suggestion of sys.unicodewidth which can take the values 'wide' and 'narrow'.
* every Python Unicode character represents exactly one Unicode code point (i.e. Python Unicode Character = Abstract Unicode character).
* codecs will be upgraded to support "wide characters" (represented directly in UCS-4, and as variable-length sequences in UTF-8 and UTF-16). This is the main part of the implementation left to be done.
* There is a convention in the Unicode world for encoding a 32-bit code point in terms of two 16-bit code points. These are known as "surrogate pairs". Python's codecs will adopt this convention and encode 32-bit code points as surrogate pairs on narrow Python builds.
ISSUE
Should there be a way to tell codecs not to generate surrogates and instead treat wide characters as errors?
Pro:
I might want to write code that works only with fixed-width characters and does not have to worry about surrogates.
Con:
No clear proposal of how to communicate this to codecs.
* there are no restrictions on constructing strings that use code points "reserved for surrogates" improperly. These are called "isolated surrogates". The codecs should disallow reading these from files, but you could construct them using string literals or unichr().
Implementation
There is a new (experimental) define:
#define PY_UNICODE_SIZE 2
There is a new configure option:
--enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses wchar_t if it fits
--enable-unicode=ucs4 configures a wide Py_UNICODE, and uses wchar_t if it fits
--enable-unicode same as "=ucs2"
The intention is that --disable-unicode, or --enable-unicode=no removes the Unicode type altogether; this is not yet implemented.
It is also proposed that one day --enable-unicode will just default to the width of your platform's wchar_t.
Windows builds will be narrow for a while, for three reasons: there have been few requests for wide characters; those requests are mostly from hard-core programmers with the ability to buy their own Python; and Windows itself is strongly biased towards 16-bit characters.
Notes
This PEP does NOT imply that people using Unicode need to use a 4-byte encoding for their files on disk or sent over the network. It only allows them to do so. For example, ASCII is still a legitimate (7-bit) Unicode encoding.
It has been proposed that there should be a module that handles surrogates in narrow Python builds for programmers. If someone wants to implement that, it will be another PEP. It might also be combined with features that allow other kinds of character-, word- and line-based indexing.
Rejected Suggestions
More or less the status-quo
We could officially say that Python characters are 16-bit and require programmers to implement wide characters in their application logic by combining surrogate pairs. This is a heavy burden because emulating 32-bit characters is likely to be very inefficient if it is coded entirely in Python. Plus these abstracted pseudo-strings would not be legal as input to the regular expression engine.
"Space-efficient Unicode" type
Another class of solution is to use some efficient storage internally but present an abstraction of wide characters to the programmer. Any of these would require a much more complex implementation than the accepted solution. For instance consider the impact on the regular expression engine. In theory, we could move to this implementation in the future without breaking Python code. A future Python could "emulate" wide Python semantics on narrow Python. Guido is not willing to undertake the implementation right now.
Two types
We could introduce a 32-bit Unicode type alongside the 16-bit type. There is a lot of code that expects there to be only a single Unicode type.
This PEP represents the least-effort solution. Over the next several years, 32-bit Unicode characters will become more common and that may either convince us that we need a more sophisticated solution or (on the other hand) convince us that simply mandating wide Unicode characters is an appropriate solution. Right now the two options on the table are do nothing or do this.
References
Unicode Glossary: http://www.unicode.org/glossary/
Copyright
This document has been placed in the public domain.
Local Variables:
mode: indented-text
indent-tabs-mode: nil
End:
--
Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook
Paul Prescod wrote:
PEP: 261
Title: Support for "wide" Unicode characters
Version: $Revision: 1.3 $
Author: paulp@activestate.com (Paul Prescod)
Status: Draft
Type: Standards Track
Created: 27-Jun-2001
Python-Version: 2.2
Post-History: 27-Jun-2001
Abstract
Python 2.1 unicode characters can have ordinals only up to 2**16 - 1. This range corresponds to a range in Unicode known as the Basic Multilingual Plane. There are now characters in Unicode that live on other "planes". The largest addressable character in Unicode has the ordinal 17 * 2**16 - 1 (0x10ffff). For readability, we will call this TOPCHAR and call characters in this range "wide characters".
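As a quick check of the arithmetic in the abstract, here is a minimal sketch written for Python 3 (which postdates this draft). The names PLANE_SIZE, BMP_MAX and plane_of are illustrative labels only, and TOPCHAR is the draft's shorthand, not a Python constant.

```python
# Illustrative constants; the names follow the draft's wording, not any Python API.
PLANE_SIZE = 2 ** 16
BMP_MAX = PLANE_SIZE - 1        # largest ordinal in the Basic Multilingual Plane
TOPCHAR = 17 * PLANE_SIZE - 1   # largest addressable Unicode ordinal


def plane_of(code_point):
    """Return the plane number (0 is the BMP) that a code point belongs to."""
    if not 0 <= code_point <= TOPCHAR:
        raise ValueError("not a Unicode code point: %r" % (code_point,))
    return code_point // PLANE_SIZE


print(hex(BMP_MAX), hex(TOPCHAR), plane_of(TOPCHAR))
```

The 17 planes are numbered 0 through 16, so the last plane ends exactly at 0x10ffff.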
Glossary
Character
Used by itself, means the addressable units of a Python Unicode string.
Please add: also known as "code unit".
Code point
A code point is an integer between 0 and TOPCHAR. If you imagine Unicode as a mapping from integers to characters, each integer is a code point. But the integers between 0 and TOPCHAR that do not map to characters are also code points. Some will someday be used for characters. Some are guaranteed never to be used for characters.
Codec
A set of functions for translating between physical encodings (e.g. on disk or coming in from a network) and logical Python objects.
Encoding
Mechanism for representing abstract characters in terms of physical bits and bytes. Encodings allow us to store Unicode characters on disk and transmit them over networks in a manner that is compatible with other Unicode software.
Surrogate pair
Two physical characters that represent a single logical
Eeek... two code units (or have you ever seen a physical character walking around ;-)
character. Part of a convention for representing 32-bit code points in terms of two 16-bit code points.
Unicode string
A Python type representing a sequence of code points with "string semantics" (e.g. case conversions, regular expression compatibility, etc.) Constructed with the unicode() function.
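The surrogate-pair convention mentioned in the glossary can be sketched as plain integer arithmetic. This is an illustration of the UTF-16 scheme in Python 3, not code from the PEP; the function names are hypothetical.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into (high, low) 16-bit surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # at most 20 significant bits
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low), hex(from_surrogate_pair(high, low)))
```

Because the pair carries only 20 bits plus the 0x10000 offset, the convention reaches TOPCHAR and no further.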
Proposed Solution
One solution would be to merely increase the maximum ordinal to a larger value. Unfortunately the only straightforward implementation of this idea is to use 4 bytes per character. This has the effect of doubling the size of most Unicode strings. In order to avoid imposing this cost on every user, Python 2.2 will allow the 4-byte implementation as a build-time option. Users can choose whether they care about wide characters or prefer to preserve memory.
The 4-byte option is called "wide Py_UNICODE". The 2-byte option is called "narrow Py_UNICODE".
Most things will behave identically in the wide and narrow worlds.
* unichr(i) for 0 <= i < 2**16 (0x10000) always returns a length-one string.
* unichr(i) for 2**16 <= i <= TOPCHAR will return a length-one string on wide Python builds. On narrow builds it will raise ValueError.
ISSUE
Python currently allows \U literals that cannot be represented as a single Python character. It generates two Python characters known as a "surrogate pair". Should this be disallowed on future narrow Python builds?
Pro:
Python already allows the construction of a surrogate pair for a large unicode literal character escape sequence. This is basically designed as a simple way to construct "wide characters" even in a narrow Python build. It is also somewhat logical considering that the Unicode-literal syntax is basically a short-form way of invoking the unicode-escape codec.
Con:
Surrogates could be easily created this way but the user still needs to be careful about slicing, indexing, printing etc. Therefore some have suggested that Unicode literals should not support surrogates.
ISSUE
Should Python allow the construction of characters that do not correspond to Unicode code points? Unassigned Unicode code points should obviously be legal (because they could be assigned at any time). But code points above TOPCHAR are guaranteed never to be used by Unicode. Should we allow access to them anyhow?
Pro:
If a Python user thinks they know what they're doing, why should we try to prevent them from violating the Unicode spec? After all, we don't stop 8-bit strings from containing non-ASCII characters.
Con:
Codecs and other Unicode-consuming code will have to be careful of these characters which are disallowed by the Unicode specification.
* ord() is always the inverse of unichr()
* There is an integer value in the sys module that describes the largest ordinal for a character in a Unicode string on the current interpreter. sys.maxunicode is 2**16-1 (0xffff) on narrow builds of Python and TOPCHAR on wide builds.
ISSUE: Should there be distinct constants for accessing TOPCHAR and the real upper bound for the domain of unichr (if they differ)? There has also been a suggestion of sys.unicodewidth which can take the values 'wide' and 'narrow'.
* every Python Unicode character represents exactly one Unicode code point (i.e. Python Unicode Character = Abstract Unicode character).
* codecs will be upgraded to support "wide characters" (represented directly in UCS-4, and as variable-length sequences in UTF-8 and UTF-16). This is the main part of the implementation left to be done.
* There is a convention in the Unicode world for encoding a 32-bit code point in terms of two 16-bit code points. These are known as "surrogate pairs". Python's codecs will adopt this convention and encode 32-bit code points as surrogate pairs on narrow Python builds.
ISSUE
Should there be a way to tell codecs not to generate surrogates and instead treat wide characters as errors?
Pro:
I might want to write code that works only with fixed-width characters and does not have to worry about surrogates.
Con:
No clear proposal of how to communicate this to codecs.
No need to pass this information to the codec: simply write a new one and give it a clear name, e.g. "ucs-2" will generate errors while "utf-16-le" converts them to surrogates.
* there are no restrictions on constructing strings that use code points "reserved for surrogates" improperly. These are called "isolated surrogates". The codecs should disallow reading these from files, but you could construct them using string literals or unichr().
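The unichr(), ord() and sys.maxunicode bullets above can be illustrated with a small sketch. It is written for Python 3, where chr() plays the role of unichr() and builds are no longer split into narrow and wide, so the build limit is passed in explicitly. The helper unichr_for_build is hypothetical and only mimics the behaviour the draft proposes.

```python
import sys

NARROW_MAX = 2 ** 16 - 1   # sys.maxunicode on a narrow build, per the draft
TOPCHAR = 0x10FFFF         # sys.maxunicode on a wide build, per the draft


def unichr_for_build(i, maxunicode=sys.maxunicode):
    """Hypothetical stand-in for unichr() on a build with the given limit."""
    if not 0 <= i <= maxunicode:
        raise ValueError("ordinal %r not in range(0x%x)" % (i, maxunicode + 1))
    return chr(i)


# ord() is the inverse of unichr() wherever unichr() succeeds.
assert ord(unichr_for_build(0x41, NARROW_MAX)) == 0x41

# An isolated surrogate can be constructed, but a codec may refuse to write it.
lone = unichr_for_build(0xD800, NARROW_MAX)
try:
    lone.encode("utf-8")
    codec_accepts_lone_surrogate = True
except UnicodeEncodeError:
    codec_accepts_lone_surrogate = False

print(codec_accepts_lone_surrogate)
```

Code that must run on both kinds of build could branch on sys.maxunicode in the same way rather than catching ValueError.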
Implementation
There is a new (experimental) define:
#define PY_UNICODE_SIZE 2
There is a new configure option:
--enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses wchar_t if it fits
--enable-unicode=ucs4 configures a wide Py_UNICODE, and uses wchar_t if it fits
--enable-unicode same as "=ucs2"
The intention is that --disable-unicode, or --enable-unicode=no removes the Unicode type altogether; this is not yet implemented.
It is also proposed that one day --enable-unicode will just default to the width of your platform's wchar_t.
Windows builds will be narrow for a while, for three reasons: there have been few requests for wide characters; those requests are mostly from hard-core programmers with the ability to buy their own Python; and Windows itself is strongly biased towards 16-bit characters.
Notes
This PEP does NOT imply that people using Unicode need to use a 4-byte encoding for their files on disk or sent over the network. It only allows them to do so. For example, ASCII is still a legitimate (7-bit) Unicode encoding.
It has been proposed that there should be a module that handles surrogates in narrow Python builds for programmers. If someone wants to implement that, it will be another PEP. It might also be combined with features that allow other kinds of character-, word- and line-based indexing.
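To make the point about on-disk encodings concrete, the following sketch (Python 3; encoded_sizes is a hypothetical helper) encodes the same string several ways and reports the byte counts. The in-memory width of a character and the size of its external encoding are independent choices.

```python
def encoded_sizes(text, encodings=("utf-8", "utf-16-le", "utf-32-le")):
    """Return {encoding name: byte length} after checking each round trip."""
    sizes = {}
    for name in encodings:
        data = text.encode(name)
        if data.decode(name) != text:
            raise AssertionError("round trip failed for " + name)
        sizes[name] = len(data)
    return sizes


# Six ASCII characters followed by one character outside the BMP.
sample = "wide: " + chr(0x10400)
print(encoded_sizes(sample))

# Pure ASCII text stays at one byte per character in ASCII and UTF-8.
print(encoded_sizes("plain text", encodings=("ascii", "utf-8")))
```

Only the 4-byte encoding spends four bytes on every character; UTF-8 and UTF-16 spend them only on the wide one.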
Rejected Suggestions
More or less the status-quo
We could officially say that Python characters are 16-bit and require programmers to implement wide characters in their application logic by combining surrogate pairs. This is a heavy burden because emulating 32-bit characters is likely to be very inefficient if it is coded entirely in Python. Plus these abstracted pseudo-strings would not be legal as input to the regular expression engine.
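A rough idea of what that application-level burden looks like, as a sketch in Python 3: the string is modelled as a list of 16-bit integers, and both function names are hypothetical. Logical indexing becomes a linear scan because a surrogate pair makes the element width variable.

```python
def iter_code_points(units):
    """Yield code points from a sequence of 16-bit integers, combining pairs."""
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # High surrogate followed by low surrogate: one logical character.
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP character or isolated surrogate, passed through unchanged.
            yield unit
            i += 1


def code_point_at(units, index):
    """Logical indexing by code point; O(n) because pairs vary the width."""
    for pos, cp in enumerate(iter_code_points(units)):
        if pos == index:
            return cp
    raise IndexError(index)


units = [0x41, 0xD834, 0xDD1E, 0x42]
print([hex(cp) for cp in iter_code_points(units)], hex(code_point_at(units, 2)))
```

Every slicing, searching and length operation would need a similar wrapper, which is the inefficiency the paragraph above refers to.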
"Space-efficient Unicode" type
Another class of solution is to use some efficient storage internally but present an abstraction of wide characters to the programmer. Any of these would require a much more complex implementation than the accepted solution. For instance consider the impact on the regular expression engine. In theory, we could move to this implementation in the future without breaking Python code. A future Python could "emulate" wide Python semantics on narrow Python. Guido is not willing to undertake the implementation right now.
Two types
We could introduce a 32-bit Unicode type alongside the 16-bit type. There is a lot of code that expects there to be only a single Unicode type.
This PEP represents the least-effort solution. Over the next several years, 32-bit Unicode characters will become more common and that may either convince us that we need a more sophisticated solution or (on the other hand) convince us that simply mandating wide Unicode characters is an appropriate solution. Right now the two options on the table are do nothing or do this.
References
Unicode Glossary: http://www.unicode.org/glossary/
Plus perhaps the Mark Davis paper at: http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/
Copyright
This document has been placed in the public domain.
Good work, Paul!
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/
"M.-A. Lemburg" wrote:
...
Character
Used by itself, means the addressable units of a Python Unicode string.
Please add: also known as "code unit".
I'm not entirely comfortable with that. As you yourself pointed out, the same Python Unicode object can be interpreted as either a series of single-width code points *or* as a UTF-16 string where the characters are code units. You could also interpret it as a BASE64'd region or an XML document... It all depends on how you look at it.
....
Surrogate pair
Two physical characters that represent a single logical
Eeek... two code units (or have you ever seen a physical character walking around ;-)
No, that's sort of my point. The user can decide to adopt the convention of looking at the two characters as code units or they can ignore that interpretation and look at them as two code points. It's all relative, man. Dig it? That's why I use the word "convention" below:
character. Part of a convention for representing 32-bit code points in terms of two 16-bit code points.
"Surrogates are all in your head. Python doesn't know or care about them!"
I'll change this to:
Surrogate pair
Two Python Unicode characters that represent a single logical Unicode code point. Part of a convention for representing 32-bit code points in terms of two 16-bit code points. Python has limited support for reading, writing and constructing strings that use this convention (described below). Otherwise Python ignores the convention.
No need to pass this information to the codec: simply write a new one and give it a clear name, e.g. "ucs-2" will generate errors while "utf-16-le" converts them to surrogates.
That's a good point, but what if I want a UTF-8 codec that doesn't generate surrogates? Or even a UCS4 one?
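The separately named codec idea quoted above could look roughly like this. It is only a sketch in Python 3: "ucs-2" is not a codec that Python registers, encode_ucs2_strict is a hypothetical name, and little-endian byte order is an arbitrary choice made for the example.

```python
def encode_ucs2_strict(text):
    """Hypothetical "ucs-2" encoder: little-endian, errors on wide characters."""
    for pos, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            raise UnicodeEncodeError("ucs-2", text, pos, pos + 1,
                                     "character outside the BMP")
    # With no wide characters present, UTF-16 output contains no surrogate pairs.
    return text.encode("utf-16-le")


wide = chr(0x10400)
print(encode_ucs2_strict("ab"))     # two bytes per character
print(wide.encode("utf-16-le"))     # four bytes: a surrogate pair
try:
    encode_ucs2_strict(wide)
except UnicodeEncodeError as exc:
    print("rejected:", exc.reason)
```

The same name-per-policy approach would need one extra codec for each encoding that wants a strict variant, which is the concern raised in the reply.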
Plus perhaps the Mark Davis paper at:
http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/
Okay.
Copyright
This document has been placed in the public domain.
Good work, Paul !
Thanks for your help. You did help me to clarify many things even though I argued with you as I was doing it.
--
Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook
Paul Prescod wrote:
"M.-A. Lemburg" wrote:
...
Character
Used by itself, means the addressable units of a Python Unicode string.
Please add: also known as "code unit".
I'm not entirely comfortable with that. As you yourself pointed out, the same Python Unicode object can be interpreted as either a series of single-width code points *or* as a UTF-16 string where the characters are code units. You could also interpret it as a BASE64'd region or an XML document... It all depends on how you look at it.
Well, that's what code unit tries to capture too: it's the basic storage unit used by the implementation for storing characters. Never mind, it's just a detail...
....
Surrogate pair
Two physical characters that represent a single logical
Eeek... two code units (or have you ever seen a physical character walking around ;-)
No, that's sort of my point. The user can decide to adopt the convention of looking at the two characters as code units or they can ignore that interpretation and look at them as two code points. It's all relative, man. Dig it? That's why I use the word "convention" below:
Ok.
character. Part of a convention for representing 32-bit code points in terms of two 16-bit code points.
"Surrogates are all in your head. Python doesn't know or care about them!"
I'll change this to:
Surrogate pair
Two Python Unicode characters that represent a single logical Unicode code point. Part of a convention for representing 32-bit code points in terms of two 16-bit code points. Python has limited support for reading, writing and constructing strings that use this convention (described below). Otherwise Python ignores the convention.
Good.
No need to pass this information to the codec: simply write a new one and give it a clear name, e.g. "ucs-2" will generate errors while "utf-16-le" converts them to surrogates.
That's a good point, but what if I want a UTF-8 codec that doesn't generate surrogates? Or even a UCS4 one?
With Walter's patch for callback error handlers, you should be able to provide handlers which implement whatever you see fit. I think that codecs should work the same on all platforms and always apply the needed conversion for the platform in question; could be wrong though... it's really only a minor issue.
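For readers on a current Python, callback error handlers are available through codecs.register_error(). The sketch below assumes Python 3 and uses a made-up handler name, "mark-wide". Note that a handler runs only when the codec cannot encode a character, so the example uses the ASCII codec, where wide characters are unencodable; a UTF-8 codec would never call it for them.

```python
import codecs


def mark_wide(exc):
    """Replace unencodable text: wide characters get a visible marker."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    pieces = []
    for ch in bad:
        if ord(ch) > 0xFFFF:
            pieces.append("<U+%06X>" % ord(ch))   # wide character
        else:
            pieces.append("?")                    # other unencodable character
    # Return the replacement text and the position at which to resume encoding.
    return "".join(pieces), exc.end


# "mark-wide" is a hypothetical handler name chosen for this example.
codecs.register_error("mark-wide", mark_wide)

encoded = ("a" + chr(0xE9) + chr(0x10400)).encode("ascii", "mark-wide")
print(encoded)
```

A handler that raises instead of returning a replacement would give the "treat wide characters as errors" policy without a separate codec.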
Plus perhaps the Mark Davis paper at:
http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/
Okay.
Copyright
This document has been placed in the public domain.
Good work, Paul !
Thanks for your help. You did help me to clarify many things even though I argued with you as I was doing it.
Thank you for taking the suggestions into account.
--
Marc-Andre Lemburg
________________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/