[Fwd: PEP: Support for "wide" Unicode characters]
Slow python-dev day... consider this exciting new proposal to allow dealing with important new characters like the Japanese dentistry symbols and ecological symbols (but not Klingon)

-------- Original Message --------
Subject: PEP: Support for "wide" Unicode characters
Date: Thu, 28 Jun 2001 15:33:00 -0700
From: Paul Prescod <paulp@ActiveState.com>
Organization: ActiveState
To: "python-list@python.org" <python-list@python.org>

PEP: 261
Title: Support for "wide" Unicode characters
Version: $Revision: 1.3 $
Author: paulp@activestate.com (Paul Prescod)
Status: Draft
Type: Standards Track
Created: 27-Jun-2001
Python-Version: 2.2
Post-History: 27-Jun-2001, 28-Jun-2001

Abstract

Python 2.1 unicode characters can have ordinals only up to 2**16 - 1. These characters are known as Basic Multilingual Plane characters. There are now characters in Unicode that live on other "planes". The largest addressable character in Unicode has the ordinal 17 * 2**16 - 1 (0x10ffff). For readability, we will call this TOPCHAR and call characters in this range "wide characters".

Glossary

Character
Used by itself, means the addressable units of a Python Unicode string.

Code point
If you imagine Unicode as a mapping from integers to characters, each integer represents a code point. Some are really used for characters. Some will someday be used for characters. Some are guaranteed never to be used for characters.

Unicode character
A code point defined in the Unicode standard whether it is already assigned or not. Identified by an integer.

Code unit
An integer representing a character in some encoding.

Surrogate pair
Two code units that represent a single Unicode character.

Proposed Solution

One solution would be to merely increase the maximum ordinal to a larger value. Unfortunately the only straightforward implementation of this idea is to increase the character code unit to 4 bytes. This has the effect of doubling the size of most Unicode strings. In order to avoid imposing this cost on every user, Python 2.2 will allow 4-byte Unicode characters as a build-time option. Users can choose whether they care about wide characters or prefer to preserve memory.

The 4-byte option is called "wide Py_UNICODE". The 2-byte option is called "narrow Py_UNICODE".

Most things will behave identically in the wide and narrow worlds.

* unichr(i) for 0 <= i < 2**16 (0x10000) always returns a length-one string.

* unichr(i) for 2**16 <= i <= TOPCHAR will return a length-one string representing the character on wide Python builds. On narrow builds it will raise ValueError.

ISSUE: Python currently allows \U literals that cannot be represented as a single character. It generates two characters known as a "surrogate pair". Should this be disallowed on future narrow Python builds?

ISSUE: Should Python allow the construction of characters that do not correspond to Unicode characters? Unassigned Unicode characters should obviously be legal (because they could be assigned at any time). But code points above TOPCHAR are guaranteed never to be used by Unicode. Should we allow access to them anyhow?

* ord() is always the inverse of unichr()

* There is an integer value in the sys module that describes the largest ordinal for a Unicode character on the current interpreter. sys.maxunicode is 2**16-1 (0xffff) on narrow builds of Python and TOPCHAR on wide builds.

ISSUE: Should there be distinct constants for accessing TOPCHAR and the real upper bound for the domain of unichr (if they differ)? There has also been a suggestion of sys.unicodewidth which can take the values 'wide' and 'narrow'.

* codecs will be upgraded to support "wide characters" (represented directly in UCS-4, as surrogate pairs in UTF-16 and as multi-byte sequences in UTF-8). On narrow Python builds, the codecs will generate surrogate pairs; on wide Python builds they will generate a single character. This is the main part of the implementation left to be done.

* there are no restrictions on constructing strings that use code points "reserved for surrogates" improperly. These are called "isolated surrogates". The codecs should disallow reading these but you could construct them using string literals or unichr(). unichr() is not restricted to values less than either TOPCHAR or sys.maxunicode.

Implementation

There is a new (experimental) define:

    #define PY_UNICODE_SIZE 2

There are new configure options:

    --enable-unicode=ucs2   configures a narrow Py_UNICODE, and uses wchar_t if it fits
    --enable-unicode=ucs4   configures a wide Py_UNICODE, and uses wchar_t if it fits
    --enable-unicode        same as "=ucs2"

The intention is that --disable-unicode, or --enable-unicode=no removes the Unicode type altogether; this is not yet implemented.

Notes

This PEP does NOT imply that people using Unicode need to use a 4-byte encoding. It only allows them to do so. For example, ASCII is still a legitimate (7-bit) Unicode-encoding.

Rationale for Surrogate Creation Behaviour

Python currently supports the construction of a surrogate pair for a large unicode literal character escape sequence. This is basically designed as a simple way to construct "wide characters" even in a narrow Python build.

ISSUE: surrogates can be created this way but the user still needs to be careful about slicing, indexing, printing etc. Another option is to remove knowledge of surrogates from everything other than the codecs.

Rejected Suggestions

There were two primary solutions that were rejected. The first was more or less the status-quo. We could officially say that Python characters represent UTF-16 code units and require programmers to implement wide characters in their application logic. This is a heavy burden because emulating 32-bit characters is likely to be very inefficient if it is coded entirely in Python. Plus these abstracted pseudo-strings would not be legal as input to the regular expression engine.

The other class of solution is to use some efficient storage internally but present an abstraction of wide characters to the programmer. Any of these would require a much more complex implementation than the accepted solution. For instance consider the impact on the regular expression engine. In theory, we could move to this implementation in the future without breaking Python code. A future Python could "emulate" wide Python semantics on narrow Python.

Copyright

This document has been placed in the public domain.

--
http://mail.python.org/mailman/listinfo/python-list
Paul Prescod wrote:
Slow python-dev day... consider this exciting new proposal to allow dealing with important new characters like the Japanese dentistry symbols and ecological symbols (but not Klingon)
More comments...
-------- Original Message --------
Subject: PEP: Support for "wide" Unicode characters
Date: Thu, 28 Jun 2001 15:33:00 -0700
From: Paul Prescod <paulp@ActiveState.com>
Organization: ActiveState
To: "python-list@python.org" <python-list@python.org>

PEP: 261
Title: Support for "wide" Unicode characters
Version: $Revision: 1.3 $
Author: paulp@activestate.com (Paul Prescod)
Status: Draft
Type: Standards Track
Created: 27-Jun-2001
Python-Version: 2.2
Post-History: 27-Jun-2001, 28-Jun-2001
Abstract
Python 2.1 unicode characters can have ordinals only up to 2**16-1. These characters are known as Basic Multilingual Plane characters. There are now characters in Unicode that live on other "planes". The largest addressable character in Unicode has the ordinal 17 * 2**16 - 1 (0x10ffff). For readability, we will call this TOPCHAR and call characters in this range "wide characters".
Glossary
Character
Used by itself, means the addressable units of a Python Unicode string.
Code point
If you imagine Unicode as a mapping from integers to characters, each integer represents a code point. Some are really used for characters. Some will someday be used for characters. Some are guaranteed never to be used for characters.
Unicode character
A code point defined in the Unicode standard whether it is already assigned or not. Identified by an integer.
You're mixing terms here: being a character in Unicode is a property which is defined by the Unicode specs; not all code points are characters ! I'd suggest not to use the term character in this PEP at all; this is also what Mark Davis recommends in his paper on Unicode. That way people reading the PEP won't even start to confuse things since they will most likely have to read this glossary to understand what code points and code units are. Also, a link to the Unicode glossary would be a good thing.
Code unit
An integer representing a character in some encoding.
A code unit is the basic storage unit used by Unicode strings, e.g. u[0], not necessarily a character.
Surrogate pair
Two code units that represent a single Unicode character.
Please add an entry "Unicode string: A sequence of code units." and a note that on wide builds: code unit == code point.
Proposed Solution
One solution would be to merely increase the maximum ordinal to a larger value. Unfortunately the only straightforward implementation of this idea is to increase the character code unit to 4 bytes. This has the effect of doubling the size of most Unicode strings. In order to avoid imposing this cost on every user, Python 2.2 will allow 4-byte Unicode characters as a build-time option. Users can choose whether they care about wide characters or prefer to preserve memory.
The 4-byte option is called "wide Py_UNICODE". The 2-byte option is called "narrow Py_UNICODE".
Most things will behave identically in the wide and narrow worlds.
* unichr(i) for 0 <= i < 2**16 (0x10000) always returns a length-one string.
* unichr(i) for 2**16 <= i <= TOPCHAR will return a length-one string representing the character on wide Python builds. On narrow builds it will raise ValueError.
ISSUE: Python currently allows \U literals that cannot be represented as a single character. It generates two characters known as a "surrogate pair". Should this be disallowed on future narrow Python builds?
Why not make the codec used by Python to convert Unicode literals to Unicode strings an option just like the default encoding ? That way we could have a version of the unicode-escape codec which supports surrogates and one which doesn't.
ISSUE: Should Python allow the construction of characters that do not correspond to Unicode characters? Unassigned Unicode characters should obviously be legal (because they could be assigned at any time). But code points above TOPCHAR are guaranteed never to be used by Unicode. Should we allow access to them anyhow?
I wouldn't count on that last point ;-) Please note that you are mixing terms: you don't construct characters, you construct code points. Whether the concatenation of these code points makes a valid Unicode character string is an issue which applications and codecs have to decide.
* ord() is always the inverse of unichr()
* There is an integer value in the sys module that describes the largest ordinal for a Unicode character on the current interpreter. sys.maxunicode is 2**16-1 (0xffff) on narrow builds of Python and TOPCHAR on wide builds.
ISSUE: Should there be distinct constants for accessing TOPCHAR and the real upper bound for the domain of unichr (if they differ)? There has also been a suggestion of sys.unicodewidth which can take the values 'wide' and 'narrow'.
* codecs will be upgraded to support "wide characters" (represented directly in UCS-4, as surrogate pairs in UTF-16 and as multi-byte sequences in UTF-8). On narrow Python builds, the codecs will generate surrogate pairs; on wide Python builds they will generate a single character. This is the main part of the implementation left to be done.
* there are no restrictions on constructing strings that use code points "reserved for surrogates" improperly. These are called "isolated surrogates". The codecs should disallow reading these but you could construct them using string literals or unichr(). unichr() is not restricted to values less than either TOPCHAR or sys.maxunicode.
Implementation
There is a new (experimental) define:
#define PY_UNICODE_SIZE 2
There are new configure options:
--enable-unicode=ucs2   configures a narrow Py_UNICODE, and uses wchar_t if it fits
--enable-unicode=ucs4   configures a wide Py_UNICODE, and uses wchar_t if it fits
--enable-unicode        same as "=ucs2"
The intention is that --disable-unicode, or --enable-unicode=no removes the Unicode type altogether; this is not yet implemented.
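A minimal sketch, not part of the PEP, of how the narrow/wide split above looks from Python code; it assumes only sys.maxunicode, since sys.unicodewidth is merely a suggestion at this point:

    import sys

    def unicode_width():
        # sys.maxunicode is 0xffff on narrow builds and TOPCHAR (0x10ffff)
        # on wide builds.
        if sys.maxunicode == 0xFFFF:
            return 'narrow'
        return 'wide'

    # unichr() beyond the BMP only works on wide builds:
    try:
        c = unichr(0x10000)        # length-one string on a wide build
    except ValueError:
        c = None                   # narrow builds refuse ordinals >= 0x10000

    # A \U literal above the BMP is one code unit on a wide build and a
    # surrogate pair (two code units) on a narrow build:
    print unicode_width(), len(u"\U00010000")    # prints "wide 1" or "narrow 2"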
Notes
This PEP does NOT imply that people using Unicode need to use a 4-byte encoding. It only allows them to do so. For example, ASCII is still a legitimate (7-bit) Unicode-encoding.
Rationale for Surrogate Creation Behaviour
Python currently supports the construction of a surrogate pair for a large unicode literal character escape sequence. This is basically designed as a simple way to construct "wide characters" even in a narrow Python build.
ISSUE: surrogates can be created this way but the user still needs to be careful about slicing, indexing, printing etc. Another option is to remove knowledge of surrogates from everything other than the codecs.
+1 on removing knowledge about surrogates from the Unicode implementation core (it's also the easiest: there is none :-) We should provide a new module which provides a few handy utilities though: functions which provide code point-, character-, word- and line- based indexing into Unicode strings.
Rejected Suggestions
There were two primary solutions that were rejected. The first was more or less the status-quo. We could officially say that Python characters represent UTF-16 code units and require programmers to implement wide characters in their application logic. This is a heavy burden because emulating 32-bit characters is likely to be very inefficient if it is coded entirely in Python. Plus these abstracted pseudo-strings would not be legal as input to the regular expression engine.
The other class of solution is to use some efficient storage internally but present an abstraction of wide characters to the programmer. Any of these would require a much more complex implementation than the accepted solution. For instance consider the impact on the regular expression engine. In theory, we could move to this implementation in the future without breaking Python code. A future Python could "emulate" wide Python semantics on narrow Python.
Copyright
This document has been placed in the public domain.
-- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/
I'd suggest not to use the term character in this PEP at all; this is also what Mark Davis recommends in his paper on Unicode.
I like this idea! I know that I *still* have a hard time not to think "C 'char' datatype, i.e. an 8-bit byte" when I read "character"...
Why not make the codec used by Python to convert Unicode literals to Unicode strings an option just like the default encoding ?
That way we could have a version of the unicode-escape codec which supports surrogates and one which doesn't.
Smart idea, but how practical is this? Can you spec this out a bit more?
+1 on removing knowledge about surrogates from the Unicode implementation core (it's also the easiest: there is none :-)
Except for \U currently -- or is that not part of the implementation core?
We should provide a new module which provides a few handy utilities though: functions which provide code point-, character-, word- and line- based indexing into Unicode strings.
But its design is outside the scope of this PEP, I'd say. --Guido van Rossum (home page: http://www.python.org/~guido/)
"M.-A. Lemburg" wrote:
...
I'd suggest not to use the term character in this PEP at all; this is also what Mark Davis recommends in his paper on Unicode.
That's fine, but Python does have a concept of character and I'm going to use the term character for discussing these.
Also, a link to the Unicode glossary would be a good thing.
Funny how these little PEPs grow...
... Why not make the codec used by Python to convert Unicode literals to Unicode strings an option just like the default encoding ?
That way we could have a version of the unicode-escape codec which supports surrogates and one which doesn't.
Adding more and more knobs to tweak just adds up to Python code being non-portable from one machine to another.
ISSUE: Should Python allow the construction of characters that do not correspond to Unicode characters? Unassigned Unicode characters should obviously be legal (because they could be assigned at any time). But code points above TOPCHAR are guaranteed never to be used by Unicode. Should we allow access to them anyhow?
I wouldn't count on that last point ;-)
Please note that you are mixing terms: you don't construct characters, you construct code points. Whether the concatenation of these code points makes a valid Unicode character string is an issue which applications and codecs have to decide.
unichr() does not construct code points. It constructs 1-char Python Unicode strings...also known as Python Unicode characters.
... Whether the concatenation of these code points makes a valid Unicode character string is an issue which applications and codecs have to decide.
The concatenation of true code points would *always* make a valid Unicode string, right? It's code units that cannot be blindly concatenated.
... We should provide a new module which provides a few handy utilities though: functions which provide code point-, character-, word- and line- based indexing into Unicode strings.
Okay, I'll add:

    It has been proposed that there should be a module for working with UTF-16 strings in narrow Python builds through some sort of abstraction that handles surrogates for you. If someone wants to implement that, it will be another PEP.

-- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook
Paul Prescod wrote:
"M.-A. Lemburg" wrote:
...
I'd suggest not to use the term character in this PEP at all; this is also what Mark Davis recommends in his paper on Unicode.
That's fine, but Python does have a concept of character and I'm going to use the term character for discussing these.
The term "character" in Python should really only be used for the 8-bit strings. In Unicode a "character" can mean any of: """ Unfortunately the term character is vastly overloaded. At various times people can use it to mean any of these things: - An image on paper (glyph) - What an end-user thinks of as a character (grapheme) - What a character encoding standard encodes (code point) - A memory storage unit in a character encoding (code unit) Because of this, ironically, it is best to avoid the use of the term character entirely when discussing character encodings, and stick to the term code point. """ Taken from Mark Davis' paper: http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/
Also, a link to the Unicode glossary would be a good thing.
Funny how these little PEPs grow...
Is that a problem ? The Unicode glossary is very useful in providing a common base for understanding the different terms and tries very hard to avoid ambiguity in meaning. This discussion is partly caused by exactly these different understandings of the terms used in the PEP. I will update the Unicode PEP to the Unicode terminology too.
... Why not make the codec used by Python to convert Unicode literals to Unicode strings an option just like the default encoding ?
That way we could have a version of the unicode-escape codec which supports surrogates and one which doesn't.
Adding more and more knobs to tweak just adds up to Python code being non-portable from one machine to another.
Not necessarily so; I'll write a more precise spec next week. The idea is to put the codec information into the Python source code, so that it is bound to the literals and the source code stays portable across platforms. Currently this is just an idea and I still have to check how far this can go...
ISSUE: Should Python allow the construction of characters that do not correspond to Unicode characters? Unassigned Unicode characters should obviously be legal (because they could be assigned at any time). But code points above TOPCHAR are guaranteed never to be used by Unicode. Should we allow access to them anyhow?
I wouldn't count on that last point ;-)
Please note that you are mixing terms: you don't construct characters, you construct code points. Whether the concatenation of these code points makes a valid Unicode character string is an issue which applications and codecs have to decide.
unichr() does not construct code points. It constructs 1-char Python Unicode strings...also known as Python Unicode characters.
... Whether the concatenation of these code points makes a valid Unicode character string is an issue which applications and codecs have to decide.
The concatenation of true code points would *always* make a valid Unicode string, right? It's code units that cannot be blindly concatenated.
Both wrong :-) U+D800 is a valid Unicode code point and can occur as code unit in both narrow and wide builds. Concatenating this with e.g. U+0020 will still make it a valid Unicode code point sequence (aka Unicode object), but not a valid Unicode character string (since the U+D800 is not a character). The same is true for e.g. U+FFFF. Note that the Unicode type should happily store these values, while the codecs complain. As a result and like I said above, dealing with these problems is left to the applications which use these Unicode objects.
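For illustration only, here is the kind of application-level check being alluded to; is_character_string() is a made-up helper, and on a narrow build it would also flag legitimate surrogate pairs, so treat it as a sketch of the distinction rather than a usable validator:

    def is_character_string(u):
        # Code points in the surrogate range (U+D800..U+DFFF) and the
        # non-characters U+FFFE/U+FFFF are storable in a Unicode object
        # but are not characters.
        for c in u:
            o = ord(c)
            if 0xD800 <= o <= 0xDFFF or o in (0xFFFE, 0xFFFF):
                return 0
        return 1

    is_character_string(unichr(0xD800) + u" ")   # 0: a valid code point sequence,
                                                 # but not a character string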
... We should provide a new module which provides a few handy utilities though: functions which provide code point-, character-, word- and line- based indexing into Unicode strings.
Okay, I'll add:
It has been proposed that there should be a module for working with UTF-16 strings in narrow Python builds through some sort of abstraction that handles surrogates for you. If someone wants to implement that, it will be another PEP.
Uhm, narrow builds don't support UTF-16... it's UCS-2 which is supported (basically: store everything in range(0x10000)); the codecs can map code points to surrogates, but it is solely their responsibility and the responsibility of the application using them to take care of dealing with surrogates.

Also, the module will be useful for both narrow and wide builds, since the notion of an encoded character can involve multiple code points. In that sense Unicode is always a variable length encoding for characters and that's the application field of this module.

Here's the adjusted text:

    It has been proposed that there should be a module for working with Unicode objects using character-, word- and line- based indexing. The details of the implementation are left to another PEP.

-- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/
"M.-A. Lemburg" wrote:
...
The term "character" in Python should really only be used for the 8-bit strings.
Are we going to change chr() and unichr() to one_element_string() and unicode_one_element_string()?

u[i] is a character. If u is Unicode, then u[i] is a Python Unicode character. No Python user will find that confusing no matter how Unicode knuckle-dragging, mouth-breathing, wife-by-hair-dragging they are.
In Unicode a "character" can mean any of:
Mark Davis said that "people" can use the word to mean any of those things. He did not say that it was imprecisely defined in Unicode. Nevertheless I'm not using the Unicode definition anymore than our standard library uses an ancient Greek definition of integer. Python has a concept of integer and a concept of character.
It has been proposed that there should be a module for working with UTF-16 strings in narrow Python builds through some sort of abstraction that handles surrogates for you. If someone wants to implement that, it will be another PEP.
Uhm, narrow builds don't support UTF-16... it's UCS-2 which is supported (basically: store everything in range(0x10000)); the codecs can map code points to surrogates, but it is solely their responsibility and the responsibility of the application using them to take care of dealing with surrogates.
The user can view the data as UCS-2, UTF-16, Base64, ROT-13, XML, .... Just as we have a base64 module, we could have a UTF-16 module that interprets the data in the string as UTF-16 and does surrogate manipulation for you. Anyhow, if any of those is the "real" encoding of the data, it is UTF-16. After all, if the codec reads in four non-BMP characters in, let's say, UTF-8, we represent them as 8 narrow-build Python characters. That's the definition of UTF-16! But it's easy enough for me to take that word out so I will.
... Also, the module will be useful for both narrow and wide builds, since the notion of an encoded character can involve multiple code points. In that sense Unicode is always a variable length encoding for characters and that's the application field of this module.
I wouldn't advise that you do all different types of normalization in a single module but I'll wait for your PEP.
Here's the adjusted text:
It has been proposed that there should be a module for working with Unicode objects using character-, word- and line- based indexing. The details of the implementation are left to another PEP.
It has been proposed that there should be a module that handles surrogates in narrow Python builds for programmers. If someone wants to implement that, it will be another PEP. It might also be combined with features that allow other kinds of character-, word- and line- based indexing. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook
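As a rough sketch of what such a module might offer (the helper below and its name are invented for illustration, not part of any proposal), surrogate-aware code point extraction on a narrow build could look like this; on a wide build it degenerates to ord() of each code unit:

    def code_points(u):
        # Return the integer code points of u, combining a high surrogate
        # (U+D800..U+DBFF) followed by a low surrogate (U+DC00..U+DFFF)
        # into a single value; isolated surrogates pass through unchanged.
        result = []
        i, n = 0, len(u)
        while i < n:
            hi = ord(u[i])
            if 0xD800 <= hi <= 0xDBFF and i + 1 < n:
                lo = ord(u[i + 1])
                if 0xDC00 <= lo <= 0xDFFF:
                    result.append(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
                    i = i + 2
                    continue
            result.append(hi)
            i = i + 1
        return result

    code_points(u"a\U00010000b")   # [0x61, 0x10000, 0x62] on narrow and wide builds alike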
Paul Prescod: <PEP: 261> The problem I have with this PEP is that it is a compile time option which makes it hard to work with both 32 bit and 16 bit strings in one program. Can not the 32 bit string type be introduced as an additional type?
Are we going to change chr() and unichr() to one_element_string() and unicode_one_element_string()
u[i] is a character. If u is Unicode, then u[i] is a Python Unicode character.
This wasn't usefully true in the past for DBCS strings and is not the right way to think of either narrow or wide strings now. The idea that strings are arrays of characters gets in the way of dealing with many encodings and is the primary difficulty in localising software for Japanese. Iteration through the code units in a string is a problem waiting to bite you and string APIs should encourage behaviour which is correct when faced with variable width characters, both DBCS and UTF style. Iteration over variable width characters should be performed in a way that preserves the integrity of the characters. M.-A. Lemburg's proposed set of iterators could be extended to indicate encoding "for c in s.asCharacters('utf-8')" and to provide for the various intended string uses such as "for c in s.inVisualOrder()" reversing the receipt of right-to-left substrings. Neil
<PEP: 261>
The problem I have with this PEP is that it is a compile time option which makes it hard to work with both 32 bit and 16 bit strings in one program. Can not the 32 bit string type be introduced as an additional type?
Not without an outrageous amount of additional coding (every place in the code that currently uses PyUnicode_Check() would have to be bifurcated in a 16-bit and a 32-bit variant). I doubt that the desire to work with both 16- and 32-bit characters in one program is typical for folks using Unicode -- that's mostly limited to folks writing conversion tools. Python will offer the necessary codecs so you shouldn't have this need very often. You can use the array module to manipulate 16- and 32-bit arrays, and you can use the various Unicode encodings to do the necessary encodings.
u[i] is a character. If u is Unicode, then u[i] is a Python Unicode character.
This wasn't usefully true in the past for DBCS strings and is not the right way to think of either narrow or wide strings now. The idea that strings are arrays of characters gets in the way of dealing with many encodings and is the primary difficulty in localising software for Japanese.
Can you explain the kind of problems encountered in some more detail?
Iteration through the code units in a string is a problem waiting to bite you and string APIs should encourage behaviour which is correct when faced with variable width characters, both DBCS and UTF style.
But this is not the Unicode philosophy. All the variable-length character manipulation is supposed to be taken care of by the codecs, and then the application can deal in arrays of characters. Alternatively, the application can deal in opaque objects representing variable-length encodings, but then it should be very careful with concatenation and even more so with slicing.
Iteration over variable width characters should be performed in a way that preserves the integrity of the characters. M.-A. Lemburg's proposed set of iterators could be extended to indicate encoding "for c in s.asCharacters('utf-8')" and to provide for the various intended string uses such as "for c in s.inVisualOrder()" reversing the receipt of right-to-left substrings.
I think it's a good idea to provide a set of higher-level tools as well. However nobody seems to know what these higher-level tools should do yet. PEP 261 is specifically focused on getting the lower-level foundations right (i.e. the objects that represent arrays of code units), so that the authors of higher level tools will have a solid base. If you want to help author a PEP for such higher-level tools, you're welcome! --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
<PEP: 261>
The problem I have with this PEP is that it is a compile time option which makes it hard to work with both 32 bit and 16 bit strings in one program. Can not the 32 bit string type be introduced as an additional type?
Not without an outrageous amount of additional coding (every place in the code that currently uses PyUnicode_Check() would have to be bifurcated in a 16-bit and a 32-bit variant).
Alternatively, a Unicode object could *internally* be either 8, 16 or 32 bits wide (to be clear: not per character, but per string). Also a lot of work, but it'll be a lot less wasteful.
I doubt that the desire to work with both 16- and 32-bit characters in one program is typical for folks using Unicode -- that's mostly limited to folks writing conversion tools. Python will offer the necessary codecs so you shouldn't have this need very often.
Not a lot of people will want to work with 16 or 32 bit chars directly, but I think a less wasteful solution to the surrogate pair problem *will* be desired by people. Why use 32 bits for all strings in a program when only a tiny percentage actually *needs* more than 16? (Or even 8...)
Iteration through the code units in a string is a problem waiting to bite you and string APIs should encourage behaviour which is correct when faced with variable width characters, both DBCS and UTF style.
But this is not the Unicode philosophy. All the variable-length character manipulation is supposed to be taken care of by the codecs, and then the application can deal in arrays of characters.
Right: this is the way it should be. My difficulty with PEP 261 is that I'm afraid few people will actually enable 32-bit support (*what*?! all unicode strings become 32 bits wide? no way!), therefore making programs non-portable in very subtle ways. Just
Just van Rossum wrote:
Guido van Rossum wrote:
<PEP: 261>
The problem I have with this PEP is that it is a compile time option which makes it hard to work with both 32 bit and 16 bit strings in one program. Can not the 32 bit string type be introduced as an additional type?
Not without an outrageous amount of additional coding (every place in the code that currently uses PyUnicode_Check() would have to be bifurcated in a 16-bit and a 32-bit variant).
Alternatively, a Unicode object could *internally* be either 8, 16 or 32 bits wide (to be clear: not per character, but per string). Also a lot of work, but it'll be a lot less wasteful.
I hope this is where we end up one day. But the compile-time option is better than where we are today. Even though PEP 261 is not my favorite solution, it buys us a couple of years of wait-and-see time. Consider that computer memory is growing much faster than textual data. People's text processing techniques get more and more "wasteful" because it is now almost always possible to load the entire "text" into memory at once. I remember how some text editors used to boast that they only loaded your text "on demand". Maybe so much data will be passed to us from UCS-4 APIs that trying to "compress it" will actually be inefficient. Maybe two years from now Guido will make UCS-4 the default and only a tiny minority will notice or care.
... My difficulty with PEP 261 is that I'm afraid few people will actually enable 32-bit support (*what*?! all unicode strings become 32 bits wide? no way!), therefore making programs non-portable in very subtle ways.
It really depends on what the default build option is. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook
[Paul Prescod]
... Consider that computer memory is growing much faster than textual data. People's text processing techniques get more and more "wasteful" because it is now almost always possible to load the entire "text" into memory at once.
Indeed, the entire text of the Bible fits in a corner of my year-old box's RAM, even at 32 bits per character.
I remember how some text editors used to boast that they only loaded your text "on demand".
Well, they still do -- fancy editors use fancy data structures, so that, e.g., inserting characters at the start of the file doesn't cause a 50Mb memmove each time. Response time is still important, but I'd wager relatively insensitive to basic character size (you need tricks that cut factors of 1000s off potential worst cases to give the appearance of instantaneous results; a factor of 2 or 4 is in the noise compared to what's needed regardless).
Tim Peters wrote:
...
I remember how some text editors used to boast that they only loaded your text "on demand".
Well, they still do -- fancy editors use fancy data structures, so that, e.g., inserting characters at the start of the file doesn't cause a 50Mb memmove each time.
Yes, but most modern text editors take O(n) time to open the file. There was a time when the more advanced ones did not. Or maybe that was just SGML editors...I can't remember. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook
Tim Peters:
Well, they still do -- fancy editors use fancy data structures, so that, e.g., inserting characters at the start of the file doesn't cause a 50Mb memmove each time. Response time is still important, but I'd wager relatively insensitive to basic character size (you need tricks that cut factors of 1000s off potential worst cases to give the appearance of instantaneous results; a factor of 2 or 4 is in the noise compared to what's needed regardless).
I actually have some numbers here. Early versions of some new editor buffer code used UCS-2 on .NET and the JVM. Moving to an 8 bit buffer saved 10-20% of execution time on the insert string, delete string and global replace benchmarks using strings that fit into ASCII. These buffers did have some other overhead for line management and other features but I expect these did not affect the proportions much. Neil
Alternatively, a Unicode object could *internally* be either 8, 16 or 32 bits wide (to be clear: not per character, but per string). Also a lot of work, but it'll be a lot less wasteful.
Depending on what you prefer to waste: developers' time or computer resources. I bet that if you try the measure the wasted space you'll find that it wastes very little compared to all the other overheads in a typical Python program: CPU time compared to writing your code in C, memory overhead for integers, etc. It so happened that the Unicode support was written to make it very easy to change the compile-time code unit size; but making this a per-string (or even global) run-time variable is much harder without touching almost every place that uses Unicode (not to mention slowing down the common case). Nobody was enthusiastic about fixing this, so our choice was really between staying with 16 bits or making 32 bits an option for those who need it.
Not a lot of people will want to work with 16 or 32 bit chars directly,
How do you know? There are more Chinese than Americans and Europeans together, and they will soon all have computers. :-)
but I think a less wasteful solution to the surrogate pair problem *will* be desired by people. Why use 32 bits for all strings in a program when only a tiny percentage actually *needs* more than 16? (Or even 8...)
So work in UTF-8 -- a lot of work can be done in UTF-8.
But this is not the Unicode philosophy. All the variable-length character manipulation is supposed to be taken care of by the codecs, and then the application can deal in arrays of characters.
Right: this is the way it should be.
My difficulty with PEP 261 is that I'm afraid few people will actually enable 32-bit support (*what*?! all unicode strings become 32 bits wide? no way!), therefore making programs non-portable in very subtle ways.
My hope and expectation is that those folks who need 32-bit support will enable it. If this solution is not sufficient, we may have to provide something else in the future, but given that the implementation effort for PEP 261 was very minimal (certainly less than the time expended in discussing it) I am very happy with it.

It will take quite a while until lots of folks will need the 32-bit support (there aren't that many characters defined outside the basic plane yet). In the meantime, those that need 32-bit support should be happy that we allow them to rebuild Python with 32-bit support. In the next 5-10 years, the 32-bit support requirement will become more common -- as will be the memory upgrades to make it painless.

It's not like Python is making this decision in a vacuum either: Linux already has 32-bit wchar_t. 32-bit characters will eventually be common (even in Windows, which probably has the largest investment in 16-bit Unicode at the moment of any system). Like IPv6, we're trying to enable uncommon uses of Python without breaking things for the not-so-early adopters.

Again, don't see PEP 261 as the ultimate answer to all your 32-bit Unicode questions. Just consider that realistically we have two choices: stick with 16-bit support only or make 32-bit support an option. Other approaches (more surrogate support, run-time choices, transparent variable-length encodings) simply aren't realistic -- no-one has the time to code them.

It should be easy to write portable Python programs that work correctly with 16-bit Unicode characters on a "narrow" interpreter and also work correctly with 21-bit Unicode on a "wide" interpreter: just avoid using surrogates. If you *need* to work with surrogates, try to limit yourself to very simple operations like concatenations of valid strings, and splitting strings at known delimiters only. There's a lot you can do with this.

--Guido van Rossum (home page: http://www.python.org/~guido/)
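A small illustration of the "split at known delimiters" advice (an example added here, with an arbitrary non-BMP character): surrogate code units all lie in U+D800..U+DFFF, so they can never collide with an ASCII delimiter, and split() leaves pairs intact even though the program never looks inside them:

    record = u"\U00010332,plain BMP text"    # one non-BMP character, then ASCII
    fields = record.split(u",")

    len(fields)      # 2 on both builds
    len(fields[0])   # 2 code units (a surrogate pair) on a narrow build, 1 on a wide build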
It so happened that the Unicode support was written to make it very easy to change the compile-time code unit size
What about extension modules that deal with Unicode strings? Will they have to be recompiled too? If so, is there anything to detect an attempt to import an extension module with an incompatible Unicode character width? Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+
Greg Ewing wrote:
It so happened that the Unicode support was written to make it very easy to change the compile-time code unit size
What about extension modules that deal with Unicode strings? Will they have to be recompiled too? If so, is there anything to detect an attempt to import an extension module with an incompatible Unicode character width?
That's a good question ! The answer is: yes, extensions which use Unicode will have to be recompiled for narrow and wide builds of Python. The question is however, how to detect cases where the user imports an extension built for narrow Python into a wide build and vice versa. The standard way of looking at the API level won't help. We'd need some form of introspection API at the C level... hmm, perhaps looking at the sys module will do the trick for us ?! In any case, this is certainly going to cause trouble one of these days... -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/
Greg Ewing wrote:
It so happened that the Unicode support was written to make it very easy to change the compile-time code unit size
What about extension modules that deal with Unicode strings? Will they have to be recompiled too? If so, is there anything to detect an attempt to import an extension module with an incompatible Unicode character width?
That's a good question !
The answer is: yes, extensions which use Unicode will have to be recompiled for narrow and wide builds of Python. The question is however, how to detect cases where the user imports an extension built for narrow Python into a wide build and vice versa.
The standard way of looking at the API level won't help. We'd need some form of introspection API at the C level... hmm, perhaps looking at the sys module will do the trick for us ?!
In any case, this is certainly going to cause trouble one of these days...
Here are some alternative ways to deal with this: (1) Use the preprocessor to rename all the Unicode APIs to get "Wide" appended to their name in wide mode. This makes any use of a Unicode API in an extension compiled for the wrong Py_UNICODE_SIZE fail with a link-time error. (Which should cause an ImportError for shared libraries.) (2) Ditto but only rename the PyModule_Init function. This is much less work but more coarse: a module that doesn't use any Unicode APIs (and I expect these will be a large majority) still would not be accepted. (3) Change the interpretation of PYTHON_API_VERSION so that a low bit of '1' means wide Unicode. Then you only get a warning (followed by a core dump when actually trying to use Unicode). I mentioned (1) and (3) in an earlier post. --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
Greg Ewing wrote:
It so happened that the Unicode support was written to make it very easy to change the compile-time code unit size
What about extension modules that deal with Unicode strings? Will they have to be recompiled too? If so, is there anything to detect an attempt to import an extension module with an incompatible Unicode character width?
That's a good question !
The answer is: yes, extensions which use Unicode will have to be recompiled for narrow and wide builds of Python. The question is however, how to detect cases where the user imports an extension built for narrow Python into a wide build and vice versa.
The standard way of looking at the API level won't help. We'd need some form of introspection API at the C level... hmm, perhaps looking at the sys module will do the trick for us ?!
In any case, this is certainly going to cause trouble one of these days...
Here are some alternative ways to deal with this:
(1) Use the preprocessor to rename all the Unicode APIs to get "Wide" appended to their name in wide mode. This makes any use of a Unicode API in an extension compiled for the wrong Py_UNICODE_SIZE fail with a link-time error. (Which should cause an ImportError for shared libraries.)
(2) Ditto but only rename the PyModule_Init function. This is much less work but more coarse: a module that doesn't use any Unicode APIs (and I expect these will be a large majority) still would not be accepted.
(3) Change the interpretation of PYTHON_API_VERSION so that a low bit of '1' means wide Unicode. Then you only get a warning (followed by a core dump when actually trying to use Unicode).
I mentioned (1) and (3) in an earlier post.
(4) Add a feature flag to PyModule_Init() which then looks up the features in the sys module and uses this as a basis for processing the import request. In this case, I think that (5) would be the best solution, since old code will notice the change in width too. -- Marc-Andre Lemburg ________________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
"M.-A. Lemburg" wrote:
...
(4) Add a feature flag to PyModule_Init() which then looks up the features in the sys module and uses this as a basis for processing the import request.
Could an extension be carefully written so that a single binary could be compatible with both types of Python build? I'm thinking that it would pass data buffers with the "right width" based on checking a runtime flag... -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook
Paul Prescod wrote:
Could an extension be carefully written so that a single binary could be compatible with both types of Python build? I'm thinking that it would pass data buffers with the "right width" based on checking a runtime flag...
But then it would also be compatible with a unicode object using different internal storage units per string, so I'm sure this is a dead end ;-) Just
Just van Rossum wrote:
Paul Prescod wrote:
Could an extension be carefully written so that a single binary could be compatible with both types of Python build? I'm thinking that it would pass data buffers with the "right width" based on checking a runtime flag...
But then it would also be compatible with a unicode object using different internal storage units per string, so I'm sure this is a dead end ;-)
Agreed :-) Extension writers will have to provide two versions of the binary. -- Marc-Andre Lemburg ________________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
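One way an extension author might package the two versions is to ship both binaries and choose at import time; this is only a sketch, and the module names _spam_ucs2/_spam_ucs4 are invented for the example:

    # Hypothetical package __init__.py
    import sys

    if sys.maxunicode == 0xFFFF:
        from _spam_ucs2 import *    # binary compiled against a narrow Python
    else:
        from _spam_ucs4 import *    # binary compiled against a wide Python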
Just van Rossum <just@letterror.com>:
My difficulty with PEP 261 is that I'm afraid few people will actually enable 32-bit support (*what*?! all unicode strings become 32 bits wide? no way!), therefore making programs non-portable in very subtle ways.
I agree. This can only be a stopgap measure. Ultimately the Unicode type needs to be made smarter. Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+
greg wrote:
I agree. This can only be a stopgap measure. Ultimately the Unicode type needs to be made smarter.
PIL uses 8 bits per pixel to store bilevel images, and 32 bits per pixel to store 16- and 24-bit images. back in 1995, some people claimed that the image type had to be made smarter to be usable. these days, nobody ever notices... </F>
Fredrik Lundh <fredrik@pythonware.com>:
back in 1995, some people claimed that the image type had to be made smarter to be usable.
But at least you can use more than one depth of image in the same program... Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+
Guido van Rossum:
This wasn't usefully true in the past for DBCS strings and is not the right way to think of either narrow or wide strings now. The idea that strings are arrays of characters gets in the way of dealing with many encodings and is the primary difficulty in localising software for Japanese.
Can you explain the kind of problems encountered in some more detail?
Programmers used to working with character == indexable code unit will often split double-wide characters when performing an action. For example, searching for a particular double-byte character "bc" may match "abcd" incorrectly, where "ab" and "cd" are the characters. DBCS is not normally self-synchronising although UTF-8 is. Another common problem is counting characters, for example when filling a line, hitting the line width and forcing half a character onto the next line.
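Using plain ASCII bytes as stand-ins for the double-byte characters (a toy model, not a real DBCS encoding), the false match looks like this:

    s = "abcd"       # two double-byte characters: "ab" and "cd"
    s.find("bc")     # 1 -- a hit that splits both characters
    # A DBCS-aware search must accept only matches starting on a character
    # boundary (offsets 0 and 2 here), so the hit at offset 1 must be rejected.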
I think it's a good idea to provide a set of higher-level tools as well. However nobody seems to know what these higher-level tools should do yet. PEP 261 is specifically focused on getting the lower-level foundations right (i.e. the objects that represent arrays of code units), so that the authors of higher level tools will have a solid base. If you want to help author a PEP for such higher-level tools, you're welcome!
Its more likely I'll publish some of the low level pieces of Scintilla/SinkWorld as a Python extension providing some of these facilities in an editable-text class. Then we can see if anyone else finds the code worthwhile. Neil
Neil Hodgson wrote:
Paul Prescod: <PEP: 261>
The problem I have with this PEP is that it is a compile time option which makes it hard to work with both 32 bit and 16 bit strings in one program. Can not the 32 bit string type be introduced as an additional type?
The two solutions are not mutually exclusive. If you (or someone) supplies a 32-bit type and Guido accepts it, then the compile option might fall into disuse. But this solution was chosen because it is much less work. Really though, I think that having 16-bit and 32-bit types is extra confusion for very little gain. I would much rather have a single space-efficient type that hid the details of its implementation. But nobody has volunteered to code it and Guido might not accept it even if someone did.
... This wasn't usefully true in the past for DBCS strings and is not the right way to think of either narrow or wide strings now. The idea that strings are arrays of characters gets in the way of dealing with many encodings and is the primary difficulty in localising software for Japanese.
The whole benefit of moving to 32-bit character strings is to allow people to think of strings as arrays of characters. Forcing them to consider variable-length encodings is precisely what we are trying to avoid.
Iteration through the code units in a string is a problem waiting to bite you and string APIs should encourage behaviour which is correct when faced with variable width characters, both DBCS and UTF style. Iteration over variable width characters should be performed in a way that preserves the integrity of the characters.
On wide Python builds there is no such thing as variable width Unicode characters. It doesn't make sense to combine two 32-bit characters to get a 64-bit one. On narrow Python builds you might want to treat a surrogate pair as a single character but I would strongly advise against it. If you want wide characters, move to a wide build. Even if a narrow build is more space efficient, you'll lose a ton of performance emulating wide characters in Python code.
... M.-A. Lemburg's proposed set of iterators could be extended to indicate encoding "for c in s.asCharacters('utf-8')" and to provide for the various intended string uses such as "for c in s.inVisualOrder()" reversing the receipt of right-to-left substrings.
A floor wax and a dessert topping. <0.5 wink> I don't think that the average Python programmer would want s.asCharacters('utf-8') when they already have s.decode('utf-8'). We decided a long time ago that the model for standard users would be fixed-length (1!), abstract characters. That's the way Python's Unicode subsystem has always worked. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook
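To make the codec point concrete (an illustrative example, not from the thread): a UTF-8 byte string is decoded once at the input boundary, after which the program deals only in fixed-size units, and it is re-encoded only on output:

    data = '\xe2\x82\xac and more'      # UTF-8 bytes: U+20AC EURO SIGN plus ASCII
    text = data.decode('utf-8')         # variable-length decoding happens here
    len(text)                           # 10 -- one unit per BMP character from now on
    text.encode('utf-8') == data        # 1 (true): encode again only at the output boundary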
Paul Prescod wrote:
On wide Python builds there is no such thing as variable width Unicode characters. It doesn't make sense to combine two 32-bit characters to get a 64-bit one. On narrow Python builds you might want to treat a surrogate pair as a single character but I would strongly advise against it. If you want wide characters, move to a wide build. Even if a narrow build is more space efficient, you'll lose a ton of performance emulating wide characters in Python code.
This needn't go into the PEP, I think, but I'd like you to say something about what you expect the end result of this PEP to look like under Windows, where "rebuild" isn't really a valid option for most Python users. Are we simply committing to make two builds available? If so, what happens the next time we run into a situation like this? -- --- Aahz (@pobox.com) Hugs and backrubs -- I break Rule 6 <*> http://www.rahul.net/aahz/ Androgynous poly kinky vanilla queer het Pythonista I don't really mind a person having the last whine, but I do mind someone else having the last self-righteous whine.
This needn't go into the PEP, I think, but I'd like you to say something about what you expect the end result of this PEP to look like under Windows, where "rebuild" isn't really a valid option for most Python users. Are we simply committing to make two builds available? If so, what happens the next time we run into a situation like this?
I imagine that we will pick a choice (I expect it'll be UCS2) and make only that build available, until there are loud enough cries from folks who have a reasonable excuse not to have a copy of VCC around. Given that the rest of Windows uses 16-bit Unicode, I think we'll be able to get away with this for quite a while. --Guido van Rossum (home page: http://www.python.org/~guido/)
Aahz Maruch wrote:
...
This needn't go into the PEP, I think, but I'd like you to say something about what you expect the end result of this PEP to look like under Windows, where "rebuild" isn't really a valid option for most Python users. Are we simply committing to make two builds available? If so, what happens the next time we run into a situation like this?
Windows itself is strongly biased towards 16-bit characters. Therefore I expect that to be the default for a while. Then I expect Guido to announce that 32-bit characters are the new default with version 3000 (perhaps right after Windows 3000 ships) and we'll all change. So most Windows users will not be able to work with 32-bit characters for a while. But since Windows itself doesn't like those characters, they probably won't run into them much. I strongly doubt that we'll ever make two builds available because it would cause a mess of extension module incompatibilities. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook
Neil Hodgson wrote:
u[i] is a character. If u is Unicode, then u[i] is a Python Unicode character.
This wasn't usefully true in the past for DBCS strings and is not the right way to think of either narrow or wide strings now. The idea that strings are arrays of characters gets in the way
if you stop confusing binary buffers with text strings, all such problems will go away. </F>
Paul Prescod wrote:
"M.-A. Lemburg" wrote:
...
The term "character" in Python should really only be used for the 8-bit strings.
Are we going to change chr() and unichr() to one_element_string() and unicode_one_element_string()
No. I am just suggesting to make use of the crystal clear definitions which the Unicode Consortium has developed for us.
u[i] is a character. If u is Unicode, then u[i] is a Python Unicode character. No Python user will find that confusing no matter how Unicode knuckle-dragging, mouth-breathing, wife-by-hair-dragging they are.
Except that u[i] maps to a code unit which may or may not be a code point. Whether a code point matches a grapheme (this is what users tend to regard as character) is yet another story due to combining code points.
In Unicode a "character" can mean any of:
Mark Davis said that "people" can use the word to mean any of those things. He did not say that it was imprecisely defined in Unicode. Nevertheless I'm not using the Unicode definition anymore than our standard library uses an ancient Greek definition of integer. Python has a concept of integer and a concept of character.
Ok, I'll stop whining. Just as final remark, let me say that our little discussion is a perfect example of how people can misunderstand each other by using the terms in different ways (Kant tried to solve this for Philosophy and did not succeed; so I guess the Unicode Consortium doesn't stand a chance either ;-)
It has been proposed that there should be a module for working with UTF-16 strings in narrow Python builds through some sort of abstraction that handles surrogates for you. If someone wants to implement that, it will be another PEP.
Uhm, narrow builds don't support UTF-16... it's UCS-2 which is supported (basically: store everything in range(0x10000)); the codecs can map code points to surrogates, but it is solely their responsibility and the responsibility of the application using them to take care of dealing with surrogates.
The user can view the data as UCS-2, UTF-16, Base64, ROT-13, XML, .... Just as we have a base64 module, we could have a UTF-16 module that interprets the data in the string as UTF-16 and does surrogate manipulation for you.
Anyhow, if any of those is the "real" encoding of the data, it is UTF-16. After all, if the codec reads in four non-BMP characters in, let's say, UTF-8, we represent them as 8 narrow-build Python characters. That's the definition of UTF-16! But it's easy enough for me to take that word out so I will.
u[i] gives you a code unit and whether this maps to a code point or not is dependent on the implementation which in turn depends on the narrow/wide choice. In UCS-2, I believe, surrogates are regarded as two code points; in UTF-16 they always have to come in pairs. There's a semantic difference here which is for the codecs and these additional tools to be aware of -- not the Unicode type implementation.
... Also, the module will be useful for both narrow and wide builds, since the notion of an encoded character can involve multiple code points. In that sense Unicode is always a variable length encoding for characters and that's the application field of this module.
I wouldn't advise that you do all different types of normalization in a single module but I'll wait for your PEP.
I'll see if I find some time at the Bordeaux Python Meeting next week.
Here's the adjusted text:
It has been proposed that there should be a module for working with Unicode objects using character-, word- and line- based indexing. The details of the implementation are left to another PEP.
It has been proposed that there should be a module that handles surrogates in narrow Python builds for programmers. If someone wants to implement that, it will be another PEP. It might also be combined with features that allow other kinds of character-, word- and line- based indexing.
Hmm, I liked my version better, but what the heck ;-) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/
participants (10)

- aahz@rahul.net
- Fredrik Lundh
- Greg Ewing
- Guido van Rossum
- Just van Rossum
- M.-A. Lemburg
- M.-A. Lemburg
- Neil Hodgson
- Paul Prescod
- Tim Peters