
"M.-A. Lemburg" wrote:
...
The term "character" in Python should really only be used for the 8-bit strings.
Are we going to change chr() and unichr() to one_element_string() and unicode_one_element_string()?

u[i] is a character. If u is Unicode, then u[i] is a Python Unicode character. No Python user will find that confusing, no matter how Unicode knuckle-dragging, mouth-breathing, wife-by-hair-dragging they are.
In Unicode a "character" can mean any of:
Mark Davis said that "people" can use the word to mean any of those things. He did not say that it was imprecisely defined in Unicode. Nevertheless, I'm not using the Unicode definition any more than our standard library uses an ancient Greek definition of integer. Python has a concept of integer and a concept of character.
> > It has been proposed that there should be a module for working with UTF-16 strings in narrow Python builds through some sort of abstraction that handles surrogates for you. If someone wants to implement that, it will be another PEP.
> Uhm, narrow builds don't support UTF-16... it's UCS-2 which is supported (basically: store everything in range(0x10000)); the codecs can map code points to surrogates, but it is solely their responsibility and the responsibility of the application using them to take care of dealing with surrogates.
The user can view the data as UCS-2, UTF-16, Base64, ROT-13, XML, .... Just as we have a base64 module, we could have a UTF-16 module that interprets the data in the string as UTF-16 and does surrogate manipulation for you. Anyhow, if any of those is the "real" encoding of the data, it is UTF-16. After all, if the codec reads in four non-BMP characters from, let's say, UTF-8, we represent them as eight narrow-build Python characters. That's the definition of UTF-16! But it's easy enough for me to take that word out, so I will.
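To make the surrogate bookkeeping concrete, here is a minimal sketch (mine, not from the PEP; the function name is hypothetical) of the arithmetic such a UTF-16 module would hide:

    def to_surrogate_pair(code_point):
        """Split a non-BMP code point into a UTF-16 high/low surrogate pair."""
        assert 0x10000 <= code_point <= 0x10FFFF
        offset = code_point - 0x10000
        high = 0xD800 + (offset >> 10)    # high (leading) surrogate
        low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
        return high, low

    # U+1D11E (MUSICAL SYMBOL G CLEF) becomes two narrow-build "characters",
    # so four such characters decoded from UTF-8 occupy eight storage units,
    # exactly the UTF-16 representation.
    print([hex(unit) for unit in to_surrogate_pair(0x1D11E)])   # ['0xd834', '0xdd1e']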
> ... Also, the module will be useful for both narrow and wide builds, since the notion of an encoded character can involve multiple code points. In that sense Unicode is always a variable length encoding for characters and that's the application field of this module.
I wouldn't advise that you do all the different types of normalization in a single module, but I'll wait for your PEP.
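For what it's worth, the "one character, several code points" case is easy to show with the standard library's unicodedata module; this is only an illustration of that point, not part of any proposed module:

    # Illustration only: one user-perceived character, two different
    # code point sequences.
    import unicodedata

    decomposed = u"e\u0301"    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
    precomposed = u"\u00e9"    # LATIN SMALL LETTER E WITH ACUTE

    print(len(decomposed))     # 2 code points
    print(len(precomposed))    # 1 code point
    print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True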
Here's the adjusted text:
It has been proposed that there should be a module for working with Unicode objects using character-, word-, and line-based indexing. The details of the implementation are left to another PEP.
It has been proposed that there should be a module that handles surrogates in narrow Python builds for programmers. If someone wants to implement that, it will be another PEP. It might also be combined with features that allow other kinds of character-, word-, and line-based indexing.
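As a rough illustration of what such a module might offer (the name and interface below are hypothetical, not part of this or any other PEP), a surrogate-aware iterator for narrow builds could look something like this:

    def iter_code_points(u):
        """Yield integer code points from a Unicode string, joining
        high/low surrogate pairs into a single non-BMP code point."""
        i, n = 0, len(u)
        while i < n:
            unit = ord(u[i])
            if 0xD800 <= unit <= 0xDBFF and i + 1 < n:
                low = ord(u[i + 1])
                if 0xDC00 <= low <= 0xDFFF:
                    yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                    i += 2
                    continue
            yield unit
            i += 1

    # On a narrow build, u"\ud834\udd1e" stores U+1D11E as two code units;
    # the iterator reports it as one code point.
    print([hex(cp) for cp in iter_code_points(u"\ud834\udd1e")])   # ['0x1d11e']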