Paul Prescod wrote:
"M.-A. Lemburg" wrote:
...
The term "character" in Python should really only be used for the 8-bit strings.
Are we going to change chr() and unichr() to one_element_string() and unicode_one_element_string()?
No. I am just suggesting to make use of the crispy clear definitions which the Unicode Consortium has developed for us.
u[i] is a character. If u is Unicode, then u[i] is a Python Unicode character. No Python user will find that confusing no matter how Unicode knuckle-dragging, mouth-breathing, wife-by-hair-dragging they are.
Except that u[i] maps to a code unit which may or may not be a code point. Whether a code point matches a grapheme (this is what users tend to regard as a character) is yet another story due to combining code points.
In Unicode a "character" can mean any of:
Mark Davis said that "people" can use the word to mean any of those things. He did not say that it was imprecisely defined in Unicode. Nevertheless I'm not using the Unicode definition any more than our standard library uses an ancient Greek definition of integer. Python has a concept of integer and a concept of character.
Ok, I'll stop whining. Just as a final remark, let me say that our little discussion is a perfect example of how people can misunderstand each other by using the terms in different ways (Kant tried to solve this for Philosophy and did not succeed; so I guess the Unicode Consortium doesn't stand a chance either ;-)
It has been proposed that there should be a module for working with UTF-16 strings in narrow Python builds through some sort of abstraction that handles surrogates for you. If someone wants to implement that, it will be another PEP.
Uhm, narrow builds don't support UTF-16... it's UCS-2 which is supported (basically: store everything in range(0x10000)); the codecs can map code points to surrogates, but it is solely their responsibility and the responsibility of the application using them to take care of dealing with surrogates.
The user can view the data as UCS-2, UTF-16, Base64, ROT-13, XML, .... Just as we have a base64 module, we could have a UTF-16 module that interprets the data in the string as UTF-16 and does surrogate manipulation for you.
Anyhow, if any of those is the "real" encoding of the data, it is UTF-16. After all, if the codec reads in four non-BMP characters in, let's say, UTF-8, we represent them as 8 narrow-build Python characters. That's the definition of UTF-16! But it's easy enough for me to take that word out so I will.
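The "four characters become eight" arithmetic can be checked on a current Python 3 interpreter. This is only a sketch: Python 3 no longer has narrow builds, so narrow-build storage is emulated here by encoding to UTF-16 and counting 16-bit units. The helper name utf16_code_units and the sample bytes are assumptions made for illustration.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units of text when encoded as UTF-16.

    Hypothetical helper: it stands in for what a narrow build would
    have stored, one list entry per narrow-build "character".
    """
    data = text.encode("utf-16-le")  # little-endian, no byte order mark
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


# Four non-BMP characters (U+10000 .. U+10003) arriving as UTF-8.
raw = (b"\xf0\x90\x80\x80\xf0\x90\x80\x81"
       b"\xf0\x90\x80\x82\xf0\x90\x80\x83")
text = raw.decode("utf-8")
units = utf16_code_units(text)

print(len(text))   # number of code points
print(len(units))  # number of 16-bit code units
print([hex(u) for u in units])
```

Each non-BMP code point contributes one high surrogate (0xD800..0xDBFF) followed by one low surrogate (0xDC00..0xDFFF), which is why the unit count is twice the code point count.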
u[i] gives you a code unit and whether this maps to a code point or not is dependent on the implementation which in turn depends on the narrow/wide choice. In UCS-2, I believe, surrogates are regarded as two code points; in UTF-16 they always have to come in pairs. There's a semantic difference here which is for the codecs and these additional tools to be aware of -- not the Unicode type implementation.
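The difference between the two views can be sketched in a few lines. The functions below are hypothetical helpers, not part of any existing module: to_surrogates applies the standard UTF-16 arithmetic to one code point, and iter_code_points reads a sequence of 16-bit units the UTF-16 way, pairing surrogates. Passing unpaired surrogates through unchanged is an assumption of this sketch; a stricter tool might raise an error instead.

```python
def to_surrogates(code_point):
    """Split a non-BMP code point into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP have surrogate pairs")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def iter_code_points(units):
    """Yield code points from 16-bit code units, read as UTF-16."""
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # A well-formed pair: combine both units into one code point.
            low = units[i + 1]
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            i += 2
        else:
            # A BMP value, or an unpaired surrogate passed through as-is.
            yield unit
            i += 1


units = [0x41, 0xD834, 0xDD1E, 0x42]  # "A", U+1D11E, "B"
print(len(units))                      # what indexing by code unit sees
print([hex(cp) for cp in iter_code_points(units)])
```

Indexing the raw list corresponds to the UCS-2 reading, where each unit stands alone; the iterator corresponds to the UTF-16 reading, which is the job the thread assigns to codecs and add-on tools rather than to the Unicode type.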
... Also, the module will be useful for both narrow and wide builds, since the notion of an encoded character can involve multiple code points. In that sense Unicode is always a variable length encoding for characters and that's the application field of this module.
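A small illustration of why this applies to wide builds too, written for a current Python 3 str. The function group_combining is a hypothetical helper that attaches combining marks to the preceding base code point; this is a deliberate simplification and not the full grapheme cluster rules of the Unicode standard.

```python
import unicodedata


def group_combining(text):
    """Group each base code point with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            # Non-zero combining class: attach to the previous cluster.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "eleve" with an acute and a grave accent, written with combining marks.
decomposed = "e\u0301le\u0300ve"
clusters = group_combining(decomposed)

print(len(decomposed))  # code points
print(len(clusters))    # user-perceived characters in this simple model
print(len(unicodedata.normalize("NFC", decomposed)))
```

Even with one code point per index, the string holds seven code points for five user-perceived characters, so index-based slicing can still split a character from its accent.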
I wouldn't advise that you do all different types of normalization in a single module but I'll wait for your PEP.
I'll see if I find some time at the Bordeaux Python Meeting next week.
Here's the adjusted text:
It has been proposed that there should be a module for working with Unicode objects using character-, word- and line- based indexing. The details of the implementation are left to another PEP.
It has been proposed that there should be a module that handles surrogates in narrow Python builds for programmers. If someone wants to implement that, it will be another PEP. It might also be combined with features that allow other kinds of character-, word- and line- based indexing.
Hmm, I liked my version better, but what the heck ;-)

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/