
Paul Prescod wrote:
"M.-A. Lemburg" wrote:
...
I'd suggest not using the term character in this PEP at all; this is also what Mark Davis recommends in his paper on Unicode.
That's fine, but Python does have a concept of character and I'm going to use the term character for discussing these.
The term "character" in Python should really only be used for the 8-bit strings. In Unicode a "character" can mean any of: """ Unfortunately the term character is vastly overloaded. At various times people can use it to mean any of these things: - An image on paper (glyph) - What an end-user thinks of as a character (grapheme) - What a character encoding standard encodes (code point) - A memory storage unit in a character encoding (code unit) Because of this, ironically, it is best to avoid the use of the term character entirely when discussing character encodings, and stick to the term code point. """ Taken from Mark Davis' paper: http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/
Also, a link to the Unicode glossary would be a good thing.
Funny how these little PEPs grow...
Is that a problem? The Unicode glossary is very useful in providing a common base for understanding the different terms and tries very hard to avoid ambiguity in meaning. This discussion is partly caused by exactly these differing understandings of the terms used in the PEP. I will update the Unicode PEP to use the Unicode terminology too.
... Why not make the codec used by Python to convert Unicode literals to Unicode strings an option just like the default encoding?
That way we could have a version of the unicode-escape codec which supports surrogates and one which doesn't.
Adding more and more knobs to tweak just adds up to Python code being non-portable from one machine to another.
Not necessarily so; I'll write a more precise spec next week. The idea is to put the codec information into the Python source code, so that it is bound to the literals; that way the Python source code stays portable across platforms. Currently this is just an idea, and I still have to check how far it can go...
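To give a rough idea of what I have in mind (the declaration syntax below is purely hypothetical; nothing is specced yet):

    # -*- literal-codec: unicode-escape-surrogates -*-
    # A hypothetical per-file declaration naming the codec Python should
    # use when converting the Unicode literals in this file into Unicode
    # strings; the file stays portable because the codec choice travels
    # with the source.
    s = u"\uD800\uDC00"    # interpreted according to the declared codec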
ISSUE: Should Python allow the construction of characters that do not correspond to Unicode characters? Unassigned Unicode characters should obviously be legal (because they could be assigned at any time). But code points above TOPCHAR are guaranteed never to be used by Unicode. Should we allow access to them anyhow?
I wouldn't count on that last point ;-)
Please note that you are mixing terms: you don't construct characters, you construct code points. Whether the concatenation of these code points makes a valid Unicode character string is an issue which applications and codecs have to decide.
unichr() does not construct code points. It constructs 1-char Python Unicode strings...also known as Python Unicode characters.
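For instance, in a 2.x interpreter:

    >>> unichr(0x41)
    u'A'
    >>> type(unichr(0x41)), len(unichr(0x41))
    (<type 'unicode'>, 1)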
... Whether the concatenation of these code points makes a valid Unicode character string is an issue which applications and codecs have to decide.
The concatenation of true code points would *always* make a valid Unicode string, right? It's code units that cannot be blindly concatenated.
Both wrong :-) U+D800 is a valid Unicode code point and can occur as a code unit in both narrow and wide builds. Concatenating it with e.g. U+0020 will still give a valid Unicode code point sequence (aka Unicode object), but not a valid Unicode character string (since U+D800 is not a character). The same is true for e.g. U+FFFF.

Note that the Unicode type should happily store these values, while the codecs complain. As a result, and as I said above, dealing with these problems is left to the applications which use these Unicode objects.
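For example (2.x syntax; which code point sequences a codec rejects is ultimately up to that codec):

    s = unichr(0xD800) + u" "    # lone high surrogate + U+0020
    print type(s), len(s)        # <type 'unicode'> 2 -- stored just fine
    # The Unicode type happily stores this 2-code-point sequence, but
    # since U+D800 is not a character, a codec may refuse to encode it
    # (today's UTF-8 codecs, for instance, reject lone surrogates).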
... We should provide a new module which provides a few handy utilities though: functions which provide code point-, character-, word- and line-based indexing into Unicode strings.
Okay, I'll add:
It has been proposed that there should be a module for working with UTF-16 strings in narrow Python builds through some sort of abstraction that handles surrogates for you. If someone wants to implement that, it will be another PEP.
Uhm, narrow builds don't support UTF-16... it's UCS-2 which is supported (basically: store everything in range(0x10000)). The codecs can map code points to surrogates, but it is solely their responsibility, and the responsibility of the application using them, to take care of dealing with surrogates.

Also, the module will be useful for both narrow and wide builds, since the notion of an encoded character can involve multiple code points. In that sense Unicode is always a variable-length encoding for characters, and that is the application field of this module.

Here's the adjusted text:

    It has been proposed that there should be a module for working with
    Unicode objects using character-, word- and line-based indexing. The
    details of the implementation are left to another PEP.

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/