Re: [Python-Dev] Support for "wide" Unicode characters

2 Jul 2001


      Paul Prescod wrote:
...
"M.-A. Lemburg" wrote:
...
...
The term "character" in Python should really only be used for
the 8-bit strings.
Are we going to change chr() and unichr() to one_element_string() and
unicode_one_element_string()
No. I am just suggesting to make use of the crisp, clear
definitions which the Unicode Consortium has developed for us.
...
u[i] is a character. If u is Unicode, then u[i] is a Python Unicode
character. No Python user will find that confusing no matter how Unicode
knuckle-dragging, mouth-breathing, wife-by-hair-dragging they are.
Except that u[i] maps to a code unit which may or may not be
a code point. Whether a code point matches a grapheme (this
is what users tend to regard as character) is yet another
story due to combining code points.
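To make the code unit versus code point distinction concrete, here is a minimal sketch. It assumes a modern Python 3, where str indexes by code point, so the narrow-build view is simulated by encoding to UTF-16; the helper names utf16_units and is_surrogate are made up for this illustration.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of `text` as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def is_surrogate(unit):
    """True if a 16-bit code unit lies in the surrogate range."""
    return 0xD800 <= unit <= 0xDFFF

# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, outside the BMP), 'b'
s = "a\U0001D11Eb"
units = utf16_units(s)

print(len(s))       # number of code points
print(len(units))   # number of 16-bit code units
# Indexing the code-unit view at position 1 yields half of a pair,
# which is what u[1] would have returned on a narrow build.
print(hex(units[1]), is_surrogate(units[1]))
```

The string has three code points but four code units, and units[1] on its own is not a code point at all.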
...
...
In Unicode a "character" can mean any of:
Mark Davis said that "people" can use the word to mean any of those
things. He did not say that it was imprecisely defined in Unicode.
Nevertheless I'm not using the Unicode definition any more than our
standard library uses an ancient Greek definition of integer. Python has
a concept of integer and a concept of character.
Ok, I'll stop whining. Just as a final remark, let me say that
our little discussion is a perfect example of how people can
misunderstand each other by using the terms in different ways
(Kant tried to solve this for Philosophy and did not succeed;
so I guess the Unicode Consortium doesn't stand a chance 
either ;-)
...
...
...
It has been proposed that there should be a module for working
    with UTF-16 strings in narrow Python builds through some sort of
    abstraction that handles surrogates for you. If someone wants
    to implement that, it will be another PEP.
Uhm, narrow builds don't support UTF-16... it's UCS-2 which
is supported (basically: store everything in range(0x10000));
the codecs can map code points to surrogates, but it is solely
their responsibility and the responsibility of the application
using them to take care of dealing with surrogates.
The user can view the data as UCS-2, UTF-16, Base64, ROT-13, XML, ....
Just as we have a base64 module, we could have a UTF-16 module that
interprets the data in the string as UTF-16 and does surrogate
manipulation for you.
Anyhow, if any of those is the "real" encoding of the data, it is
UTF-16. After all, if the codec reads in four non-BMP characters in,
let's say, UTF-8, we represent them as 8 narrow-build Python characters.
That's the definition of UTF-16! But it's easy enough for me to take
that word out so I will.
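The "four characters become eight" arithmetic can be checked with a short sketch. It assumes a modern Python 3 and counts UTF-16 code units explicitly, since current str objects no longer expose them; the four sample code points are arbitrary choices for illustration.

```python
# Four code points outside the BMP (arbitrary examples).
text = "\U00010000\U0001D11E\U0001F600\U00020000"

utf8 = text.encode("utf-8")          # what the codec would read in
decoded = utf8.decode("utf-8")       # back to code points
utf16 = decoded.encode("utf-16-le")  # the narrow-build style storage

n_code_points = len(decoded)
n_utf8_bytes = len(utf8)
n_utf16_units = len(utf16) // 2      # two bytes per 16-bit code unit

print(n_code_points, n_utf8_bytes, n_utf16_units)
```

Each non-BMP code point takes four bytes in UTF-8 and two 16-bit units in UTF-16, hence 8 units for 4 code points.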
u[i] gives you a code unit and whether this maps to a code point
or not is dependent on the implementation which in turn depends
on the narrow/wide choice.

In UCS-2, I believe, surrogates are regarded as two code points;
in UTF-16 they always have to come in pairs. There's a semantic
difference here which is for the codecs and these additional
tools to be aware of -- not the Unicode type implementation.
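The pairing rule that a codec or helper tool would have to enforce can be sketched as follows. This is only an illustration of the standard UTF-16 surrogate arithmetic; the function names are hypothetical and not part of any proposed module.

```python
def to_surrogates(cp):
    """Split a non-BMP code point into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def is_well_formed_utf16(units):
    """True if every surrogate in `units` is part of a proper pair."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return False
        else:
            i += 1
    return True

high, low = to_surrogates(0x1D11E)
print(hex(high), hex(low), hex(from_surrogates(high, low)))
print(is_well_formed_utf16([0x61, high, low]))  # paired
print(is_well_formed_utf16([0x61, high]))       # lone high surrogate
```

A plain 16-bit storage type accepts both sequences; only the UTF-16 interpretation rejects the second.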
...
...
...
Also, the module will be useful for both narrow and wide builds,
since the notion of an encoded character can involve multiple code
points. In that sense Unicode is always a variable length
encoding for characters and that's the application field of
this module.
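As a rough sketch of why this applies to wide builds too, the following groups combining code points with their base character. It is a deliberate simplification (real grapheme segmentation has more rules than this), and combining_clusters is a hypothetical name, not the interface of the proposed module.

```python
import unicodedata

def combining_clusters(text):
    """Group each base code point with the combining marks after it.

    Simplified: a code point with a non-zero canonical combining
    class is attached to the preceding cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# 'e' + COMBINING ACUTE ACCENT, then 'a': three code points.
s = "e\u0301a"
clusters = combining_clusters(s)
print(len(s), len(clusters))
```

Even with one code point per index, the user-visible character "é" here spans two positions.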
I wouldn't advise that you do all different types of normalization in a
single module but I'll wait for your PEP.
I'll see if I find some time at the Bordeaux Python Meeting
next week.
...
...
Here's the adjusted text:
It has been proposed that there should be a module for working
     with Unicode objects using character-, word- and line- based
     indexing. The details of the implementation are left to
     another PEP.
It has been proposed that there should be a module that handles
     surrogates in narrow Python builds for programmers. If someone
     wants to implement that, it will be another PEP. It might also be
     combined with features that allow other kinds of character-,
     word- and line- based indexing.
Hmm, I liked my version better, but what the heck ;-)

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/