
"M.-A. Lemburg" wrote:
...
The term "character" in Python should really only be used for the 8-bit strings.
Are we going to change chr() and unichr() to one_element_string() and unicode_one_element_string()?

u[i] is a character. If u is Unicode, then u[i] is a Python Unicode character. No Python user will find that confusing, no matter how Unicode knuckle-dragging, mouth-breathing, wife-by-hair-dragging they are.
In Unicode a "character" can mean any of:
Mark Davis said that "people" can use the word to mean any of those things. He did not say that it was imprecisely defined in Unicode. Nevertheless, I'm not using the Unicode definition any more than our standard library uses an ancient Greek definition of integer. Python has a concept of integer and a concept of character.
> > It has been proposed that there should be a module for working with UTF-16 strings in narrow Python builds through some sort of abstraction that handles surrogates for you. If someone wants to implement that, it will be another PEP.
> Uhm, narrow builds don't support UTF-16... it's UCS-2 which is supported (basically: store everything in range(0x10000)); the codecs can map code points to surrogates, but it is solely their responsibility and the responsibility of the application using them to take care of dealing with surrogates.
The user can view the data as UCS-2, UTF-16, Base64, ROT-13, XML, .... Just as we have a base64 module, we could have a UTF-16 module that interprets the data in the string as UTF-16 and does surrogate manipulation for you. Anyhow, if any of those is the "real" encoding of the data, it is UTF-16. After all, if the codec reads in four non-BMP characters from, let's say, UTF-8, we represent them as eight narrow-build Python characters. That's the definition of UTF-16! But it's easy enough for me to take that word out, so I will.
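To make the surrogate bookkeeping concrete, here is a minimal sketch (mine, not from the PEP; the function name is hypothetical) of the arithmetic such a UTF-16 module would hide:

    def to_surrogate_pair(code_point):
        """Split a non-BMP code point into a UTF-16 high/low surrogate pair."""
        assert 0x10000 <= code_point <= 0x10FFFF
        offset = code_point - 0x10000
        high = 0xD800 + (offset >> 10)    # high (leading) surrogate
        low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
        return high, low

    # U+1D11E (MUSICAL SYMBOL G CLEF) becomes two narrow-build "characters",
    # so four such characters decoded from UTF-8 occupy eight storage units,
    # exactly the UTF-16 representation.
    print([hex(unit) for unit in to_surrogate_pair(0x1D11E)])   # ['0xd834', '0xdd1e']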
> ... Also, the module will be useful for both narrow and wide builds, since the notion of an encoded character can involve multiple code points. In that sense Unicode is always a variable length encoding for characters and that's the application field of this module.
I wouldn't advise that you do all the different types of normalization in a single module, but I'll wait for your PEP.
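For what it's worth, the "one character, several code points" case is easy to show with the standard library's unicodedata module; this is only an illustration of that point, not part of any proposed module:

    # Illustration only: one user-perceived character, two different
    # code point sequences.
    import unicodedata

    decomposed = u"e\u0301"    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
    precomposed = u"\u00e9"    # LATIN SMALL LETTER E WITH ACUTE

    print(len(decomposed))     # 2 code points
    print(len(precomposed))    # 1 code point
    print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True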
Here's the adjusted text:
It has been proposed that there should be a module for working with Unicode objects using character-, word-, and line-based indexing. The details of the implementation are left to another PEP.
It has been proposed that there should be a module that handles surrogates in narrow Python builds for programmers. If someone wants to implement that, it will be another PEP. It might also be combined with features that allow other kinds of character-, word-, and line-based indexing.
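As a rough illustration of what such a module might offer (the name and interface below are hypothetical, not part of this or any other PEP), a surrogate-aware iterator for narrow builds could look something like this:

    def iter_code_points(u):
        """Yield integer code points from a Unicode string, joining
        high/low surrogate pairs into a single non-BMP code point."""
        i, n = 0, len(u)
        while i < n:
            unit = ord(u[i])
            if 0xD800 <= unit <= 0xDBFF and i + 1 < n:
                low = ord(u[i + 1])
                if 0xDC00 <= low <= 0xDFFF:
                    yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                    i += 2
                    continue
            yield unit
            i += 1

    # On a narrow build, u"\ud834\udd1e" stores U+1D11E as two code units;
    # the iterator reports it as one code point.
    print([hex(cp) for cp in iter_code_points(u"\ud834\udd1e")])   # ['0x1d11e']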