[Python-Dev] Support for "wide" Unicode characters

Paul Prescod paulp@ActiveState.com
Sat, 30 Jun 2001 21:04:49 -0700


"M.-A. Lemburg" wrote:
> 
>...
> 
> The term "character" in Python should really only be used for
> the 8-bit strings. 

Are we going to change chr() and unichr() to one_element_string() and
unicode_one_element_string()?

u[i] is a character. If u is Unicode, then u[i] is a Python Unicode
character. No Python user will find that confusing no matter how Unicode
knuckle-dragging, mouth-breathing, wife-by-hair-dragging they are.

> In Unicode a "character" can mean any of:

Mark Davis said that "people" can use the word to mean any of those
things. He did not say that it was imprecisely defined in Unicode.
Nevertheless I'm not using the Unicode definition any more than our
standard library uses an ancient Greek definition of integer. Python has
a concept of integer and a concept of character.

> >     It has been proposed that there should be a module for working
> >     with UTF-16 strings in narrow Python builds through some sort of
> >     abstraction that handles surrogates for you. If someone wants
> >     to implement that, it will be another PEP.
> 
> Uhm, narrow builds don't support UTF-16... it's UCS-2 which
> is supported (basically: store everything in range(0x10000));
> the codecs can map code points to surrogates, but it is solely
> their responsibility and the responsibility of the application
> using them to take care of dealing with surrogates.

The user can view the data as UCS-2, UTF-16, Base64, ROT-13, XML, ....
Just as we have a base64 module, we could have a UTF-16 module that
interprets the data in the string as UTF-16 and does surrogate
manipulation for you.
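The surrogate manipulation such a module would do is just fixed
arithmetic defined by UTF-16. A minimal sketch (the helper names here
are hypothetical, not from any proposed module):

```python
def to_surrogate_pair(cp):
    """Split a non-BMP code point (>= 0x10000) into a UTF-16
    high/low surrogate pair of 16-bit code units."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                 # 20-bit value to distribute
    high = 0xD800 + (v >> 10)        # high (lead) surrogate
    low = 0xDC00 + (v & 0x3FF)       # low (trail) surrogate
    return high, low

def from_surrogate_pair(high, low):
    """Recombine a high/low surrogate pair into the code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, U+10400 splits into (0xD801, 0xDC00), matching what the
utf-16 codecs produce.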

Anyhow, if any of those is the "real" encoding of the data, it is
UTF-16. After all, if the codec reads four non-BMP characters from,
let's say, UTF-8, we represent them as 8 narrow-build Python
characters. That's the definition of UTF-16! But it's easy enough for
me to take that word out, so I will.
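The "four characters become eight code units" arithmetic is easy to
see by encoding to UTF-16 explicitly (current Pythons no longer have
narrow builds, but the encoded form shows the same code-unit count a
narrow build would store):

```python
# Each non-BMP character occupies two 16-bit code units in UTF-16,
# so four such characters take eight code units.
text = "\U0001D400" * 4                    # four non-BMP characters
units = len(text.encode("utf-16-be")) // 2  # bytes -> 16-bit units
print(units)  # 8
```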

>...
> Also, the module will be useful for both narrow and wide builds,
> since the notion of an encoded character can involve multiple code
> points. In that sense Unicode is always a variable length
> encoding for characters and that's the application field of
> this module.

I wouldn't advise that you do all the different types of normalization
in a single module, but I'll wait for your PEP.
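The point that a character can involve multiple code points holds even
on wide builds, as a quick illustration with the standard unicodedata
module shows (combining sequences, independent of any surrogate issue):

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT is one user-perceived
# character but two code points; NFC normalization composes it
# into the single code point U+00E9.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1
```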

> Here's the adjusted text:
> 
>      It has been proposed that there should be a module for working
>      with Unicode objects using character-, word- and line- based
>      indexing. The details of the implementation are left to
>      another PEP.
 
     It has been proposed that there should be a module that handles
     surrogates in narrow Python builds for programmers. If someone 
     wants to implement that, it will be another PEP. It might also be 
     combined with features that allow other kinds of character-, 
     word- and line- based indexing.

-- 
Take a recipe. Leave a recipe.  
Python Cookbook!  http://www.ActiveState.com/pythoncookbook