[Python-Dev] Support for "wide" Unicode characters

Paul Prescod paulp@ActiveState.com
Fri, 29 Jun 2001 18:16:25 -0700


"M.-A. Lemburg" wrote:
> 
>...
> 
> I'd suggest not to use the term character in this PEP at all;
> this is also what Mark Davis recommends in his paper on Unicode.

That's fine, but Python does have its own concept of a character, and
I'm going to use the term "character" when discussing it.

> Also, a link to the Unicode glossary would be a good thing.

Funny how these little PEPs grow...

>...
> Why not make the codec used by Python to convert Unicode
> literals to Unicode strings an option just like the default
> encoding ?
>
> That way we could have a version of the unicode-escape codec
> which supports surrogates and one which doesn't.

Adding more and more knobs to tweak just makes Python code
non-portable from one machine to another.

> >          ISSUE: Should Python allow the construction of characters
> >                that do not correspond to Unicode characters?
> >                Unassigned Unicode characters should obviously be legal
> >                (because they could be assigned at any time). But
> >                code points above TOPCHAR are guaranteed never to
> >                be used by Unicode. Should we allow access to them
> >                anyhow?
> 
> I wouldn't count on that last point ;-)
>  
> Please note that you are mixing terms: you don't construct
> characters, you construct code points. Whether the concatenation
> of these code points makes a valid Unicode character string
> is an issue which applications and codecs have to decide.

unichr() does not construct code points. It constructs 1-char Python
Unicode strings...also known as Python Unicode characters.
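
To make the distinction concrete, here is a small sketch. It assumes a
present-day Python 3 interpreter, where the built-in chr() plays the
role that unichr() plays in this discussion. The helper name make_char
is hypothetical and exists only for this illustration.

```python
def make_char(code_point):
    """Return the 1-char string for code_point, or None if Python refuses."""
    try:
        return chr(code_point)  # chr() stands in for unichr() here
    except ValueError:
        # Python 3 rejects values above U+10FFFF, the last Unicode code point
        return None

a = make_char(0x41)          # an ordinary assigned character
top = make_char(0x10FFFF)    # the highest value chr() accepts
beyond = make_char(0x110000) # one past the end of the Unicode range
```

In that interpreter the first two calls return strings of length 1,
while the third is refused, which is one possible answer to the ISSUE
quoted above.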

> ... Whether the concatenation
> of these code points makes a valid Unicode character string
> is an issue which applications and codecs have to decide.

The concatenation of true code points would *always* make a valid
Unicode string, right? It's code units that cannot be blindly
concatenated.
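
A rough sketch of why, assuming UTF-16 as the encoding and Python 3 for
the code; the helper names are made up for this example. A code point
above U+FFFF becomes two code units (a surrogate pair), and the pair is
only meaningful in that order.

```python
import struct

def to_utf16_units(code_point):
    """Split one code point into UTF-16 code units.

    Assumes code_point is not itself in the surrogate range.
    """
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]

def is_well_formed(units):
    """True if the code units decode as strict little-endian UTF-16."""
    data = struct.pack("<%dH" % len(units), *units)
    try:
        data.decode("utf-16-le")
    except UnicodeDecodeError:
        return False
    return True

pair = to_utf16_units(0x10400)            # high surrogate, then low surrogate
good = is_well_formed(pair + [0x41])      # pair kept together, then "A"
bad = is_well_formed([pair[1], pair[0]])  # the two halves swapped
```

Keeping the pair together decodes cleanly; swapping the halves (or
cutting between them) leaves lone surrogates that a strict decoder
rejects.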

>...
> We should provide a new module which provides a few handy
> utilities though: functions which provide code point-,
> character-, word- and line- based indexing into Unicode
> strings.

Okay, I'll add:

    It has been proposed that there should be a module for working
    with UTF-16 strings in narrow Python builds through some sort of
    abstraction that handles surrogates for you. If someone wants
    to implement that, it will be another PEP.
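
For illustration only, the kind of abstraction meant here might start
out like the sketch below. It is written for Python 3 and models a
narrow-build string as a plain list of 16-bit code units. The function
names are hypothetical and not part of any proposed module.

```python
def iter_code_points(units):
    """Yield code points from a sequence of UTF-16 code units."""
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Combine a surrogate pair into a single code point
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP code unit, or a lone surrogate passed through unchanged
            yield unit
            i += 1

def code_point_at(units, index):
    """Return the index-th code point, counting a surrogate pair as one."""
    for position, code_point in enumerate(iter_code_points(units)):
        if position == index:
            return code_point
    raise IndexError("code point index out of range")

units = [0x41, 0xD801, 0xDC00, 0x42]  # "A", U+10400, "B"
```

With this data, code_point_at(units, 2) gives 0x42, whereas units[2] is
the low half of the surrogate pair.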

-- 
Take a recipe. Leave a recipe.  
Python Cookbook!  http://www.ActiveState.com/pythoncookbook