[Python-Dev] Support for "wide" Unicode characters

M.-A. Lemburg mal@egenix.com
Sat, 30 Jun 2001 13:52:38 +0200

Paul Prescod wrote:
> "M.-A. Lemburg" wrote:
> >
> >...
> >
> > I'd suggest not to use the term character in this PEP at all;
> > this is also what Mark Davis recommends in his paper on Unicode.
> That's fine, but Python does have a concept of character and I'm going
> to use the term character for discussing these.

The term "character" in Python should really only be used for
the 8-bit strings. In Unicode a "character" can mean any of
several things. Taken from Mark Davis' paper:

  Unfortunately the term character is vastly
  overloaded. At various times people can use it to mean any of these things:

  -    An image on paper (glyph) 
  -    What an end-user thinks of as a character (grapheme) 
  -    What a character encoding standard encodes (code point) 
  -    A memory storage unit in a character encoding (code unit) 

  Because of this, ironically, it is best to avoid the use of the term character 
  entirely when discussing character encodings, and stick
  to the term code point.
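To make the distinction concrete in today's terms, here is a small
Python 3 sketch (in Python 3, str stores code points, so the names
map cleanly onto Davis' terminology):

```python
import unicodedata

# "é" built from two code points: 'e' + U+0301 COMBINING ACUTE ACCENT.
s = "e\u0301"

print(s)                              # one grapheme: what the end-user sees
print(len(s))                         # 2 code points
print(len(s.encode("utf-8")))         # 3 code units (bytes) in UTF-8
print(len(unicodedata.normalize("NFC", s)))  # 1 code point after composition
```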

> > Also, a link to the Unicode glossary would be a good thing.
> Funny how these little PEPs grow...

Is that a problem?

The Unicode glossary is very useful in providing a common
base for understanding the different terms and tries very
hard to avoid ambiguity in meaning.

This discussion is partly caused by exactly these differing
understandings of the terms used in the PEP.

I will update the Unicode PEP to the Unicode terminology too.

> >...
> > Why not make the codec used by Python to convert Unicode
> > literals to Unicode strings an option just like the default
> > encoding ?
> >
> > That way we could have a version of the unicode-escape codec
> > which supports surrogates and one which doesn't.
> Adding more and more knobs to tweak just adds up to Python code being
> non-portable from one machine to another.

Not necessarily so; I'll write a more precise spec next
week. The idea is to put the codec information into the
Python source code itself, so that it is bound to the
literals, with the result that the Python source code
stays portable across platforms.

Currently this is just an idea; I still have to check how
far it can go...
> > >          ISSUE: Should Python allow the construction of characters
> > >                that do not correspond to Unicode characters?
> > >                Unassigned Unicode characters should obviously be legal
> > >                (because they could be assigned at any time). But
> > >                code points above TOPCHAR are guaranteed never to
> > >                be used by Unicode. Should we allow access to them
> > >                anyhow?
> >
> > I wouldn't count on that last point ;-)
> >
> > Please note that you are mixing terms: you don't construct
> > characters, you construct code points. Whether the concatenation
> > of these code points makes a valid Unicode character string
> > is an issue which applications and codecs have to decide.
> unichr() does not construct code points. It constructs 1-char Python
> Unicode strings...also known as Python Unicode characters.
> > ... Whether the concatenation
> > of these code points makes a valid Unicode character string
> > is an issue which applications and codecs have to decide.
> The concatenation of true code points would *always* make a valid
> Unicode string, right? It's code units that cannot be blindly
> concatenated.

Both wrong :-)

U+D800 is a valid Unicode code point and can occur as a
code unit in both narrow and wide builds. Concatenating
this with e.g. U+0020 will still make a valid Unicode
code point sequence (aka Unicode object), but not a valid 
Unicode character string (since U+D800 is not a character).

The same is true for e.g. U+FFFF.

Note that the Unicode type should happily store these values,
while the codecs complain. As a result, and as I said above,
dealing with these problems is left to the applications which
use these Unicode objects.
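This behaviour carries over to modern Python: a str happily stores
a lone surrogate, and concatenation works at the code point level,
but the UTF-8 codec refuses to encode it. A small sketch:

```python
# U+D800 is a valid code point, but not a valid character.
s = "\ud800" + " "        # concatenation of code points works fine

print(len(s))             # 2: the string type stores the lone surrogate

try:
    s.encode("utf-8")     # the codec is where the complaint happens
except UnicodeEncodeError as exc:
    print("codec complained:", exc.reason)
```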
> >...
> > We should provide a new module which provides a few handy
> > utilities though: functions which provide code point-,
> > character-, word- and line- based indexing into Unicode
> > strings.
> Okay, I'll add:
>     It has been proposed that there should be a module for working
>     with UTF-16 strings in narrow Python builds through some sort of
>     abstraction that handles surrogates for you. If someone wants
>     to implement that, it will be another PEP.

Uhm, narrow builds don't support UTF-16... it's UCS-2 which
is supported (basically: store everything in range(0x10000));
the codecs can map code points to surrogates, but it is solely
their responsibility and the responsibility of the application
using them to take care of dealing with surrogates.
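In a modern (wide, PEP 393-style) Python the same division of labour
is visible: code points above U+FFFF are single items in a str, and
it is the UTF-16 codec that maps them to surrogate pairs. A sketch:

```python
s = "\U00010000"                # first code point outside the BMP

print(len(s))                   # 1 code point in the string type
data = s.encode("utf-16-be")    # the codec produces the surrogate pair
print(data.hex())               # d800dc00: high D800 + low DC00
print(len(data) // 2)           # 2 UTF-16 code units
```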

Also, the module will be useful for both narrow and wide builds,
since the notion of an encoded character can involve multiple
code points. In that sense Unicode is always a variable-length
encoding for characters, and that's the application field of
this module.
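As a sketch of what such character-based indexing might look like,
here is a hypothetical helper (the name `chars` and the grouping rule
are my own; real grapheme segmentation per UAX #29 is considerably
more involved):

```python
import unicodedata

def chars(s):
    """Yield each base code point together with its trailing
    combining marks as one unit -- a crude approximation of
    character-based indexing."""
    cluster = ""
    for cp in s:
        if cluster and unicodedata.combining(cp):
            cluster += cp           # attach combining mark to its base
        else:
            if cluster:
                yield cluster
            cluster = cp            # start a new cluster
    if cluster:
        yield cluster

# "e" + COMBINING ACUTE ACCENT + "x": 3 code points, 2 characters.
print(list(chars("e\u0301x")))
```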

Here's the adjusted text:

     It has been proposed that there should be a module for working
     with Unicode objects using character-, word- and line-based 
     indexing. The details of the implementation are left to 
     another PEP.

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/