[Python-Dev] 2.2 Unicode questions

Mon, 23 Jul 2001 12:36:38 +0200

Fredrik Lundh wrote:
> 
> MAL wrote:
> > To simplify the picture: the implementation itself only sees
> > UCS-2 or UCS-4 depending on the compile time option and these
> > do not treat surrogates in any special way except reserve
> > code points for their usage. Accordingly, unichr() should not
> > create UTF-16 but UCS-2 for narrow builds and UCS-4 on wide
> > builds
> 
> you didn't answer my question: is there any reason why
> unichr(0xXXXXXXXX) shouldn't return exactly the same
> thing as "\UXXXXXXXX" ?
> 
> in 2.0 and 2.1, it doesn't.  in 2.2, it does.
>
> > (unichr() is a contructor for code units, not code points).

Doesn't this answer your question ? The point I wanted to
make was that unichr() is a constructor for a single code unit
just like chr() is a constructor for a single code unit -- in
that sense the storage format used by the implementation defines
the outcome: for UCS-2 builds, it can only create UCS-2 values,
for UCS-4 builds, UCS-4 values are possible as well.

The question of u"\UXXXXXXXX" creating surrogates on UCS-2
builds is different: \UXXXXXXXX is an encoding of a Unicode
code point, so the codec has to decide whether or not to
map this to two code units or an exception on UCS-2 builds.

> really?  according to the documentation, it creates unicode
> *characters*.  so does \U, according to the documentation.
> 
> imo, it makes more sense to let "characters" mean code points
> than code units, but that's me. 

The term "character" is vastly overloaded. There are three
different forms of interpretation: graphemes (this is what
a user usually sees as character on her display), codec points
(this is what Unicode encodes) and code units (this is what
the implementation uses a atom for storing code points).

Since Python exposes code units (u[0] gives you direct access
to the implementation defined storage area) and makes no
assumption about surrogates, it would not be a good idea to
suddenly introduce a break in the meaning of the outcome of
indexing into a Unicode string (u[0]) and len(unichr()).

I know that the name unichr() does not help in this situation,
the correct name would be unicodeunit().

> the important thing here is to
> figure out if \U and unichr are the same thing, and fix the code
> and the documentation to do/say what we mean.

Right.

Note that apart from agreeing on a common meaning, we should
also think about the consequences of breaking len(unichr())==1,
e.g. when creating a Unicode string using unichr() you'd expect
to find the generated code unit at the position you appended
it to the Unicode object.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/