[Python-Dev] 2.2 Unicode questions

M.-A. Lemburg mal@lemburg.com
Mon, 23 Jul 2001 10:52:18 +0200

Fredrik Lundh wrote:
> mal wrote:
> > Same here: UTF-16 -> UCS-2. Note that I very much favour
> > removing the surrogate generation in unichr() for UCS2-builds.
> >
> > If I don't hear strong opposition, I'll disable this feature
> > which was added as part of the UCS-4 patches. unichr()
> > will then raise an exception as it did in version 2.1.
> the rationale behind this change was that unichr() should
> behave like the \U escape.

Please note that unichr() is a low-level API which is part
of the Unicode implementation. The implementation itself
does not handle surrogates in any special way, only the codecs
do (and after my last checkin unicode-escape and UTF-16 do
handle surrogates correctly).

To simplify the picture: the implementation itself only sees
UCS-2 or UCS-4, depending on the compile-time option, and these
do not treat surrogates in any special way except to reserve
code points for their use. Accordingly, unichr() should not
create UTF-16 but UCS-2 on narrow builds and UCS-4 on wide
builds (unichr() is a constructor for code units, not code
points).

If an application needs a UTF-16 generating API, then it can
easily implement one using the UCS-2 generating
unichr() API to create Unicode code units representing
isolated surrogates.
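
As a rough illustration of such an application-level helper, here is a
minimal sketch. The name utf16_unichr() is hypothetical and not part of
Python; the sketch assumes that unichr() accepts isolated surrogate
values (0xD800-0xDFFF) and returns a single code unit for them. The
fallback to chr() is only there so the sketch also runs on interpreters
that have no unichr() builtin.

```python
try:
    unichr                      # available as a builtin in Python 2.x
except NameError:
    unichr = chr                # assumption: chr() plays the same role elsewhere


def utf16_unichr(code_point):
    """Hypothetical helper: return the UTF-16 code unit sequence
    for code_point as a Unicode string of one or two code units."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range: %r" % (code_point,))
    if code_point < 0x10000:
        # BMP value: a single code unit is enough.
        return unichr(code_point)
    # Non-BMP value: split the 20-bit offset into two surrogates.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)      # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)     # low (trailing) surrogate
    # Each unichr() call builds one isolated surrogate code unit.
    return unichr(high) + unichr(low)
```

With this in place, unichr() itself stays a plain code unit constructor,
and only callers that explicitly want surrogate pairs pay for the extra
logic.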

> (they both take a 32-bit character code, and turn it into
> a unicode string; see GvR's mails in the ucs4 thread for more
> on this topic).
> don't change one of them without considering if the other
> one really does the right thing.


For those of you who are not too much into all these
code unit vs. code point vs. character discussions, a look at
the slides of the talk I gave at the European Python Meeting 
in Bordeaux may provide some insights:



Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/