[Python-Dev] 2.2 Unicode questions
Mon, 23 Jul 2001 10:52:18 +0200
Fredrik Lundh wrote:
> mal wrote:
> > Same here: UTF-16 -> UCS-2. Note that I very much favour
> > removing the surrogate generation in unichr() for UCS2-builds.
> > If I don't hear strong opposition, I'll disable this feature
> > which was added as part of the UCS-4 patches. unichr()
> > will then raise an exception as it did in version 2.1.
> the rationale behind this change was that unichr() should
> behave like the \U escape.
Please note that unichr() is a low-level API which is part
of the Unicode implementation. The implementation itself
does not handle surrogates in any special way, only the codecs
do (and after my last checkin unicode-escape and UTF-16 do
handle surrogates correctly).
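As a rough illustration of the codec-level handling described above, here is a minimal sketch. It assumes a current Python interpreter rather than the 2.2 code base under discussion, and the variable names are only illustrative. The UTF-16 codec is what turns a non-BMP code point into a surrogate pair and back again:

```python
# Assumption: run on a modern Python; 2.2-era behaviour may differ.
text = u"\U00010000"                    # first code point outside the BMP

# The codec, not the string implementation, produces the surrogate pair:
# two 16-bit units, 0xD800 followed by 0xDC00, in big-endian byte order.
encoded = text.encode("utf-16-be")

# Decoding combines the pair back into the single code point.
decoded = encoded.decode("utf-16-be")
```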
To simplify the picture: the implementation itself only sees
UCS-2 or UCS-4 depending on the compile time option and these
do not treat surrogates in any special way except reserve
code points for their usage. Accordingly, unichr() should not
create UTF-16 but UCS-2 for narrow builds and UCS-4 on wide
builds (unichr() is a constructor for code units, not code
points).
If an application needs a UTF-16 generating API, then it can
easily implement one using the UCS-2 generating
unichr() API to create Unicode code units representing
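A minimal sketch of such an application-level helper might look like the following. The names utf16_code_units() and utf16_unichr() are hypothetical and not part of Python; the fallback to chr() is an assumption so that the sketch also runs on interpreters without unichr(). The arithmetic is the standard UTF-16 surrogate calculation:

```python
try:
    unichr                      # Python 2: the builtin discussed in this thread
except NameError:
    unichr = chr                # assumption: chr() plays the same role on Python 3

def utf16_code_units(code_point):
    """Return the UTF-16 code units (as integers) for one code point.

    Hypothetical helper, not a Python API.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code_point < 0x10000:
        # BMP values map to a single 16-bit code unit, passed through as-is.
        return [code_point]
    offset = code_point - 0x10000       # 20-bit value to split across the pair
    high = 0xD800 + (offset >> 10)      # high surrogate carries the top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # low surrogate carries the bottom 10 bits
    return [high, low]

def utf16_unichr(code_point):
    """Build a string from the code units, one unichr() call per unit."""
    return u"".join([unichr(unit) for unit in utf16_code_units(code_point)])
```

With this split, unichr() itself only ever receives values that fit in one code unit, and the decision to generate surrogate pairs stays with the application.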
> (they both take a 32-bit character code, and turn it into
> a unicode string; see GvR's mails in the ucs4 thread for more
> on this topic).
> don't change one of them without considering if the other
> one really does the right thing.
For those of you who are not too much into all these
code unit vs. code point vs. character discussions, a look at
the slides of the talk I gave at the European Python Meeting
in Bordeaux may provide some insights:
CEO eGenix.com Software GmbH
Consulting & Company: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/