[Python-Dev] UCS2/UCS4 default

Thu Jul 3 15:57:41 CEST 2008

On 2008-07-03 15:21, Jeroen Ruigrok van der Werven wrote:
> -On [20080703 15:00], M.-A. Lemburg (mal at egenix.com) wrote:
>> Unicode is full of combining code points - if you break such a sequence,
>> the output will be just as wrong; regardless of UCS2 vs. UCS4.
> 
> In my opinion you are confusing two related, but very separate things here.
> Combining characters have nothing to do with breaking up the encoding of a
> single codepoint. Sure enough, if you arbitrarily slice up codepoints that
> consist of combining characters then your result is indeed odd looking.
> 
> I never said that nor is that the point I am making.

Please remember that lone surrogate code points are nevertheless
perfectly valid Unicode code points, just as a lone combining
code point is valid on its own.
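
A small sketch of that point, assuming a current Python 3 interpreter
(not the 2.x builds under discussion) and the stdlib unicodedata module.
The variable names are only illustrative:

```python
import unicodedata

lone_surrogate = "\ud800"    # high surrogate with no low surrogate after it
combining_acute = "\u0301"   # COMBINING ACUTE ACCENT without a base character

# Each one is a single code point that a string can hold on its own.
assert len(lone_surrogate) == 1
assert len(combining_acute) == 1

# The general category tells them apart: Cs = surrogate, Mn = nonspacing mark.
surrogate_category = unicodedata.category(lone_surrogate)
combining_category = unicodedata.category(combining_acute)
combining_class = unicodedata.combining(combining_acute)

# Slicing between a base character and its combining mark splits one
# user-perceived character, although no code point is broken.
decomposed = "e\u0301"
base_only = decomposed[:1]
```

Whether such a string is meaningful text is a separate question from
whether its code points are valid.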

> Guido points out that Python supports surrogate pairs and says that if
> Python is dealing wrongly with this in the core then it needs to be fixed.
> I am pointing out that given the fact we allow surrogate pairs we deal
> rather simplistically with it in the core. In fact, we do not consider them at
> all. In essence: though we may accept full 21-bit codepoints in the form of
> \U00000000 escape sequences and store them internally as UTF-16 (which I
> still need to verify) we subsequently deal with them programmatically as
> UCS-2, which is plain silly.

Python applies conversion from non-BMP code points to surrogates
for UCS2 builds in a few places and I agree that we should probably
do that in a few more places.
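
For reference, the conversion in question is the standard UTF-16
surrogate arithmetic. A minimal sketch follows; to_surrogate_pair is a
hypothetical helper name used for illustration, not a function from the
Python core:

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a non-BMP code point: %#x" % code_point)
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


# U+1D11E MUSICAL SYMBOL G CLEF
high, low = to_surrogate_pair(0x1D11E)
```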

However, these are mainly conversion issues of encoded Unicode
representations vs. the internal Unicode storage where you want
to avoid exceptions in favor of finding a solution that preserves
data.

To make it clear: UCS2 builds of Python do not support non-BMP
code points out of the box.

A programmer will always have to use a codec to map the internal storage
on these builds to the full Unicode code point range. The following
codecs support surrogates on UCS2 builds:

  * UTF-8
  * UTF-16
  * UTF-32
  * unicode-escape
  * raw-unicode-escape
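
A rough round-trip sketch through the codecs listed above. It assumes a
current Python 3 interpreter, where there is no UCS2 build, so it only
shows the encoded forms. The explicit "-le" codec variants are my
choice here to avoid a byte order mark in the output:

```python
non_bmp = "\U00010400"  # DESERET CAPITAL LETTER LONG I, outside the BMP

codec_names = ("utf-8", "utf-16-le", "utf-32-le",
               "unicode-escape", "raw-unicode-escape")

# Encode the same code point with each codec.
encoded = {name: non_bmp.encode(name) for name in codec_names}

# Every codec decodes back to the original code point.
for name, data in encoded.items():
    assert data.decode(name) == non_bmp

# UTF-16 needs two 16-bit code units (a surrogate pair) for this code point.
utf16_units = len(encoded["utf-16-le"]) // 2
```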

> You either commit yourself fully to UTF-16 and surrogate pairs or not. Not
> some form in-between, because that will ultimately lead to more confusion
> due to the difference in results when dealing with Unicode.

Programmers will have to be aware of the fact that on UCS2
builds of Python non-BMP code points will have to be treated
differently than on UCS4 builds.
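
One way to handle that difference in application code is to check
sys.maxunicode and join surrogate pairs by hand. This is only a sketch:
code_points is a hypothetical helper, and on Python 3.3 and later
sys.maxunicode is always 0x10FFFF, so the narrow branch never applies
there:

```python
import sys

# 0xFFFF on a UCS2 (narrow) build, 0x10FFFF on a UCS4 (wide) build.
IS_NARROW_BUILD = sys.maxunicode == 0xFFFF


def code_points(text):
    """Yield integer code points, joining surrogate pairs where present."""
    i = 0
    n = len(text)
    while i < n:
        unit = ord(text[i])
        if 0xD800 <= unit <= 0xDBFF and i + 1 < n:
            following = ord(text[i + 1])
            if 0xDC00 <= following <= 0xDFFF:
                # High surrogate followed by low surrogate: one code point.
                yield 0x10000 + ((unit - 0xD800) << 10) + (following - 0xDC00)
                i += 2
                continue
        # BMP code point, or a lone surrogate passed through unchanged.
        yield unit
        i += 1
```

The helper gives the same answer on both kinds of build, which is
usually what the calling code wants.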

I don't see that as a problem. It is in a way similar to
32-bit vs. 64-bit builds of Python or the fact that floating point
numbers work differently depending on the Python platform or
compiler being used.

BTW: Have you ever run into any problems with UCS2 vs. UCS4
in practice that were not easy to solve?

-- 
Marc-Andre Lemburg
eGenix.com
