[Python-Dev] UCS2/UCS4 default
M.-A. Lemburg
mal at egenix.com
Thu Jul 3 15:57:41 CEST 2008
On 2008-07-03 15:21, Jeroen Ruigrok van der Werven wrote:
> -On [20080703 15:00], M.-A. Lemburg (mal at egenix.com) wrote:
>> Unicode is full of combining code points - if you break such a sequence,
>> the output will be just as wrong; regardless of UCS2 vs. UCS4.
>
> In my opinion you are confusing two related, but very separate things here.
> Combining characters have nothing to do with breaking up the encoding of a
> single codepoint. Sure enough, if you arbitrarily slice up codepoints that
> consist of combining characters then your result is indeed odd looking.
>
> I never said that nor is that the point I am making.
Please remember that lone surrogate code points are nevertheless
perfectly valid Unicode code points, just as a lone combining
code point is valid on its own.
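A minimal sketch of this point, assuming a Python 3 interpreter and the
standard unicodedata module (the variable names are illustrative only):
both a lone surrogate and a lone combining mark are individual code points
with their own general category, and slicing a combining sequence loses
information on any build.

```python
import unicodedata

lone_surrogate = "\ud800"   # high surrogate with no low surrogate after it
lone_combining = "\u0301"   # COMBINING ACUTE ACCENT with no base character

# Each one is a single code point in the string.
assert len(lone_surrogate) == 1
assert len(lone_combining) == 1

# General categories: "Cs" is surrogate, "Mn" is nonspacing mark.
surrogate_category = unicodedata.category(lone_surrogate)
combining_category = unicodedata.category(lone_combining)

# "e" followed by a combining accent is two code points; slicing after
# the first one silently drops the accent, whatever the build width.
decomposed = "e\u0301"
sliced = decomposed[:1]
```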
> Guido points out that Python supports surrogate pairs and says that if
> Python is dealing wrongly with this in the core then it needs to be fixed.
> I am pointing out that given the fact we allow surrogate pairs we deal
> rather simplistically with it in the core. In fact, we do not consider them at
> all. In essence: though we may accept full 21-bit codepoints in the form of
> \U00000000 escape sequences and store them internally as UTF-16 (which I
> still need to verify) we subsequently deal with them programmatically as
> UCS-2, which is plain silly.
Python applies conversion from non-BMP code points to surrogates
for UCS2 builds in a few places and I agree that we should probably
do that in a few more places.
However, these are mainly conversion issues of encoded Unicode
representations vs. the internal Unicode storage where you want
to avoid exceptions in favor of finding a solution that preserves
data.
To make it clear: UCS2 builds of Python do not support non-BMP
code points out of the box.
A programmer will always have to use a codec to map the internal storage
on these builds to the full Unicode code point range. The following
codecs support surrogates on UCS2 builds:
* UTF-8
* UTF-16
* UTF-32
* unicode-escape
* raw-unicode-escape
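As a rough sketch of what such a codec does with a non-BMP code point,
the standard UTF-16 surrogate arithmetic can be written out by hand and
compared against the UTF-16 codec. This assumes Python 3.3 or later, where
chr() accepts the full code point range; the helper names
to_surrogate_pair and from_surrogate_pair are hypothetical and not part of
Python.

```python
import struct

def to_surrogate_pair(cp):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a non-BMP code point: %#x" % cp)
    offset = cp - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

cp = 0x1D11E  # MUSICAL SYMBOL G CLEF, outside the BMP
high, low = to_surrogate_pair(cp)

# The UTF-16 codec emits the same two 16-bit code units (big-endian here).
units = struct.unpack(">2H", chr(cp).encode("utf-16-be"))
```

On a UCS2 build the two code units are what ends up in the internal
storage, which is why len() reports 2 for such a character there.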
> You either commit yourself fully to UTF-16 and surrogate pairs or not. Not
> some form in-between, because that will ultimately lead to more confusion
> due to the difference in results when dealing with Unicode.
Programmers will have to be aware of the fact that on UCS2
builds of Python non-BMP code points will have to be treated
differently than on UCS4 builds.
I don't see that as a problem. It is in a way similar to
32-bit vs. 64-bit builds of Python or the fact that floating point
numbers work differently depending on the Python platform or
compiler being used.
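One way code can account for the build difference is sketched below,
under stated assumptions: sys.maxunicode is 0xFFFF on UCS2 builds and
0x10FFFF on UCS4 builds (and always 0x10FFFF from Python 3.3 on). The
helpers is_narrow_build and count_code_points are hypothetical names; the
counter works on a plain list of 16-bit code units so that it runs the
same way on any interpreter.

```python
import sys

def is_narrow_build():
    """True on a UCS2 build, where non-BMP code points take two units."""
    return sys.maxunicode == 0xFFFF

def count_code_points(units):
    """Count code points in a sequence of 16-bit code units.

    A well-formed surrogate pair counts as one code point; a lone
    surrogate is counted on its own, since it is still a code point.
    """
    count = 0
    i = 0
    n = len(units)
    while i < n:
        is_high = 0xD800 <= units[i] <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            i += 2  # consume both halves of the pair
        else:
            i += 1
        count += 1
    return count

# "A", U+1D11E as a surrogate pair, "B": four code units, three code points.
sample = [0x0041, 0xD834, 0xDD1E, 0x0042]
total = count_code_points(sample)
```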
BTW: Have you ever run into any problems with UCS2 vs. UCS4
in practice that were not easy to solve?
--
Marc-Andre Lemburg
eGenix.com