[Python-Dev] UCS2/UCS4 default

Jeroen Ruigrok van der Werven asmodai at in-nomine.org
Thu Jul 3 15:21:46 CEST 2008


-On [20080703 15:00], M.-A. Lemburg (mal at egenix.com) wrote:
>Unicode if full of combining code points - if you break such a sequence,
>the output will be just as wrong; regardless of UCS2 vs. UCS4.

In my opinion you are confusing two related, but very separated things here.
Combining characters have nothing to do with breaking up the encoding of a
single codepoint. Sure enough, if you arbitrary slice up codepoints that
consist of combining characters then your result is indeed odd looking.

I never said that nor is that the point I am making.

Guido points out that Python supports surrogate pairs and says that if
Python is dealing wrongly with this in the core than it needs to be fixed.
I am pointing out that given the fact we allow surrogate pairs we deal
rather simplistic with it in the core. In fact, we do not consider them at
all. In essence: though we may accept full 21-bit codepoints in the form of
\U00000000 escape sequences and store them internally as UTF-16 (which I
still need to verify) we subsequently deal with them programmatically as
UCS-2, which is plain silly.

You either commit yourself fully to UTF-16 and surrogate pairs or not. Not
some form in-between, because that will ultimately lead to more confusion
due to the difference in results when dealing with Unicode.

-- 
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Believe in Angels...


More information about the Python-Dev mailing list