[Python-Dev] UCS2/UCS4 default

Adam Olsen rhamph at gmail.com
Thu Jul 3 19:21:24 CEST 2008


On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg <mal at egenix.com> wrote:
> On 2008-07-03 15:21, Jeroen Ruigrok van der Werven wrote:
>>
>> -On [20080703 15:00], M.-A. Lemburg (mal at egenix.com) wrote:
>>>
>>> Unicode is full of combining code points - if you break such a sequence,
>>> the output will be just as wrong; regardless of UCS2 vs. UCS4.
>>
>> In my opinion you are confusing two related, but very separate things
>> here.
>> Combining characters have nothing to do with breaking up the encoding of a
>> single codepoint. Sure enough, if you arbitrarily slice up codepoints that
>> consist of combining characters then your result is indeed odd looking.
>>
>> I never said that nor is that the point I am making.
>
> Please remember that lone surrogate pair code points are perfectly
> valid Unicode code points, nevertheless. Just as a lone combining
> code point is valid on its own.

That is a big part of these problems.  For all practical purposes, a
surrogate is like a UTF-8 code unit, and must be handled the same way,
so why the heck do they confuse everybody by saying "oh, it's a code
point too!"?
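
To make the comparison concrete, here is a rough sketch, assuming
Python 3 (where str holds code points regardless of build). The
character U+1D11E and the helper decodes() are just my picks for
illustration, nothing from the thread.

```python
import struct

# U+1D11E (MUSICAL SYMBOL G CLEF): an arbitrary non-BMP character.
ch = "\U0001D11E"

# UTF-16: one code point becomes two 16-bit code units (a surrogate pair).
utf16 = ch.encode("utf-16-be")
units16 = struct.unpack(">2H", utf16)

# UTF-8: the same code point becomes four 8-bit code units.
utf8 = ch.encode("utf-8")
units8 = tuple(utf8)


def decodes(data, encoding):
    """Hypothetical helper: True if data is a complete, valid sequence."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Cutting either encoded sequence in half leaves something undecodable,
# which is the sense in which a surrogate behaves like a UTF-8 code unit.
half16_ok = decodes(utf16[:2], "utf-16-be")
half8_ok = decodes(utf8[:2], "utf-8")

# And yet the same lone surrogate value is accepted as a code point.
lone = chr(0xD834)
lone_is_codepoint = len(lone) == 1 and ord(lone) == 0xD834
```

Both halves fail to decode on their own, but only the surrogate also
gets to be a one-element string, which is the inconsistency I'm
complaining about.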


-- 
Adam Olsen, aka Rhamphoryncus

