[Python-Dev] UCS2/UCS4 default

Thu Jul 3 20:50:57 CEST 2008

On Thu, Jul 3, 2008 at 11:35 AM, Jeroen Ruigrok van der Werven
<asmodai at in-nomine.org> wrote:
> -On [20080703 19:21], Adam Olsen (rhamph at gmail.com) wrote:
>>On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg <mal at egenix.com> wrote:
>>> Please remember that lone surrogate pair code points are perfectly
>>> valid Unicode code points, nevertheless. Just as a lone combining
>>> code point is valid on its own.
>>
>>That is a big part of these problems.  For all practical purposes, a
>>surrogate is like a UTF-8 code unit, and must be handled the same way,
>>so why the heck do they confuse everybody by saying "oh, it's a code
>>point too!"?
>
> Because surrogate code points are not Unicode scalar values, isolated UTF-16
> code units in the range 0xd800-0xdfff are ill-formed. (D91 from Unicode
> 5.0/5.1, section 3.9)
>
> So, no, it is not a code point too.

UTF-16
D91 UTF-16 encoding form: The Unicode encoding form that assigns each
Unicode scalar
     value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned
     16-bit code unit with the same numeric value as the Unicode
scalar value, and that
     assigns each Unicode scalar value in the range U+10000..U+10FFFF
to a surrogate
     pair, according to Table 3-5.
   • In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is represented
     as <004D 0430 4E8C D800 DF02>, where <D800 DF02> corresponds to
     U+10302.
   • Because surrogate code points are not Unicode scalar values,
isolated UTF-16
     code units in the range D80016..DFFF16 are ill-formed.

In the context of UTF-8 or UTF-32, a Unicode scalar value is a single
code point of a valid character (more or less) and a code unit is the
base unit (1 and 4 bytes respectively) of which 1 or more combine to
form a code point.  In UTF-16, code point becomes synonymous with code
unit and Unicode scalar value becomes one or more code points.  WTF?

-- 
Adam Olsen, aka Rhamphoryncus