[Python-Dev] UCS2/UCS4 default

Guido van Rossum guido at python.org
Fri Jul 4 00:21:46 CEST 2008


On Thu, Jul 3, 2008 at 3:00 PM, Adam Olsen <rhamph at gmail.com> wrote:
> On Thu, Jul 3, 2008 at 3:01 PM, Terry Reedy <tjreedy at udel.edu> wrote:
>>
>> The premise is the OP's idea that Python should switch to all UCS4 to create
>> a more pure ('ideal') situation or the idea that len(s) should count
>> codepoints (correct term?) for all builds as a matter of purity even though
>> it would be time-costly on 16-bit builds as a matter of practicality.
>
> Wrong term - code units and code points are equivalent in UTF-16 and
> UTF-32.  What you're looking for is unicode scalar values.

I don't think so. I have in my lap the Unicode 5.0 standard, which on
page 102, under UTF-16, states (amongst others):

"""
* In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is
represented as <004D 0430 4E8C D800 DF02>, where <D800 DF02>
corresponds to U+10302.

* Because surrogate code points are not Unicode scalar values,
isolated UTF-16 code units in the range D800[16]..DFFF[16] are
ill-formed.
"""

From this I understand they distinguish carefully between code points
and code units -- D800 is a code unit but not a code point, 10302 is a
code point but not a (UTF-16) code unit.
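
To make the distinction concrete, here is a rough sketch (Python 3
assumed) of how a code point maps to UTF-16 code units. The helper
name utf16_code_units is made up for illustration; it is not part of
Python or of the standard.

```python
def utf16_code_units(code_point):
    """Return the list of UTF-16 code units for one Unicode scalar value."""
    # Surrogates and out-of-range integers are not scalar values.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %X" % code_point)
    if code_point < 0x10000:
        # BMP code points take exactly one 16-bit code unit.
        return [code_point]
    # Supplementary code points are split into a surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return [high, low]


# The code point sequence from the standard's UTF-16 example.
code_points = [0x004D, 0x0430, 0x4E8C, 0x10302]
units = [u for cp in code_points for u in utf16_code_units(cp)]
print(" ".join("%04X" % u for u in units))
```

Four code points go in and five code units come out, with U+10302
becoming the pair D800 DF02.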

OTOH outside the context of UTF-16, the surrogates are also referred to
as "reserved code points" (e.g. in Table 2-3 on page 27, "Types of
Code Points").

I think the best thing we can do is to use "code points" to refer to
characters and "code units" to the individual 16-bit values in the
UTF-16 encoding; this seems compatible with usage elsewhere in this
thread by most folks.

Also see http://unicode.org/glossary/:

"""
Code Point. Any value in the Unicode codespace; that is, the range of
integers from 0 to 10FFFF[16]. (See definition D10 in Section 3.4,
Characters and Encoding.)
.
.
.
Code Unit. The minimal bit combination that can represent a unit of
encoded text for processing or interchange. The Unicode Standard uses
8-bit code units in the UTF-8 encoding form, 16-bit code units in the
UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding
form. (See definition D77 in Section 3.9, Unicode Encoding Forms.)
"""

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

