[Python-Dev] UCS2/UCS4 default

Sat Jul 5 01:20:34 CEST 2008

Martin v. Löwis <martin <at> v.loewis.de> writes:

> 
> > Wrong term - code units and code points are equivalent in UTF-16 and
> > UTF-32.  What you're looking for is unicode scalar values.
> 
> How so? Section 2.5, UTF-16 says
> 
> "code points in the supplementary planes, in the range
> U+10000..U+10FFFF, are represented as pairs of 16-bit code units."
> 
> So clearly, code points in Unicode range from U+0000..U+10FFFF,
> independent of encoding form.
> 
> In UTF-16, code units range from 0..65535.
> 
> OTOH, "unicode scalar value" is nearly synonymous to "code point":
> 
> D76 Unicode Scalar Value. Any Unicode  code point except high-surrogate
> and low-surrogate code points.
> 
> So codepoint in Terry's message was the right term.
> 

No, Terry definitely did mean Unicode scalar values. He was describing the "pure"
but impractical len() that would count a surrogate pair as 1, not 2, even in
the 32-bit builds.

For what it is worth:
Code point: a number between 0 and 1114111 (0x10FFFF), inclusive.
Scalar value: a code point, except the surrogate code points (0xD800 to 0xDFFF).
Code unit: the basic unit of an encoding form. One code unit is always
sufficient to encode some Unicode scalar values. However, other Unicode scalar
values may require multiple code units.

Note that a scalar value is a code point. A code point may or may not be a
scalar value. 
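
To make the three definitions concrete, here is a rough sketch in Python 3
syntax. The function names are made up for illustration and are not part of
any Python API; the byte widths in the table are the code unit sizes of the
three encoding forms.

```python
def is_code_point(n):
    # Code point: any integer from 0 to 0x10FFFF (1114111), inclusive.
    return 0 <= n <= 0x10FFFF

def is_scalar_value(n):
    # Scalar value: a code point that is not a surrogate code point.
    return is_code_point(n) and not (0xD800 <= n <= 0xDFFF)

# Size in bytes of one code unit for each encoding form (illustrative table).
CODE_UNIT_BYTES = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}

def code_units_needed(n, encoding):
    # How many code units the given encoding form uses for scalar value n.
    if not is_scalar_value(n):
        raise ValueError("only Unicode scalar values can be encoded")
    return len(chr(n).encode(encoding)) // CODE_UNIT_BYTES[encoding]
```

For example, U+0041 needs one code unit in every form, while U+1D11E needs
four code units in UTF-8, two in UTF-16 and one in UTF-32.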

Practical len() returns the number of code units of the internal storage format.
Pure len() allegedly would return the number of Unicode scalar values (a
surrogate pair would count as the single Unicode scalar value it encodes).
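
A minimal sketch of the two counts, assuming a UTF-16 internal storage format
and Python 3 syntax. The helper names (utf16_code_units, practical_len,
pure_len) are hypothetical and only illustrate the distinction; they are not
how any Python build implements len().

```python
import struct

def utf16_code_units(s):
    # Return the UTF-16 code units of s as a list of integers.
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def practical_len(units):
    # "Practical" len(): simply the number of code units stored.
    return len(units)

def pure_len(units):
    # "Pure" len(): a high surrogate followed by a low surrogate counts once.
    # An unpaired surrogate is still counted as one item here, although
    # strictly speaking it is not a scalar value.
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count

units = utf16_code_units("a\U0001D11E")
```

Here units holds three code units (one for "a", two for the surrogate pair
that encodes U+1D11E), so practical_len(units) is 3 while pure_len(units) is 2.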

Please keep in mind that encodings encode Unicode scalar values. Thus a UTF-8
code unit sequence (or UTF-32 code unit) that would decode to a code point in
the surrogate range is technically in error. (Python would do well to ignore
this restriction, though, as there may be valid reasons to have a UTF-8
sequence that is not a valid encoded Unicode text sequence.)
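
For illustration only, and assuming a current CPython 3 interpreter (which
postdates this thread): the strict UTF-8 codec there rejects a surrogate code
point, while the "surrogatepass" error handler lets such a sequence through
for the cases where it is wanted. The variable names are just for this sketch.

```python
lone = "\ud800"  # a surrogate code point, not a Unicode scalar value

# Strict encoding refuses, since only scalar values can be encoded.
try:
    lone.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# Opting out of the restriction yields the three-byte form of U+D800.
raw = lone.encode("utf-8", "surrogatepass")

# Strict decoding of those bytes is likewise an error...
try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# ...but the same handler allows the round trip.
round_trip = raw.decode("utf-8", "surrogatepass")
```

So the default follows the standard, and ignoring the restriction is an
explicit choice by the caller.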