[Python-3000] String comparison

Wed Jun 13 23:05:21 CEST 2007

On 6/13/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> >> Until one or more of the senior developers says otherwise, I'm going
> >> to assume that.
> >
> > Yeah, what's the difference between code units and points?
>
> A code unit is the atomic base in some encoding. It is a single byte
> in most encodings, but a 16-bit quantity in UTF-16 (and a 32-bit
> quantity in UTF-32).
>
> A code point is something that has a 1:1 relationship with a logical
> character (in particular, a Unicode character).
>
> In UCS-2, a code point can be represented in 16 bits, and you can
> represent all BMP characters. The low and high surrogates don't
> encode characters and are reserved.
>
> In UCS-4, you need more than 16 bits to represent a code point.
> For example, you might use UTF-16, where you can use a single
> code unit for all BMP characters, and two of them for code points
> above U+FFFF.
>
> Ever since PEP 261, Python admits that the elements of a Unicode
> string are code units, and that you might need more than one of
> them (specifically, for non-BMP characters in a narrow build)
> to represent a code point.
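
A minimal sketch of the distinction above, assuming a current Python 3 interpreter (3.3 or later, where len() of a str counts code points regardless of build). The helper name code_units is hypothetical and is not part of any Python API; it just divides the encoded byte length by the size of one code unit.

```python
# Size in bytes of one code unit for each encoding used below.
UNIT_SIZE = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}

def code_units(text, encoding):
    """Hypothetical helper: number of code units `text` needs in `encoding`."""
    return len(text.encode(encoding)) // UNIT_SIZE[encoding]

# One BMP character followed by one non-BMP character (U+10400).
s = "a\U00010400"

print(len(s))                      # code points: 2
print(code_units(s, "utf-8"))      # 1 + 4 = 5 one-byte units
print(code_units(s, "utf-16-le"))  # 1 + 2 = 3 sixteen-bit units
print(code_units(s, "utf-32-le"))  # 1 + 1 = 2 thirty-two-bit units
```

The explicit "-le" codec names are used so that no byte order mark is added to the encoded output, which would otherwise inflate the count by one unit.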

Thanks for clearing that up. It sounds like we really use code units,
not code points (except when building with the 4-byte Unicode option,
when they are equivalent). Is there anywhere where we use code points,
apart from the UTF-8 codecs, which encode properly matched surrogate
pairs as a single code point?

Is it correct to say that a surrogate pair in UTF-16 is two code units
representing a single code point?
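
For reference, the arithmetic behind a surrogate pair can be sketched as follows. This is an illustration only, not how any codec is actually implemented; the function name to_surrogates is hypothetical, and the result is checked against Python's own utf-16-le codec.

```python
import struct

def to_surrogates(code_point):
    """Hypothetical helper: split a non-BMP code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000       # 20 significant bits remain
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

high, low = to_surrogates(0x10400)

# Compare with the two 16-bit code units produced by the built-in codec.
units = struct.unpack("<2H", "\U00010400".encode("utf-16-le"))

print(hex(high), hex(low))  # 0xd801 0xdc00
print(units == (high, low)) # True
```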

Apart from the surrogates, are there code points that aren't
characters? Are there characters that don't have a representation as a
single code point? (I know some characters have multiple
representations, some of which use multiple code points.)
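
The parenthetical about multiple representations can be illustrated with the standard unicodedata module. This is a small sketch using "é" as the example character; it shows only that one character can be spelled with one code point or with two, and does not try to answer the broader questions above.

```python
import unicodedata

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT, two code points

print(len(composed), len(decomposed))  # 1 2
print(composed == decomposed)          # False: different code point sequences

# Normalization maps between the two spellings.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```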

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)