[Python-3000] String comparison

"Martin v. Löwis" martin at v.loewis.de
Wed Jun 13 22:37:45 CEST 2007


>> Until one or more of the senior developers says otherwise, I'm going
>> to assume that.
> 
> Yeah, what's the difference between code units and points?

A code unit is the smallest unit of storage in a given encoding. It
is a single byte in most encodings, but a 16-bit quantity in UTF-16
(and a 32-bit quantity in UTF-32).
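
As a rough illustration of that definition, the sketch below counts
code units for one string in three encodings. The helper name
code_unit_count and the sample string are made up for this example;
only str.encode and len are real Python calls.

```python
def code_unit_count(text, encoding, unit_size):
    """Count code units by encoding and dividing by the unit size in bytes.

    Hypothetical helper for illustration; unit_size must match the
    encoding (1 for UTF-8, 2 for UTF-16, 4 for UTF-32).
    """
    data = text.encode(encoding)
    return len(data) // unit_size


# 'a', o-umlaut, euro sign, and a musical G clef (outside the BMP).
sample = "a\u00f6\u20ac\U0001d11e"

# The "-le" codec variants are used so that no byte order mark is added.
utf8_units = code_unit_count(sample, "utf-8", 1)       # 1 + 2 + 3 + 4 bytes
utf16_units = code_unit_count(sample, "utf-16-le", 2)  # 1 + 1 + 1 + 2 units
utf32_units = code_unit_count(sample, "utf-32-le", 4)  # one unit per character

print(utf8_units, utf16_units, utf32_units)
```

The same four characters take a different number of code units in
each encoding, which is why the element count of a string depends on
how it is stored.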

A code point is a number that has a 1:1 relationship with a logical
character (in particular, a Unicode character).

In UCS-2, a code point can be represented in 16 bits, and you can
represent all BMP characters. The low and high surrogates don't
encode characters and are reserved.

In UCS-4, a code point can need more than 16 bits. One way to
represent it is UTF-16, where a single code unit covers every BMP
character and two code units (a surrogate pair) cover each code
point above U+FFFF.
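
A minimal sketch of how a code point maps to UTF-16 code units,
following the standard surrogate-pair arithmetic. The function name
to_utf16_code_units is invented for this example and is not part of
any Python API.

```python
def to_utf16_code_units(code_point):
    """Return the UTF-16 code units (as ints) for one code point.

    Illustrative helper only; real code would just call str.encode.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates do not encode characters")
    if code_point <= 0xFFFF:
        # BMP character: a single 16-bit code unit.
        return [code_point]
    # Above U+FFFF: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


bmp_units = to_utf16_code_units(0x20AC)       # euro sign, inside the BMP
non_bmp_units = to_utf16_code_units(0x1D11E)  # musical G clef, above U+FFFF

print([hex(u) for u in bmp_units])
print([hex(u) for u in non_bmp_units])
```

The result can be cross-checked against Python's own codec with
"\U0001d11e".encode("utf-16-be").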

Ever since PEP 261, Python admits that the elements of a Unicode
string are code units, and that you might need more than one of
them (specifically, for non-BMP characters in a narrow build)
to represent a code point.
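
To make the narrow-build case concrete, here is a sketch that
reproduces the length a narrow build would report. It assumes a
Python 3.3 or later interpreter, where the narrow/wide distinction
no longer exists and len() counts code points, so the narrow-build
value is emulated by counting UTF-16 code units. The name
narrow_build_len is hypothetical.

```python
def narrow_build_len(text):
    """Emulate len() on a narrow build by counting UTF-16 code units.

    Hypothetical helper; assumes text contains no lone surrogates.
    """
    return len(text.encode("utf-16-le")) // 2


clef = "\U0001d11e"  # one non-BMP character

code_points = len(clef)              # code points, on Python 3.3+
code_units = narrow_build_len(clef)  # what a narrow build would count

print(code_points, code_units)
```

For strings that stay inside the BMP the two numbers agree; they only
diverge once a non-BMP character appears.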

Regards,
Martin

