[Python-3000] String comparison
"Martin v. Löwis"
martin at v.loewis.de
Wed Jun 13 22:37:45 CEST 2007
>> Until one or more of the senior developers says otherwise, I'm going
>> to assume that.
>
> Yeah, what's the difference between code units and points?
A code unit is the smallest unit of storage in some encoding. It is a
single byte in most encodings, but a 16-bit quantity in UTF-16 (and a
32-bit quantity in UTF-32).
A code point is something that has a 1:1 relationship with a logical
character (in particular, a Unicode character).
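
To make the distinction concrete, here is a minimal sketch (Python 3
syntax) that counts how many code units one code point takes in three
encodings. The helper name code_units is made up for illustration, and
the unit sizes are passed in by hand rather than derived from the codec.

```python
def code_units(text, encoding, unit_size):
    """Count the code units needed to encode text.

    unit_size is the width of one code unit in bytes (an assumption the
    caller supplies: 1 for UTF-8, 2 for UTF-16, 4 for UTF-32).
    """
    return len(text.encode(encoding)) // unit_size


ch = "\U00010000"  # one code point, the first one above the BMP

# The explicit -le variants are used so no byte order mark is emitted.
units_utf8 = code_units(ch, "utf-8", 1)
units_utf16 = code_units(ch, "utf-16-le", 2)
units_utf32 = code_units(ch, "utf-32-le", 4)
```

The same single code point needs four code units in UTF-8, two in
UTF-16, and one in UTF-32.
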
In UCS-2, a code point can be represented in 16 bits, and you can
represent all BMP characters. The low and high surrogates don't
encode characters and are reserved.
In UCS-4, a code point may need more than 16 bits. You might encode
such text in UTF-16, for example, where a single code unit is enough
for any BMP character, and two of them (a surrogate pair) are needed
for each code point above U+FFFF.
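
The split into two code units is plain arithmetic. A rough sketch
follows; the function name to_utf16_units is hypothetical, and it does
no validation (it does not reject surrogate values or anything above
U+10FFFF).

```python
def to_utf16_units(cp):
    """Return the UTF-16 code units (as ints) for one code point."""
    if cp < 0x10000:
        # BMP: the code point is its own single code unit.
        return [cp]
    offset = cp - 0x10000             # 20 bits remain
    high = 0xD800 + (offset >> 10)    # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> low surrogate
    return [high, low]


bmp = to_utf16_units(0x0041)         # LATIN CAPITAL LETTER A
non_bmp = to_utf16_units(0x1D11E)    # MUSICAL SYMBOL G CLEF
```

Here bmp is [0x41], and non_bmp is [0xD834, 0xDD1E].
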
Ever since PEP 261, Python admits that the elements of a Unicode
string are code units, and that you might need more than one of
them (specifically, for non-BMP characters in a narrow build)
to represent a code point.
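
As an illustration of what that means for length, the sketch below
counts code points in a sequence of 16-bit code units, which is what
the elements of a narrow-build string amount to. The name
count_code_points is made up, and treating an unpaired surrogate as
one code point is an assumption of this sketch, not something PEP 261
prescribes.

```python
def count_code_points(units):
    """Count code points in a list of UTF-16 code units (ints)."""
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # A high surrogate followed by a low one forms one code point.
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


units = [0x41, 0xD834, 0xDD1E]   # "A" followed by U+1D11E
n_units = len(units)
n_points = count_code_points(units)
```

The sequence has three code units (n_units) but only two code points
(n_points).
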
Regards,
Martin