[Python-3000] String comparison

"Martin v. Löwis" martin at v.loewis.de
Thu Jun 14 00:18:25 CEST 2007


> Thanks for clearing that up. It sounds like we really use code units,
> not code points (except when building with the 4-byte Unicode option,
> when they are equivalent). Is there anywhere where we use code points,
> apart from the UTF-8 codecs, which encode properly matched surrogate
> pairs as a single code point?

The literal syntax also supports it: \U00010000 is supported even
in a narrow build, and gets transparently encoded to the corresponding
two code units; likewise for repr(). There is an SF patch to make
unicodedata.lookup support them as well.
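
To illustrate the "two code units" point, here is a minimal sketch of how
a supplementary code point maps to a surrogate pair. The helper names
(utf16_code_units, surrogate_pair) are made up for this example and are
not part of any Python API. It assumes a current CPython, where narrow
builds no longer exist (since 3.3), so it goes through the UTF-16 codec
to see what a narrow build would have stored.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")  # little-endian, no BOM prepended
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


units = utf16_code_units("\U00010000")
print([hex(u) for u in units])                    # two code units
print([hex(u) for u in surrogate_pair(0x10000)])  # same pair, computed by hand
```

Both lines should print the same pair, 0xd800 and 0xdc00: one code point,
two code units.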

> Is it correct to say that a surrogate pair in UTF-16 is two code units
> representing a single code point?

That's my understanding, yes.

> Apart from the surrogates, are there code points that aren't
> characters? Are there characters that don't have a representation as a
> single code point? (I know some characters have multiple
> representations, some of which use multiple code points.)

[assuming you mean "code unit" again]
Not in the Unicode type, no. In the byte string type, this happens
all the time with multi-byte encodings.

[assuming you really mean "code point" in the first question]
There are numerous unassigned code points in Unicode, i.e. they
don't represent a character *yet*. There are also several code
points that are "noncharacters", in particular U+FFFE and
U+FFFF. These are permanently reserved and should never be
interpreted as abstract characters (rule C5). FFFE is reserved
because it is the byte-swapped BOM; I believe FFFF is reserved
so that APIs can use -1 as an error value. (FWIW, U+FFFD *is*
assigned and means "REPLACEMENT CHARACTER", �).
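
A rough way to check this from Python, as a sketch only: the
is_noncharacter helper below is hypothetical (the unicodedata module
has no such function). It assumes the Unicode standard's full set of
66 noncharacters, i.e. U+FDD0..U+FDEF plus the last two code points of
every plane, which goes beyond the two mentioned above.

```python
import unicodedata


def is_noncharacter(code_point):
    """True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in any plane."""
    return (0xFDD0 <= code_point <= 0xFDEF
            or (code_point & 0xFFFE) == 0xFFFE)


for cp in (0xFFFD, 0xFFFE, 0xFFFF):
    ch = chr(cp)
    print("U+%04X" % cp,
          unicodedata.category(ch),             # "Cn" means not assigned
          unicodedata.name(ch, "<no name>"),    # default used when unnamed
          is_noncharacter(cp))
```

U+FFFD comes back with the name REPLACEMENT CHARACTER, while U+FFFE and
U+FFFF have no name and report category Cn, the same category that
unassigned code points get.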

As for "combining characters": I think the Unicode terminology
really is that they are separate characters. They get combined
into a single grapheme, and different character sequences might
be considered as equivalent under canonical forms - but the
decomposed ö (o + combining diaeresis) actually is understood
as a two-character (i.e. two-codepoint) sequence.
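
For what it's worth, this is easy to see with the unicodedata module.
The sketch below assumes Python 3 string literals; it shows that the
decomposed and precomposed forms differ in length and compare unequal
until one of them is normalized, which matters for string comparison.

```python
import unicodedata

decomposed = "o\u0308"   # LATIN SMALL LETTER O + COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))   # number of code points in each form
print(decomposed == composed)           # plain comparison, no normalization
print(unicodedata.normalize("NFD", composed) == decomposed)

# A nonzero canonical combining class marks a combining character.
print(unicodedata.combining("\u0308"))
```

The decomposed form has two code points and the NFC form has one
(U+00F6), so the plain comparison is False even though both render as
the same grapheme.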

Whether that matches the intuitive definition of "character",
I don't know - and I'm sure somebody will correct me if I
presented it incorrectly.

Regards,
Martin

