[Python-3000] String comparison

Thu Jun 14 09:43:55 CEST 2007

Jim Jewett writes:

 > > Apart from the surrogates, are there code points that aren't
 > > characters?

 > Yes.  The BOM mark, for one.

Nitpick: The BOM *is* a character (FEFF, aka ZERO WIDTH NO-BREAK
SPACE).  Its byte-swapped counterpart FFFE is guaranteed *not* to be a
character.  (Martin wrote that correctly.)  FFFF is likewise guaranteed
*not* to be a character; in fact all code points U that are equal to
FFFE or FFFF modulo 0x10000 are guaranteed not to be characters (ie,
the last two code points in each plane).
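
To make the modulo rule concrete, here is a minimal sketch in present-day
Python.  The helper name is_plane_end_noncharacter is made up for
illustration; only unicodedata.name is a real stdlib call.

```python
import unicodedata

def is_plane_end_noncharacter(cp: int) -> bool:
    """True if cp is FFFE or FFFF modulo 0x10000 (hypothetical helper)."""
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFF) >= 0xFFFE

# The BOM itself is an ordinary assigned character with a name.
bom_name = unicodedata.name("\ufeff")

# Collect every code point matched by the modulo rule: two per plane,
# across the 17 planes of the code space.
plane_end = [cp for cp in range(0x110000) if is_plane_end_noncharacter(cp)]
```

With 17 planes, plane_end holds 34 code points, from FFFE up to 10FFFF;
FEFF is not among them.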

 > Plenty of other code points are reserved
 > for private use, or not yet assigned,

Or reserved for use as surrogates, and therefore should never appear
in UTF-8 or UTF-32 streams -- but if they do, AIUI they must be passed
on uninterpreted unless the API explicitly says what it does with them.
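
For what it's worth, a sketch of how this looks in current CPython 3
(an assumption about today's codecs, not about anything that existed
when this was written): the strict UTF-8 codec refuses a lone surrogate,
and the "surrogatepass" error handler is the explicit opt-in that passes
it through uninterpreted.

```python
lone = "\ud800"  # an unpaired high-surrogate code point

# The default (strict) UTF-8 codec rejects surrogate code points.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With "surrogatepass" the code point is encoded and decoded as-is.
raw = lone.encode("utf-8", "surrogatepass")
roundtrip = raw.decode("utf-8", "surrogatepass")
```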

 > or never will be assigned.  There are also some that are explicitly
 > not characters.  (U+FD00..U+FDEF),

Guaranteed not to be assigned == not a character.  The special range
of non-characters is quite a bit smaller, FDD0..FDEF.
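
A small sketch of that range check, assuming a Python with the
unicodedata module; NONCHAR_BLOCK and in_noncharacter_block are names
invented here for illustration.

```python
import unicodedata

# The contiguous noncharacter block FDD0..FDEF (end of range is exclusive).
NONCHAR_BLOCK = range(0xFDD0, 0xFDF0)

def in_noncharacter_block(cp: int) -> bool:
    """True if cp falls in FDD0..FDEF (hypothetical helper)."""
    return cp in NONCHAR_BLOCK

# General categories as reported by the Unicode database shipped with Python.
fd00_category = unicodedata.category("\ufd00")  # outside the block
fdd0_category = unicodedata.category("\ufdd0")  # inside the block
```

The block is 32 code points long, and FD00 falls outside it.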

 > and some that might be debatable (unprinted control
 > characters, or U+FFFC: OBJECT REPLACEMENT CHARACTER)

Not a good idea to classify this way.  Those *are* characters, and a
process may interpret them or not.  Python (the language and the
stdlib, except where it explicitly says otherwise) definitely should
*not* worry about these things.  They're characters, that's the most
Python needs to know.
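
As an illustration only (a sketch using the stdlib unicodedata module),
these code points are ordinary str elements with an assigned general
category; what to do with them is up to the application.

```python
import unicodedata

bell = "\x07"        # an unprinted control character
obj_repl = "\ufffc"  # OBJECT REPLACEMENT CHARACTER

# Both are assigned characters: neither has the unassigned category "Cn".
bell_category = unicodedata.category(bell)
obj_category = unicodedata.category(obj_repl)

# A str holding them has one element per code point, nothing special.
sample = "a" + bell + obj_repl + "b"
sample_length = len(sample)
```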

 > > Are there characters that don't have a representation as a
 > > single code point? (I know some characters have multiple
 > > representations, some of which use multiple code points.)

Not a question that can be answered without reference to a specific
application.  An application may treat each code point as a character,
or it may choose to compose sequences of code points into single units
(eg, by mapping them into private use space).

The most Python might want to do is deal with canonical equivalence,
but even then there are issues, such as the ö in the English word
coördinate.  I would consider the diaeresis as a separate diacritic
(meaning "don't pronounce as 'oo', pronounce as 'oh-oh'"), not a
component of a single character.  There may be even clearer examples.
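
A minimal sketch of what dealing with canonical equivalence could look
like, using unicodedata.normalize from the stdlib.  Comparing under NFC
is one possible policy, not a claim about what str comparison does or
should do.

```python
import unicodedata

composed = "co\u00f6rdinate"     # ö as the single code point U+00F6
decomposed = "coo\u0308rdinate"  # o followed by U+0308 COMBINING DIAERESIS

# Plain comparison is code point by code point, so these differ.
same_raw = composed == decomposed

# After normalizing both to NFC they compare equal.
same_nfc = (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", decomposed))
```

Here same_raw is False and same_nfc is True.  Normalization cannot tell
whether the diaeresis was meant as a separate diacritic; it simply
composes it.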

 > There are also plenty of things that a native speaker may view as a
 > single character, but which unicode treats as (at most) a Named
 > Sequence.

Eg, the New Line Function (Unicode's name for "universal newline"),
which can be any of the usual suspects (CR, LF, CRLF) depending on
context.
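
For illustration, a sketch of how a current Python str handles these
line endings via str.splitlines (behavior of today's CPython 3, given
here as an assumption rather than as part of the original discussion).

```python
# CR, LF and CRLF all end a line; CRLF counts as one boundary, not two.
text = "one\rtwo\nthree\r\nfour"
lines = text.splitlines()

# str.splitlines also recognizes U+0085 NEXT LINE as a line boundary.
nel_lines = "a\x85b".splitlines()
```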