[Python-3000] String comparison

Jim Jewett jimjjewett at gmail.com
Thu Jun 14 00:23:24 CEST 2007


On 6/13/07, Guido van Rossum <guido at python.org> wrote:
> On 6/13/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:

> > A code point is something that has a 1:1 relationship with a logical
> > character (in particular, a Unicode character).

and

> > A code unit is the atomic base in some encoding. It is a single byte
> > in most encodings, but a 16-bit quantity in UTF-16 (and a 32-bit
> > quantity in UTF-32).
...

> Is it correct to say that a surrogate in UTF-16 is two code units
> representing a single code point?

Basically yes, assuming you meant both halves of the surrogate pair
taken together.  "A surrogate" often refers to only one of them.
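
To make the code unit / code point split concrete, here is a minimal
sketch.  It assumes Python 3.3 or later, where a str is indexed by code
point; the variable names and the is_surrogate helper are mine, purely
for illustration.

```python
# Assumption: Python 3.3+, where len() of a str counts code points.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a code point outside the BMP

# Encode without a BOM, then split the bytes into 16-bit code units.
utf16 = clef.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]


def is_surrogate(unit):
    """Hypothetical helper: True if a 16-bit unit is in the surrogate range."""
    return 0xD800 <= unit <= 0xDFFF


print(len(clef))                             # 1 code point
print([hex(u) for u in units])               # ['0xd834', '0xdd1e']
print(all(is_surrogate(u) for u in units))   # True: each half is "a surrogate"
```

So the pair as a whole stands for one code point, while each 16-bit
half is a surrogate in its own right.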

> Apart from the surrogates, are there code points that aren't
> characters?

Yes.  The BOM (byte order mark), for one.  Plenty of other code points
are reserved for private use, or not yet assigned, or never will be
assigned.  There are also some that are explicitly not characters
(U+FDD0..U+FDEF), and some that might be debatable (unprinted control
characters, or U+FFFC: OBJECT REPLACEMENT CHARACTER).
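
One rough way to see these distinctions is through the general
category that the Unicode database assigns to each code point.  The
sketch below uses Python's unicodedata module; the sample code points
and labels are my own picks, and the category is only a proxy for "is
this a character", not a definition of it.

```python
import unicodedata

# Illustrative sample of code points that are arguably not "characters".
samples = [
    (0xD800, "high surrogate"),
    (0xE000, "private use"),
    (0xFDD0, "noncharacter"),
    (0xFEFF, "byte order mark"),
    (0xFFFC, "object replacement character"),
    (0x0007, "control character (BEL)"),
]

# Look up the two-letter general category for each code point.
categories = {cp: unicodedata.category(chr(cp)) for cp, _ in samples}

for cp, label in samples:
    print(f"U+{cp:04X}  {categories[cp]}  {label}")
```

Surrogates report Cs, private use Co, and the noncharacter reports Cn,
the same category used for unassigned code points.  The BOM is a format
character (Cf), which is part of why its status is arguable.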

> Are there characters that don't have a representation as a
> single code point? (I know some characters have multiple
> representations, some of which use multiple code points.)

There are plenty of (mostly archaic?) characters which don't (yet?)
have an assigned Unicode code point.

There are also plenty of things that a native speaker may view as a
single character, but which Unicode treats as (at most) a Named
Sequence.
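
As a small illustration, and only a sketch: "r with tilde" has no code
point of its own, so normalization cannot collapse it, whereas "e with
acute" can be composed into U+00E9.  This assumes Python 3.3 or later,
where unicodedata.lookup() also accepts named sequences; the variable
names are mine.

```python
import unicodedata

# "r" followed by COMBINING TILDE: one perceived character, two code points.
r_tilde = "r\u0303"
r_composed = unicodedata.normalize("NFC", r_tilde)
print(len(r_composed))   # still 2: no precomposed code point exists

# "e" followed by COMBINING ACUTE ACCENT does have a precomposed form.
e_composed = unicodedata.normalize("NFC", "e\u0301")
print(len(e_composed))   # 1: composed into U+00E9

# Assumption: lookup() resolves named sequences (Python 3.3+).
named = unicodedata.lookup("LATIN SMALL LETTER R WITH TILDE")
print(named == r_tilde)  # the name maps to the two-code-point sequence
```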

-jJ

