[Python-3000] String comparison

Rauli Ruohonen rauli.ruohonen at gmail.com
Fri Jun 8 15:38:13 CEST 2007


On 6/8/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> AFAIK, the only strings the Unicode standard absolutely prohibits
> emitting are those containing code points guaranteed not to be
> characters by the standard.

The ones it absolutely prohibits in interchange are surrogates. They
are also illegal in both UTF-16 and UTF-8. The pragmatic reason is that
if you encode them anyway (as Python's codecs do), strings won't always
survive a round-trip through such pseudo-UTF-16, because multiple code
point sequences necessarily map to the same byte sequence. For some
reason Python's UTF-8 encoder introduces the same ambiguity, even
though pseudo-UTF-8 has no need for it.
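Concretely, the ambiguity looks like this (a sketch; where the codec is
strict, the errors='surrogatepass' handler stands in for the lenient
encoding described above):

pair = '\ud800\udc00'    # two code points: a high and a low surrogate
astral = '\U00010000'    # one code point outside the BMP

# Two distinct code point sequences encode to the same bytes:
b1 = pair.encode('utf-16-le', 'surrogatepass')
b2 = astral.encode('utf-16-le')
assert b1 == b2 == b'\x00\xd8\x00\xdc'

# ...so the round trip cannot restore the original:
assert b1.decode('utf-16-le') == astral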

On Python UCS-2 builds, even core string processing handles surrogates
inconsistently. Sometimes pseudo-UCS-2 is assumed, sometimes
pseudo-UTF-16, and the two are incompatible: pseudo-UTF-16 can't always
represent lone surrogates, but pseudo-UCS-2 can, and conversely
pseudo-UCS-2 can't represent code points outside the BMP, but
pseudo-UTF-16 can. As long as the two are mixed there is no way to
always do the right thing, but somebody somewhere probably depends on
the current behavior.
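
To make the incompatibility concrete, here is a sketch of the two
readings of the same pair of 16-bit code units:

import struct

units = [0xD800, 0xDC00]            # two 16-bit code units
data = struct.pack('<2H', *units)

# Pseudo-UTF-16 reading: the pair combines into one astral code point.
as_utf16 = data.decode('utf-16-le')
assert len(as_utf16) == 1 and ord(as_utf16) == 0x10000

# Pseudo-UCS-2 reading: each unit is its own code point, i.e. two lone
# surrogates -- a string pseudo-UTF-16 cannot represent at all.
as_ucs2 = ''.join(map(chr, units))
assert len(as_ucs2) == 2 and [ord(c) for c in as_ucs2] == units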

Other than surrogates, there are two classes of characters whose
interchange is "restricted". One is reserved characters, which must be
preserved when found in text, for compatibility with future versions of
the standard. The other is noncharacters, which are "reserved for
internal use, such as for sentinel values". These should obviously be
allowed, as users may want to use them internally in their Python
programs.
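
For instance (a sketch; U+FFFF is one of the noncharacters, and the
codecs pass noncharacters through, since they are valid scalar values):

SENTINEL = '\uffff'   # U+FFFF: a noncharacter, reserved for internal use

# A delimiter that conformant interchanged text will never contain:
record = 'spam' + SENTINEL + 'eggs'
assert record.split(SENTINEL) == ['spam', 'eggs']

# Noncharacters are valid scalar values, so they round-trip fine:
assert SENTINEL.encode('utf-8').decode('utf-8') == SENTINEL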

> So there's nothing "wrong by definition" about defining strings as
> sequences of code points, and string operations in code-point-wise
> fashion. It just makes that library for Unicode more expensive to
> design and operate, and will require auditing and reimplementation of
> common libraries (including the standard library) by every program
> that requires strict Unicode conformance.

It's not perfect, but that's the state of the art. AFAIK this (or worse)
is what the other implementations do. Even the Unicode standard
explains that strings generally work that way:

  2.7. Unicode Strings

  A Unicode string datatype is simply an ordered sequence of code
  units. Thus a Unicode 8-bit string is an ordered sequence of
  8-bit code units, a Unicode 16-bit string is an ordered sequence
  of 16-bit code units, and a Unicode 32-bit string is an ordered
  sequence of 32-bit code units.

  Depending on the programming environment, a Unicode string may or
  may not also be required to be in the corresponding Unicode encoding
  form. For example, strings in Java, C#, or ECMAScript are Unicode
  16-bit strings, but are not necessarily well-formed UTF-16 sequences.
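
Python's str fits the same description (a sketch, assuming a build
whose codecs are strict about lone surrogates): it is a sequence of
arbitrary code points, not necessarily a well-formed encoding form.

s = '\ud800' + 'a' + '\U0010ffff'     # any code points, in any order
assert [ord(c) for c in s] == [0xD800, 0x61, 0x10FFFF]

# The string itself is fine; only a strict codec objects, since a lone
# surrogate has no well-formed UTF-8/16/32 encoding:
try:
    s.encode('utf-8')
except UnicodeEncodeError:
    pass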

