[Python-3000] String comparison

Stephen J. Turnbull stephen at xemacs.org
Sat Jun 9 06:33:07 CEST 2007


Rauli Ruohonen writes:

 > The ones it absolutely prohibits in interchange are surrogates.

Excuse me?  Surrogates are code points with a specific interpretation
if it is "purported that the stream is in UTF-16".  Otherwise, Unicode
4.0 explicitly says that there is nothing illegal about an isolated
surrogate (p.75, where an example is given of how such a surrogate
might occur).  That surrogate may not be interpreted as an abstract
character (C4, p.58), but it is not a non-character (Table 2-2, p.25).
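
As a rough illustration of the distinction between surrogates and non-characters, here is a minimal sketch. It assumes a modern CPython 3 interpreter, not the implementation under discussion in this thread, and the variable names are made up for the example. It shows that an isolated surrogate is an ordinary code point inside a string, that its general category differs from that of a non-character, and that it is only rejected when the string is encoded for interchange.

```python
import unicodedata

lone = "\ud800"      # an isolated high surrogate (illustrative name)
nonchar = "\ufffe"   # a non-character code point (illustrative name)

# Inside a str, the lone surrogate is simply one code point.
lone_length = len(lone)

# General categories: 'Cs' is "surrogate", 'Cn' is "unassigned",
# which is where non-characters are filed.
lone_category = unicodedata.category(lone)
nonchar_category = unicodedata.category(nonchar)

# The strict UTF-8 codec refuses to serialize the lone surrogate.
try:
    lone.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# The 'surrogatepass' error handler lets it through unchanged.
raw = lone.encode("utf-8", "surrogatepass")

print(lone_length, lone_category, nonchar_category, strict_encode_ok, raw)
```

The choice of U+D800 and U+FFFE is arbitrary; any surrogate and any non-character should behave the same way under these assumptions.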

I agree that it's unfortunate that some parts of Python treat Unicode
string objects purely as sequences of Unicode code points, and others
purport (apparently without checking) that such strings are in UTF-16.
Unicode conformance is not part of the Python language.  That's life.
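
To make the two views concrete for comparison, the following is a small sketch, assuming Python 3.3 or later (where a str is indexed by code point). The helper utf16_units is hypothetical, written only for this example. It compares two strings once as sequences of code points and once as sequences of UTF-16 code units, and the two orderings disagree.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints.

    Hypothetical helper for illustration; 'surrogatepass' lets
    isolated surrogates through instead of raising.
    """
    data = s.encode("utf-16-be", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]

a = "\uffff"        # last code point of the BMP
b = "\U00010000"    # first supplementary code point

# Code-point view: U+FFFF sorts before U+10000.
code_point_less = a < b

# UTF-16 view: [0xFFFF] versus [0xD800, 0xDC00], so the order flips.
code_unit_less = utf16_units(a) < utf16_units(b)

print(code_point_less, code_unit_less)
```

Under these assumptions the first comparison is true and the second is false, so the answer to "which string is smaller" depends on which of the two views the operation takes.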

But let's try to avoid creating difficulties that don't exist in the
standard.

 > > So there's nothing "wrong by definition" about defining strings as
 > > sequences of code points, and string operations in code-point-wise
 > > fashion.

 > It's not perfect, but that's the state of the art. AFAIK this (or worse)
 > is what the other implementations do.

My point was precisely that I don't object to this implementation.  I
want Unicode-ly-correct behavior to be a goal of the language; the
community disagrees, and Guido disagrees.  That's that.

Thank you for starting work on implementation; let's concentrate on
that.

