[Python-3000] String comparison

Tue Jun 12 12:27:45 CEST 2007

On 6/10/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> I think you misunderstand.  Anything in Unicode that is normative is
> about interchange.  Strings are also a means of interchange---between
> modules (separate Unicode processes) in a program (single OS process).

Like Martin said, "what is a process?" :-) If you have a module that uses
noncharacters to mean something and it documents that, then that may well
be useful to its users. In my mind everything in a Python program is
within a single Unicode process, unless you have a very high level
component which specifies otherwise in its API documentation.

> Your complaint about Python mixing "pseudo-UTF-16" with "pseudo-UCS-2"
> is precisely a statement that various modules in Python do not specify
> what encoding forms they purport to accept or emit.

Actually, I said that there's no way to always do the right thing as long
as they are mixed, but that argument was too theoretical. Practically
speaking, there's little need to interpret surrogate pairs as two
code points instead of as one non-BMP code point. The best use case I
could come up with was reading in an ill-formed UTF-8 file to see what
makes it ill-formed, but that's best done using bytes anyway.

E.g. '\xed\xa0\x80\xed\xb0\x80\xf0\x90\x80\x80' decodes to
u'\ud800\udc00\U00010000' on both builds, but because
u'\U00010000' == u'\ud800\udc00' on a UCS-2 build, the distinction is
lost there. Effectively the codec has decoded the first two code points
to UCS-2 and the last code point to UTF-16, forming a string which mixes
the two interpretations instead of using one of them consistently, and
because of that you can no longer recover the original code point stream.
But what the decoder should really do is raise an exception anyway, as
the input is ill-formed.
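
To make the byte-level inspection concrete, here is a minimal sketch. It
assumes a later Python 3 interpreter (3.3 or newer), where the strict UTF-8
codec rejects encoded surrogates and the 'surrogatepass' error handler lets
them through; neither describes the 2.x builds discussed above. The helper
names inspect_ill_formed and code_points are made up for illustration.

```python
# The byte string from the example: U+D800 and U+DC00 each encoded as
# three bytes (ill-formed UTF-8), followed by a well-formed U+10000.
data = b'\xed\xa0\x80\xed\xb0\x80\xf0\x90\x80\x80'


def inspect_ill_formed(raw):
    """Return (offset, reason) of the first decoding error, or None.

    Hypothetical helper: it works on bytes and only reports where the
    strict decoder gives up, without building a string.
    """
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError as exc:
        return exc.start, exc.reason
    return None


def code_points(raw):
    """Decode leniently and list the resulting code points in hex.

    'surrogatepass' is an assumption about the interpreter in use; it
    keeps encoded surrogates as lone surrogate code points.
    """
    return [hex(ord(ch)) for ch in raw.decode('utf-8', 'surrogatepass')]


if __name__ == '__main__':
    print(inspect_ill_formed(data))
    print(code_points(data))
```

With the strict codec the error is reported at offset 0, which is the
behaviour argued for above. The lenient decode keeps the three code points
apart only because such an interpreter does not treat a surrogate pair as
equal to the non-BMP code point.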

Java and C# (and thus Jython and IronPython too) also sometimes use
UCS-2, sometimes UTF-16. As long as it works as you expect, there isn't a
problem, really.

On UCS-4 builds of CPython it's the same (either UCS-4 or UTF-32 with the
extension that surrogates work as in UTF-16), but you get the extra
complication that some equal strings don't compare equal, e.g.
u'\U00010000' != u'\ud800\udc00'. Even that doesn't cause problems in
practice, because you shouldn't have strings like u'\ud800\udc00' in the
first place.
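
A small sketch of that inequality and of one way to normalize such strings
follows. It assumes Python 3.3 or newer, which in this respect behaves like
the UCS-4 build described above. The function name join_surrogates is
hypothetical, and the UTF-16 round trip is just one possible technique, not
something CPython does for you.

```python
pair = '\ud800\udc00'     # two surrogate code points
astral = '\U00010000'     # the single non-BMP code point they stand for


def join_surrogates(text):
    """Reinterpret surrogate pairs in text as non-BMP code points.

    Hypothetical helper: encode to UTF-16 code units, letting surrogates
    pass through unchanged, then decode strictly. An unpaired surrogate
    makes the decode step raise UnicodeDecodeError.
    """
    return text.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')


if __name__ == '__main__':
    print(pair == astral)                   # False
    print(len(pair), len(astral))           # 2 1
    print(join_surrogates(pair) == astral)  # True
```

In practice this should rarely be needed, for the reason given above:
strings like the pair should not exist in the first place.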