[Python-3000] String comparison

Rauli Ruohonen rauli.ruohonen at gmail.com
Wed Jun 6 10:50:08 CEST 2007


(Martin's right, it's not good to discuss this in the huge PEP 3131
thread, so I'm changing the subject line)

On 6/6/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> In the language of these standards, I would expect that string
> comparison is exactly the kind of higher-level process they have in
> mind.  In fact, it is given as an example in what Jim quoted above.
>
>  > > C9 A process shall not assume that the interpretations of two
>  > > canonical-equivalent character sequences are distinct.
>  >
>  > Right. What is "a process"?
>
> Anything that accepts Unicode on input or produces it on output, and
> claims to conform to the standard.

Strings are internal to Python. This is a whole separate issue from
normalization of source code or its parts (such as identifiers). Once
you have read in a text file and applied whatever normalizations you
want, what you are left with is an internal representation in memory,
which can be anything that's convenient to the programmer. The question is,
what is convenient to the programmer?

>  > > Ideally, an implementation would always interpret two
>  > > canonical-equivalent character sequences identically. There are
>  >
>  > So it should be the application's choice.
>
> I don't think so.  I think the kind of practical circumstance they
> have in mind is (eg) a Unicode document which is PGP-signed.  PGP
> clearly will not be able to verify a canonicalized document, unless it
> happened to be in canonical form when transmitted.  But I think it is
> quite clear that they do not admit that an implementation might return
> False when evaluating u"L\u00F6wis" == u"Lo\u0308wis".

It is up to Python to define what "==" means, just like it defines
what "is" means. It may be canonical equivalence for strings, but
then again it may not. It depends on what you want, and what you
think strings are. If you think they're sequences of code points,
which is what they act like in general (assuming UTF-32 was selected
at compile time), then bitwise comparison is quite consistent whether
the string is in normalized form or not.
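
For illustration, with code point semantics that comparison is False
unless the caller normalizes explicitly; a sketch using the existing
unicodedata module:

    >>> import unicodedata
    >>> u"L\u00F6wis" == u"Lo\u0308wis"     # plain code point comparison
    False
    >>> (unicodedata.normalize('NFC', u"L\u00F6wis") ==
    ...  unicodedata.normalize('NFC', u"Lo\u0308wis"))
    True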

Handling strings as sequences of code points is the simplest and most
general thing to do, but there are other options. One is to change
comparison to be collation (and presumably also make regexp matching
and methods like startswith consistent with that). Another is to always
keep strings in a specific normalized form. Yet another is to add a
separate type for strings-as-grapheme-sequences, which would strictly
follow user expectations for characters (= graphemes) in length,
indexing, comparison and so on.
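
As a rough sketch of the collation option with today's standard
library (this assumes a suitable locale is installed and that the
platform can collate unicode strings; collation is also a different
relation from canonical equivalence):

    import locale
    locale.setlocale(locale.LC_COLLATE, 'de_DE.UTF-8')  # assumed available
    # Code point order puts u'\u00e4' after u'z'; locale collation
    # typically sorts it next to u'a' instead.
    sorted([u'z', u'\u00e4', u'a'])                      # code point order
    sorted([u'z', u'\u00e4', u'a'], cmp=locale.strcoll)  # collation order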

Changing just the comparison has the drawback that many current string
invariants break: a == b would no longer imply any of len(a) == len(b),
set(a) == set(b), a[i:j] == b[i:j] or repr(a) == repr(b). You'd also
have to use bytes for any processing of code point sequences (such as
XML processing), because most operations (including dictionary and set
operations) would act as if you had normalized your strings; if you
have to do contortions to avoid that, it's easier to just use bytes.
There would also be security implications in strings comparing equal
but not always quite acting equal.
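
For instance, with the pair above:

    >>> a, b = u"L\u00F6wis", u"Lo\u0308wis"
    >>> len(a), len(b)
    (5, 6)
    >>> a[1:3], b[1:3]
    (u'\xf6w', u'o\u0308')

If "==" meant canonical equivalence, a == b would be True here even
though len(a) != len(b), set(a) != set(b) and a[1:3] != b[1:3].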

Always doing normalization would still force you to use bytes for
processing code point sequences (e.g. XML, which must not be
normalized), which is not nice. It's also not nice to force a particular
normalization on the programmer, as another one may be better for some
uses. E.g. an editor may be simpler to implement if everything is
consistently decomposed (NFD), but for most communication you'd want to
use NFC, as you would for many other kinds of processing (e.g. the
"code point == grapheme" equation is perfectly adequate for many
purposes with NFC, but not with NFD).
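
For example:

    >>> import unicodedata
    >>> s = u"L\u00F6wis"
    >>> len(unicodedata.normalize('NFC', s))   # one code point per grapheme
    5
    >>> len(unicodedata.normalize('NFD', s))   # o + combining diaeresis
    6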

Having a type for grapheme sequences would seem like the least
problematic choice, but there's little demand for such a type.
Most intelligent Unicode processing doesn't use a grapheme
representation for performance reasons, and in most other cases the
"code point == grapheme" equation or treatment of strings as atoms is
adequate. The standard library might provide this type if necessary.
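
If someone did want such a type, a rough sketch of the underlying
iteration is easy enough. This only attaches combining marks to the
preceding base code point; a real implementation would follow the full
grapheme cluster rules of UAX#29:

    import unicodedata

    def graphemes(s):
        # Yield grapheme-like clusters: a base code point plus any
        # combining marks that follow it.
        cluster = u''
        for c in s:
            if cluster and unicodedata.combining(c):
                cluster += c
            else:
                if cluster:
                    yield cluster
                cluster = c
        if cluster:
            yield cluster

    # [len(g) for g in graphemes(u"Lo\u0308wis")] -> [1, 2, 1, 1, 1]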

>  > So this *allows* to canonicalize strings, it doesn't *require* Python
>  > to do so. Indeed, doing so would be fairly expensive, and therefore
>  > it should not be done (IMO).
>
> It would be much more expensive to make all string comparisons grok
> canonical equivalence.  That's why it *allows* canonicalization.

FWIW, I don't buy that normalization is expensive, as most strings are
in NFC form anyway, and there are fast checks for that (see UAX#15,
"Detecting Normalization Forms"). Python does not currently have
a fast path for this, but if it's added, then normalizing everything
to NFC should be fast.
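
A sketch of what such a fast path could look like (pure-ASCII strings
are trivially in NFC already; a real quick check would use the
NFC_Quick_Check data from UAX#15):

    import unicodedata

    def to_nfc(s):
        # Skip the normalization call entirely for the common case:
        # ASCII-only strings are always in NFC.
        if all(ord(c) < 128 for c in s):
            return s
        return unicodedata.normalize('NFC', s)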

