[Python-3000] String comparison
Stephen J. Turnbull
turnbull at sk.tsukuba.ac.jp
Fri Jun 8 05:31:44 CEST 2007
Rauli Ruohonen writes:
> > I think the default case should be that text operations produce the
> > expected result in the text domain, even at the expense of array
> > invariants.
> If you really want that, then you need a type for sequences of graphemes.
No. "Text" != "sequence of graphemes". For example:
> E.g. 'c\u0308' is already normalized according to all four normalization
> rules, but it's still one grapheme ('c' with diaeresis, c~)
Not on my terminal, it's not; it's two. And what about audible
Python cannot compute graphemes, the Python user can only observe them
after some other process displays them. So Python's definition of
"text" cannot be grapheme-based.
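
The quoted claim about 'c\u0308' can be checked with the standard unicodedata module. This is only a small sketch; the names `s`, `forms` and `unchanged` are illustrative and do not come from the thread. It shows that all four normalization forms leave the string as two code points, so whether it counts as one grapheme is decided by whatever displays it.

```python
import unicodedata

s = "c\u0308"  # 'c' followed by U+0308 COMBINING DIAERESIS

# Unicode has no precomposed "c with diaeresis", so each normalization
# form should return the string unchanged.
forms = ("NFC", "NFD", "NFKC", "NFKD")
unchanged = all(unicodedata.normalize(form, s) == s for form in forms)

print(unchanged)  # True
print(len(s))     # 2 code points, however a terminal chooses to draw them
```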
> > People who need arrays of code points have several ways to get them,
> > and the usual comparison operators will work on them as desired.
> But regexps and other string operations won't,
I do not have any objection to treating Unicode strings as sequences
of code points, and allowing them to be unnormalized -- as an option.
The *default* should be to treat them as text, or there should be a
simple way to make it default ("import trueunicode"). I do not want
to have to check every string for normalization by hand. I don't
object to the overhead---the overhead is already pretty high for
Unicode conformance. It's that I know I'll make mistakes, or use
libraries that do undocumented I/O or non-Unicode-conformant
transformations, or whatever. The right place to do such checking is
in the Unicode datatype, not in application code.
> > While people who need operations on *text* still have no
> > straightforward way to get them, and no promise of one as I read your
> > remarks.
> Then you missed some of his earlier remarks:
> : I'm all for adding a way to do normalized string comparisons to the
> : library. But I'm not about to change the == operator to apply
> : normalization first.
Funny, that's precisely the remark I was thinking of.
If I write a Unicode string, I want the == operator to "just work".
As quoted, Guido says it will not. Note that we *already* have a way
to do normalized string comparisons via unicodedata, and we can even
use "==" for it. So Guido would have every right to consider his
promise already fulfilled.
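
For concreteness, a minimal sketch of the comparison that already works through unicodedata. The helper name `nfc_equal` is made up for this illustration, and NFC is an assumed choice of form; the thread does not say which form to use.

```python
import unicodedata

def nfc_equal(a, b):
    """Compare two strings after normalizing both to NFC (illustrative helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

print(composed == decomposed)           # False: plain == compares code points
print(nfc_equal(composed, decomposed))  # True: == applied after normalization
```

The burden stays with the caller: every comparison site has to remember to normalize first.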
The problem is not that a code-point oriented operator won't work if
you know you have two TrueText objects; you only have to implement
them correctly, and code-point comparison Just Works. The problem is
that it's going to be very hard to be sure that you've got TrueText as
opposed to arrays of shorts if the *language* does not provide ways to
enforce the distinction.
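
A rough sketch of that difficulty, assuming a hypothetical TrueText type written as a str subclass that normalizes to NFC on construction. Nothing here is an existing or proposed API; the class name is borrowed from the paragraph above and the design is only one possible reading of it.

```python
import unicodedata

class TrueText(str):
    """Hypothetical str subclass whose value is always NFC-normalized."""

    def __new__(cls, value=""):
        # Normalize once, at construction time.
        return super().__new__(cls, unicodedata.normalize("NFC", value))

a = TrueText("\u00e9")   # composed form
b = TrueText("e\u0301")  # decomposed form, normalized on the way in

print(a == b)              # True: code-point comparison Just Works
print(a == "e\u0301")      # False: the right-hand side is an unnormalized str
print(type(a + b) is str)  # True: concatenation silently returns a plain str
```

The last two lines are the point: once a plain str enters, or an ordinary string operation returns one, application code can no longer tell whether it holds TrueText.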