[Python-3000] String comparison

Fri Jun 8 05:31:44 CEST 2007

Rauli Ruohonen writes:

 Stephen wrote:

 > > I think the default case should be that text operations produce the
 > > expected result in the text domain, even at the expense of array
 > > invariants.
 > 
 > If you really want that, then you need a type for sequences of graphemes.

No.  "Text" != "sequence of graphemes".  For example:

 > E.g. 'c\u0308' is already normalized according to all four normalization
 > rules, but it's still one grapheme ('c' with diaeresis, c̈)

Not on my terminal, it's not; it's two.  And what about audible
representation?

Python cannot compute graphemes, the Python user can only observe them
after some other process displays them.  So Python's definition of
"text" cannot be grapheme-based.
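
For concreteness, here is a minimal sketch of the 'c\u0308' example, using only the standard unicodedata module.  It checks that the string is unchanged by each of the four normalization forms while still being two code points as far as Python is concerned.  The variable names are illustrative only.

```python
import unicodedata

# 'c' followed by U+0308 COMBINING DIAERESIS; Unicode has no
# precomposed "c with diaeresis", so no form can compose it.
s = 'c\u0308'

FORMS = ('NFC', 'NFD', 'NFKC', 'NFKD')

# For each normalization form, record whether normalizing changes s.
unchanged = {form: unicodedata.normalize(form, s) == s for form in FORMS}

# What Python itself sees: a sequence of two code points.
code_points = [hex(ord(ch)) for ch in s]

print(unchanged)
print(len(s), code_points)
```

Whether a display renders this as one glyph or two is decided outside Python, which is the point made above.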

 > > People who need arrays of code points have several ways to get them,
 > > and the usual comparison operators will work on them as desired.
 > 
 > But regexps and other string operations won't,

I do not have any objection to treating Unicode strings as sequences
of code points, and allowing them to be unnormalized -- as an option.

The *default* should be to treat them as text, or there should be a
simple way to make it default ("import trueunicode").  I do not want
to have to check every string for normalization by hand.  I don't
object to the overhead---the overhead is already pretty high for
Unicode conformance.  It's that I know I'll make mistakes, or use
libraries that do undocumented I/O or non-Unicode-conformant
transformations, or whatever.  The right place to do such checking is
in the Unicode datatype, not in application code.

 > > While people who need operations on *text* still have no
 > > straightforward way to get them, and no promise of one as I read your
 > > remarks.
 > 
 > Then you missed some of his earlier remarks:
 > 
 > Guido:

 > : I'm all for adding a way to do normalized string comparisons to the
 > : library. But I'm not about to change the == operator to apply
 > : normalization first.

Funny, that's precisely the remark I was thinking of.

If I write a Unicode string, I want the == operator to "just work".
As quoted, Guido says it will not.  Note that we *already* have a way
to do normalized string comparisons via unicodedata, and we can even
use "==" for it.  So Guido would have every right to consider his
promise already fulfilled.
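
A minimal sketch of what "normalized comparison via unicodedata" looks like in practice.  The helper name normalized_equal is made up for illustration, and NFC as the default form is an assumption; any of the four forms could be used as long as both sides get the same one.

```python
import unicodedata


def normalized_equal(a, b, form='NFC'):
    """Compare two strings after applying the same normalization form.

    Hypothetical helper: the name and the NFC default are assumptions.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = '\u00e9'      # LATIN SMALL LETTER E WITH ACUTE
decomposed = 'e\u0301'   # 'e' followed by COMBINING ACUTE ACCENT

# Plain == compares code points, so the two spellings differ.
print(composed == decomposed)

# Normalizing both sides first makes == compare them as text.
print(normalized_equal(composed, decomposed))
```

The catch is that every call site has to remember to do this by hand.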

The problem is not that a code-point oriented operator won't work if
you know you have two TrueText objects; you only have to implement
TrueText correctly, and code-point comparison Just Works.  The problem
is that it's going to be very hard to be sure that you've got TrueText
as opposed to arrays of shorts if the *language* does not provide ways
to enforce the distinction.
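
To illustrate, a rough sketch of a TrueText-like type as a str subclass that normalizes on construction.  This is not an existing Python type; the class body and the choice of NFC are assumptions made for the example.

```python
import unicodedata


class TrueText(str):
    """Hypothetical text type: contents are always NFC-normalized."""

    def __new__(cls, value=''):
        # Normalize once at construction; after that, the inherited
        # code-point comparison is enough for two TrueText objects.
        return super().__new__(cls, unicodedata.normalize('NFC', value))


a = TrueText('e\u0301')   # decomposed spelling
b = TrueText('\u00e9')    # precomposed spelling

# Code-point == works because both sides were normalized on the way in.
print(a == b)

# But ordinary string operations hand back a plain str, so the
# guarantee is lost unless the result is wrapped again by hand.
joined = a + '\u0301'
print(type(joined).__name__)
```

The second print is the enforcement problem in miniature: nothing in the language keeps the result of an operation inside the normalized type.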