[Python-3000] String comparison
Stephen J. Turnbull
turnbull at sk.tsukuba.ac.jp
Thu Jun 7 09:34:51 CEST 2007
Josiah Carlson writes:
> Maybe I'm missing something, but it seems to me that there might be a
> simple solution. Don't normalize any identifiers or strings.
That's not a solution, that's denying that there's a problem.
> Hear me out for a moment. People type what they want.
You're thinking in ASCII terms still, where code points == characters.
With Unicode, what they see is the *single character* they *want*, but
it may be represented by a half-dozen characters in RAM, a different
set of characters in the file, and they may have typed a dozen hard-
to-relate keystrokes to get it (eg, typing a *phonetic prefix* of a
word whose untyped trailing character is the one they want). And if
everything that handles the text is Unicode conformant, it doesn't
matter! In that context, just what does "people type what they want"
mean?
By analogy, suppose I want to generate a table (such as Martin's
table-331.html) algorithmically. Then doesn't it seem reasonable that
the representation might be something like u"\u0041"? But you know
what that sneaky ol' Python 2.5 does to me if I evaluate it? It
returns u'A'! And guess what else? u"\u0041" == u'A' returns True!
And when I print either of them, I see what I expect: A.
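For concreteness, that is exactly what an interactive 2.5 session shows
(nothing hypothetical here, just the behavior described above):

    >>> u"\u0041"
    u'A'
    >>> u"\u0041" == u'A'
    True
    >>> print u"\u0041"
    A
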
Well, what Unicode-conformant editors are allowed to do with NFD and
NFC (and all the non-normalized forms as well) is quite analogous.
But a conformant process is expected not to distinguish among them,
just as two instances of Python are expected to compare those two
*different* string literals as equal. Thus it doesn't matter (for
most purposes) what those editors do, just as it doesn't matter
(except as a point of style) how you spell u"A".
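Here is the same point one level up, as a sketch with the stdlib
unicodedata module (the character names in the comments are from the
Unicode standard; nothing here is specific to any proposal):

    >>> import unicodedata
    >>> composed = u"\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
    >>> decomposed = u"e\u0301"  # 'e' + COMBINING ACUTE ACCENT
    >>> composed == decomposed   # raw code point comparison sees two spellings
    False
    >>> unicodedata.normalize('NFC', decomposed) == composed
    True
    >>> unicodedata.normalize('NFD', composed) == decomposed
    True
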
> As for strings, I think we should opt for keeping it as simple as
> possible. Compare by code points.
If you normalize on the way in, you can do that *correctly*. If you
don't ...
> To handle normalization issues, add a normalization method that
> people call if they care about normalized unicode strings*.
...you impose the normalization on application programmers who think
of unicode strings as internationalized text (but they aren't! they're
arrays of unsigned shorts), or on module writers who have weak
incentive to get 100% coverage. Note that these programs don't crash;
they silently give false negatives. Fixing these bugs *before*
selling the code is hard and expensive; who will care to do it?
Eg, *you*. You clearly *don't* care in your daily work, even though
you are sincerely trying to understand on python-dev. But your (quite
proper!) objective is to lower costs for you and your code, since YAGNI.
Where *I* need it, I will cross you off my list of acceptable vendors
(of off-the-shelf modules; I can't afford your consulting rates).
Well and good, that's how it *should* work. But your (off-the-shelf)
modules will possibly see use by the Japanese Social Security
Administration, who have demonstrated quite graphically how little
they care[1]. :-(
Furthermore, there are typically an awful lot of ways that a string
can get into the process, and if you do care, you want to catch them
all.
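Miss just one of those ways in and you get exactly the silent false
negative I mean; a hypothetical sketch (the names are made up):

    >>> import unicodedata
    >>> records = {u"Ren\u00e9": 42}   # key happens to be stored in NFC
    >>> query = u"Rene\u0301"          # same name, arrived in NFD from elsewhere
    >>> query in records               # no exception, just a quiet miss
    False
    >>> unicodedata.normalize('NFC', query) in records
    True
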
This is a lot easier to do *in* the Python compiler and interpreter,
which have a limited number of I/O channels, than it will be to do for
a large library of modules, not all of which even exist at this date.
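What "doing it at the channel" might look like, as a hedged sketch (a
hypothetical helper, not anything the interpreter or stdlib does today):

    import codecs
    import unicodedata

    def read_text_nfc(path, encoding='utf-8'):
        # Decode at the boundary and normalize once, so everything
        # downstream can compare by code point and still get the
        # "same character" answer.
        f = codecs.open(path, 'r', encoding)
        try:
            return unicodedata.normalize('NFC', f.read())
        finally:
            f.close()

Do that in the handful of places where text enters the program and the
rest of the code never needs to think about it; do it per-module and
you are back to hoping every author remembered.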
> * Or leave out normalization all together in 3.0 . I haven't heard any
> complaints about the lack of normalization in Python so far (though
> maybe I'm not reading the right python-list messages), and Python has
> had unicode for what, almost 10 years now?
I presented a personal anecdote about docutils in my response to GvR,
and a failed test from XEmacs (which, admittedly, Python already gets
right). Strictly speaking, the former is not a normalization issue,
since it's probably a fairly idiosyncratic change in docutils, but
it's the kind of problem that would be mitigated by normalization.
But you won't see many complaints, because almost all text in Western European
languages is almost automatically NFC, unless somebody who knows what
they're doing deliberately denormalizes or renormalizes it (as in Mac
OS X). Also, a lot of problems will get attributed to legacy
encodings, although proper attention to canonical (and a subset of
compatibility) equivalences would go a long way to resolve them.
These issues are going to become more prevalent as more scripts are
added to Unicode and actually come into use, and as their users start
deploying IT on a large scale for the first time.
Footnotes:
[1] About 20 million Japanese face partial or total loss of their
pensions because the Japanese SSA couldn't be bothered to canonicalize
their names accurately when the system was automated in the '90s.