[Python-3000] String comparison
Stephen J. Turnbull
turnbull at sk.tsukuba.ac.jp
Wed Jun 6 14:33:19 CEST 2007
Rauli Ruohonen writes:
> Strings are internal to Python. This is a whole separate issue from
> normalization of source code or its parts (such as identifiers).
Agreed. But please note that we're not talking about representation.
We're talking about the result of evaluating a comparison:
    if u"L\u00F6wis" == u"Lo\u0308wis":
        print "Python is Unicode conforming in this respect."
    else:
        print "I guess it's time to start learning Ruby."
I think it's reasonable to be astonished if Python doesn't at least
try to print "Python is Unicode conforming in this respect." for the
above snippet by default.
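For concreteness, here is a minimal sketch of the two behaviors in question, written in Python 3 syntax rather than the Python 2 syntax of the snippet above. The helper name canonically_equal is hypothetical, introduced only for illustration; unicodedata.normalize is the standard library call. Plain "==" compares code point sequences, so the two spellings differ, while comparing their NFC forms treats them as equal.

```python
import unicodedata

composed = "L\u00F6wis"      # o-with-diaeresis as a single code point
decomposed = "Lo\u0308wis"   # "o" followed by a combining diaeresis

def canonically_equal(a, b):
    # Hypothetical helper: compare two strings under canonical
    # equivalence by normalizing both sides to NFC first.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Code point comparison: the sequences have different lengths and differ.
print(composed == decomposed)
# Comparison after normalization: the strings are canonically equivalent.
print(canonically_equal(composed, decomposed))
```

Whether "==" itself should behave like the helper is exactly the design question under discussion; the sketch only shows what each choice would return.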
> It is up to Python to define what "==" means, just like it defines
> what "is" means.
You are of course correct. However, if, given that u prefix, Python
chooses to define == in a way that does not respect canonical
equivalence, what's the point of having Unicode strings at all?
> Always doing normalization would still force you to use bytes for
> processing code point sequences (e.g. XML, which must not be
> normalized), which is not nice.
I'm not talking about "nice" yet, just about Unicode conformance. How
to implement conformant behavior is of course entirely up to Python.
As is choosing *whether* to conform or not, but it seems bizarre to me
that one might choose to implement UAX#31 verbatim, and also have
u"L\u00F6wis" == u"Lo\u0308wis" evaluate to False.
> FWIW, I don't buy that normalization is expensive, as most strings are
> in NFC form anyway, and there are fast checks for that (see UAX#15,
> "Detecting Normalization Forms"). Python does not currently have
> a fast path for this, but if it's added, then normalizing everything
> to NFC should be fast.
If O(n) is "fast".
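A rough sketch of the fast path being discussed, assuming Python 3.8 or later, where unicodedata.is_normalized is available. The function name to_nfc is hypothetical. The check avoids building a new string when the input is already in NFC, but it still has to scan the input, so the cost stays linear in the length of the string.

```python
import unicodedata

def to_nfc(s):
    # Hypothetical helper: return s unchanged when it is already in NFC,
    # and normalize only otherwise. The check itself still walks the
    # string, so this saves allocation and copying, not the O(n) scan.
    if unicodedata.is_normalized("NFC", s):
        return s
    return unicodedata.normalize("NFC", s)

already_nfc = "L\u00F6wis"
needs_work = "Lo\u0308wis"

print(to_nfc(already_nfc) is already_nfc)
print(to_nfc(needs_work) == already_nfc)
```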