[Python-3000] String comparison
Bill Janssen
janssen at parc.com
Wed Jun 6 19:12:56 CEST 2007
> Hear me out for a moment. People type what they want.
I do a lot of Pythonic processing of UTF-8, which is not "typed by
people", but instead extracted from documents by automated processing.
Text is also data -- an important thing to keep in mind.
As far as normalization goes, I agree with you about identifiers, and
I use "unicodedata.normalize" extensively in the cases where I care
about normalization of data strings. The big issue is string literals.
I think I agree with Stephen here:
u"L\u00F6wis" == u"Lo\u0308wis"
should be True (assuming he typed it correctly in the first place :-),
because they are the same Unicode string. I don't understand Guido's
objection here -- it's a lexer issue, right? The underlying character
string will still be the same in both cases.
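
For reference, a minimal sketch of how the two spellings compare at the code-point level, and after normalization with "unicodedata.normalize". It assumes a Python 3 interpreter (where the u"" prefix is still accepted); the variable names are only illustrative.

```python
import unicodedata

precomposed = u"L\u00F6wis"  # o-with-diaeresis as a single code point
decomposed = u"Lo\u0308wis"  # plain 'o' followed by COMBINING DIAERESIS

# A plain == compares code point by code point, so the two differ.
raw_equal = precomposed == decomposed

# After normalizing both sides to NFC they compare equal.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))

print(len(precomposed), len(decomposed), raw_equal, nfc_equal)
```

Whether the normalization step belongs in the lexer, in the comparison, or only in user code is exactly the open question here; the sketch takes no position on that.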
But it's complicated. Clearly we expect
(u"abc" + u"def") == (u"a" + u"bcdef")
to be True, so
(u"L" + u"\u00F6" + u"wis") == (u"Lo" + u"\u0308" + u"wis")
should also be True. Where I see difficulty is
(u"L" + unichr(0x00F6) + u"wis") == (u"Lo" + unichr(0x0308) + u"wis")
I suppose unichr(0x0308) should raise an exception -- a combining
diacritic by itself shouldn't be convertible to a character.
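
A rough sketch of what such a check could look like, assuming Python 3 (where unichr() is spelled chr()). The name "strict_chr" is hypothetical and not an existing or proposed builtin; it treats any code point in a Unicode "Mark" general category as a combining character.

```python
import unicodedata

def strict_chr(code_point):
    """Hypothetical stricter unichr(): refuse a lone combining mark."""
    ch = chr(code_point)  # Python 3 spelling of unichr()
    # General categories Mn, Mc and Me are the Unicode "Mark" categories.
    if unicodedata.category(ch).startswith("M"):
        raise ValueError("U+%04X is a combining mark" % code_point)
    return ch

base = strict_chr(0x00F6)  # fine: a precomposed letter

try:
    strict_chr(0x0308)  # COMBINING DIAERESIS on its own
    rejected = False
except ValueError:
    rejected = True
```

The obvious cost of this design is that code which legitimately builds decomposed strings piece by piece would have to get its combining marks some other way.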
Bill