[Python-3000] String comparison

Bill Janssen janssen at parc.com
Wed Jun 6 19:12:56 CEST 2007


> Hear me out for a moment.  People type what they want.

I do a lot of Pythonic processing of UTF-8, which is not "typed by
people", but instead extracted from documents by automated processing.
Text is also data -- an important thing to keep in mind.

As far as normalization goes, I agree with you about identifiers, and
I use "unicodedata.normalize" extensively in the cases where I care
about normalization of data strings.  The big issue is string literals.
I think I agree with Stephen here:

    u"L\u00F6wis" == u"Lo\u0308wis"

should be True (assuming he typed it correctly in the first place :-),
because they are the same Unicode string.  I don't understand Guido's
objection here -- it's a lexer issue, right?  The underlying character
string will still be the same in both cases.
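
To make the distinction concrete, here is a minimal sketch, assuming a
Python 3 interpreter where plain string literals are Unicode and == compares
code point sequences.  The helper name nfc_equal is hypothetical, not an
existing API; only unicodedata.normalize comes from the standard library.

```python
import unicodedata

def nfc_equal(a, b):
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "L\u00F6wis"   # o-with-diaeresis stored as one code point
decomposed = "Lo\u0308wis"   # 'o' followed by U+0308 COMBINING DIAERESIS

# Plain == looks at the code points, so the two spellings differ.
raw_equal = precomposed == decomposed

# After normalizing both sides to the same form they compare equal.
normalized_equal = nfc_equal(precomposed, decomposed)
```

Under these assumptions raw_equal is False and normalized_equal is True, so
whether the two literals compare equal depends on whether something
normalizes them before the comparison.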

But it's complicated.  Clearly we expect

    (u"abc" + u"def") == (u"a" + u"bcdef")

to be True, so

    (u"L" + u"\u00F6" + u"wis") == (u"Lo" + u"\u0308" + u"wis")

should also be True.  Where I see difficulty is

    (u"L" + unichr(0x00F6) + u"wis") == (u"Lo" + unichr(0x0308) + u"wis")

I suppose unichr(0x0308) should raise an exception -- a combining
diacritic by itself shouldn't be convertible to a character.
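
The concatenation difficulty can be sketched the same way.  This assumes
Python 3, where chr() plays the role of unichr() and accepts a combining
code point without raising; that is the interpreter's current behavior, not
the exception suggested above.  The helper name is_combining is hypothetical.

```python
import unicodedata

def is_combining(ch):
    """True if ch has a nonzero canonical combining class (hypothetical helper)."""
    return unicodedata.combining(ch) != 0

mark = chr(0x0308)  # COMBINING DIAERESIS as a one-character string

left = "L" + chr(0x00F6) + "wis"
right = "Lo" + mark + "wis"

# Each piece on the right-hand side is already in NFC when taken alone...
pieces_nfc = all(unicodedata.normalize("NFC", p) == p for p in ("Lo", mark, "wis"))

# ...but the concatenated result is not, because 'o' and the mark now compose.
concat_nfc = unicodedata.normalize("NFC", right) == right

# Normalizing after concatenation recovers the precomposed spelling.
equal_after_nfc = unicodedata.normalize("NFC", right) == left
```

Here pieces_nfc is True while concat_nfc is False: normalizing the operands
separately is not enough, so an equality test that wants these two
expressions to match has to normalize the concatenated result.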

Bill



