[Python-3000] String comparison
Guido van Rossum
guido at python.org
Wed Jun 6 19:37:47 CEST 2007
On 6/6/07, Bill Janssen <janssen at parc.com> wrote:
> > Hear me out for a moment. People type what they want.
>
> I do a lot of Pythonic processing of UTF-8, which is not "typed by
> people", but instead extracted from documents by automated processing.
> Text is also data -- an important thing to keep in mind.
>
> As far as normalization goes, I agree with you about identifiers, and
> I use "unicodedata.normalize" extensively in the cases where I care
> about normalization of data strings. The big issue is string literals.
> I think I agree with Stephen here:
>
> u"L\u00F6wis" == u"Lo\u0308wis"
>
> should be True (assuming he typed it correctly in the first place :-),
> because they are the same Unicode string. I don't understand Guido's
> objection here -- it's a lexer issue, right? The underlying character
> string will still be the same in both cases.
So let me explain it. I see two different sequences of code points:
'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
'w', 'i', 's' on the other. Never mind that Unicode has semantics that
claim they are equivalent. They are two different sequences of code
points. We should not hide that Python's unicode string object can
store each sequence of code points equally well, and that when viewed
as a sequence they are different: the first has len() == 5, the second
has len() == 6! When read from a file they are different. Why should
the lexer apply normalization to literals behind my back? I might be
writing either literal with the expectation to get exactly that
sequence of code points, in order to use it as a test case or as input
for another program that requires specific input.
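
To make the difference concrete, here is a minimal sketch, assuming
Python 3, where str holds code points and the u prefix is optional.
The variable names are only illustrative.

```python
# Two spellings of the same name, written with explicit escapes.
precomposed = "L\u00F6wis"   # U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "Lo\u0308wis"   # 'o' followed by U+0308 COMBINING DIAERESIS

# No normalization is applied to the literals, so the lengths differ.
print(len(precomposed), len(decomposed))

# Comparison is code point by code point, so the strings are unequal.
print(precomposed == decomposed)

# Listing the code points shows exactly where the sequences diverge.
print([hex(ord(c)) for c in precomposed])
print([hex(ord(c)) for c in decomposed])
```

Either literal can therefore serve as a test case for a program that
must receive one specific sequence.
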
> But it's complicated. Clearly we expect
>
> (u"abc" + u"def") == (u"a" + u"bcdef")
>
> to be True, so
>
> (u"L" + u"\u00F6" + u"wis") == (u"Lo" + u"\u0308" + u"wis")
>
> should also be True. Where I see difficulty is
>
> (u"L" + unichr(0x00F6) + u"wis") == (u"Lo" + unichr(0x0308) + u"wis")
>
> I suppose unichr(0x0308) should raise an exception -- a combining
> diacritic by itself shouldn't be convertible to a character.
There's a simpler solution. The unicode (or str, in Py3k) data type
represents a sequence of code points, not a sequence of characters.
This has always been the case, and will continue to be the case.
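
A small sketch of what "sequence of code points" means in practice,
assuming Python 3, where unichr() is spelled chr(). A lone combining
mark is an ordinary one-element string and concatenates like any other;
nothing here needs to raise an exception.

```python
import unicodedata

# A combining diacritic on its own is a valid string of length 1.
diaeresis = chr(0x0308)
print(len(diaeresis))
print(unicodedata.name(diaeresis))

# Its canonical combining class is nonzero, which marks it as combining,
# but that is a property in the Unicode database, not a str restriction.
print(unicodedata.combining(diaeresis) > 0)

# Concatenation simply joins the code point sequences.
built = "Lo" + diaeresis + "wis"
print(built == "Lo\u0308wis")
print(built == "L" + chr(0x00F6) + "wis")
```
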
Note that I'm not arguing against normalization of *identifiers*. I
see that as a necessity. I also see that there will be border cases
where getattr(x, 'XXX') and x.XXX are not equivalent for some values
of XXX where the normalized form is a different sequence of code
points. But I don't believe the solution should be to normalize all
string literals. Clearly we will have a normalization routine so the
lexer can normalize identifiers, so if you need normalized data it is
as simple as writing 'XXX'.normalize() (or whatever the spelling
should be).
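
For illustration, a hedged sketch of both points, assuming Python 3.
There the routine is spelled unicodedata.normalize() rather than a str
method. The helper nfc() and the class Namespace are made-up names for
this example. The second half relies on the parser normalizing
identifiers to NFKC, as CPython 3 does following PEP 3131; that
postdates this message, so read it as an example of the border case,
not as something decided here.

```python
import unicodedata


def nfc(s):
    """Return s in Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", s)


precomposed = "L\u00F6wis"
decomposed = "Lo\u0308wis"

# The raw literals differ; explicitly normalized data compares equal.
print(decomposed == precomposed)
print(nfc(decomposed) == nfc(precomposed))


class Namespace:
    """Empty class used only to hold an attribute."""


x = Namespace()
setattr(x, "o\u0308", 1)           # stored under the decomposed name, verbatim
by_string = getattr(x, "o\u0308")  # the string is looked up exactly as given

# As an identifier, the same spelling is normalized by the parser to
# U+00F6, which is a different attribute name, so the lookup fails.
try:
    eval("x.o\u0308", {"x": x})
    found_by_identifier = True
except AttributeError:
    found_by_identifier = False

print(by_string, found_by_identifier)
```
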
--
--Guido van Rossum (home page: http://www.python.org/~guido/)