[Python-3000] String comparison
Jim Jewett
jimjjewett at gmail.com
Thu Jun 7 02:38:51 CEST 2007
On 6/6/07, Guido van Rossum <guido at python.org> wrote:
> > about normalization of data strings. The big issue is string literals.
> > I think I agree with Stephen here:
> > u"L\u00F6wis" == u"Lo\u0308wis"
> > should be True (assuming he typed it correctly in the first place :-),
> > because they are the same Unicode string.
> So let me explain it. I see two different sequences of code points:
> 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
> 'w', 'i', 's' on the other. Never mind that Unicode has semantics that
> claim they are equivalent.
Your (conforming) editor can silently replace one with the other.
A second editor can silently use one, and not replace the other.
==> Uncontrollable, invisible bugs.
> They are two different sequences of code points.
So "str" is about bytes, rather than text?
And bytes is also about bytes; it just happens to be mutable?
Then what was the point of switching to Unicode? Why not just say
"When printed, a string will be interpreted as if it were UTF-8" and
be done with it?
> We should not hide that Python's unicode string object can
> store each sequence of code points equally well, and that when viewed
> as a sequence they are different: the first has len() == 5, the second
> has len() == 6!
For a bytes object, that is true. For Unicode text, they shouldn't be
different -- at least not by the time a user can see it (or measure
it).
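
For concreteness, here is a minimal sketch of the difference under discussion. It assumes a Python 3 interpreter in which str holds code points, and it uses the standard unicodedata module; the variable names are purely illustrative.

```python
import unicodedata

composed = "L\u00F6wis"      # o-with-diaeresis as one precomposed code point
decomposed = "Lo\u0308wis"   # 'o' followed by U+0308 COMBINING DIAERESIS

# Viewed as raw code point sequences, the two differ in length
# and compare unequal.
lengths = (len(composed), len(decomposed))
raw_equal = composed == decomposed

# After normalizing both to NFC they are the same sequence.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

print(lengths, raw_equal, nfc_equal)
```

Whether the language should apply that normalization itself, or leave it to the program, is exactly the point in dispute.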
> I might be writing either literal with the expectation to get exactly that
> sequence of code points,
Then you are assuming non-conformance with Unicode, which requires you
not to depend on that distinction. You should have used bytes, rather
than text.
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf (Conformance)
C9 A process shall not assume that the interpretations of two
canonical-equivalent character sequences are distinct.
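
One way a program could avoid treating canonical-equivalent sequences as distinct is to compare them in a normalized form. The following is only a sketch, assuming the standard unicodedata module; canonical_equal is a hypothetical helper name, not part of Python.

```python
import unicodedata

def canonical_equal(a: str, b: str) -> bool:
    """Return True if a and b are canonically equivalent.

    Both strings are put into NFD (canonical decomposition) and the
    resulting code point sequences are compared.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The two spellings from this thread compare equal under this test,
# while a genuinely different string does not.
print(canonical_equal("L\u00F6wis", "Lo\u0308wis"))
print(canonical_equal("L\u00F6wis", "Lowis"))
```

NFC would serve equally well here; what matters is that both operands go through the same canonical normalization before the comparison.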
> Note that I'm not arguing against normalization of *identifiers*. I
> see that as a necessity. I also see that there will be border cases
> where getattr(x, 'XXX') and x.XXX are not equivalent for some values
> of XXX where the normalized form is a different sequence of code
> points. But I don't believe the solution should be to normalize all
> string literals.
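
The border case described above can be sketched as follows. This assumes the behavior specified for Python 3 in PEP 3131, where identifiers in source code are normalized (to NFKC) but string arguments to getattr() and setattr() are not; the class and variable names are illustrative only.

```python
class Namespace:
    pass

x = Namespace()
decomposed = "Lo\u0308wis"   # 'o' + COMBINING DIAERESIS, not normalized

# setattr/getattr use the string exactly as given.
setattr(x, decomposed, 1)
via_getattr = getattr(x, decomposed)

# The same spelling written as an identifier in source is normalized by
# the compiler, so it looks up the precomposed name instead.
try:
    eval("x." + decomposed, {"x": x})
    via_source = "found"
except AttributeError:
    via_source = "missing"

print(via_getattr, via_source)
```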
For strings created by an extension module, that would be valid. But
Python source code is human-readable text, and should be treated that
way. Either follow the Unicode rules (at least for strings), or don't
call them Unicode.
-jJ