[Python-3000] String comparison

Guido van Rossum guido at python.org
Thu Jun 7 02:47:38 CEST 2007


On 6/6/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 6/6/07, Guido van Rossum <guido at python.org> wrote:
>
> > > about normalization of data strings.  The big issue is string literals.
> > > I think I agree with Stephen here:
>
> > >     u"L\u00F6wis" == u"Lo\u0308wis"
>
> > > should be True (assuming he typed it correctly in the first place :-),
> > > because they are the same Unicode string.
>
> > So let me explain it. I see two different sequences of code points:
> > 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
> > 'w', 'i', 's' on the other. Never mind that Unicode has semantics that
> > claim they are equivalent.
>
> Your (conforming) editor can silently replace one with the other.

No it cannot. We are talking about \u escapes, not about a string
literal containing Unicode characters ("Löwis").

> A second editor can silently use one, and not replace the other.
> ==> Uncontrollable, invisible bugs.

No. Seems you're again not reading before posting. :-(

> > They are two different sequences of code points.
>
> So "str" is about bytes, rather than text?
> and bytes is also about bytes; it just happens to be mutable?

Bytes are not code points. The unicode string type has always been
about code points, not characters.
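
To make the distinction concrete, here is a minimal sketch. It assumes Python 3 syntax, where a plain string literal is a Unicode string, and the variable names are only illustrative. It uses the two literals from the example above and compares code point counts with UTF-8 byte counts:

```python
# The two literals from the thread, written with \u escapes only.
composed = "L\u00F6wis"      # 'L', U+00F6, 'w', 'i', 's'
decomposed = "Lo\u0308wis"   # 'L', 'o', U+0308, 'w', 'i', 's'

# As sequences of code points they are simply different sequences.
same_sequence = composed == decomposed

# len() counts code points, not bytes and not user-perceived characters.
code_point_counts = (len(composed), len(decomposed))

# The UTF-8 byte counts are a third, separate measure.
byte_counts = (len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))

print(same_sequence, code_point_counts, byte_counts)
```

The byte counts differ from the code point counts because U+00F6 and U+0308 each take two bytes in UTF-8.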

> Then what was the point of switching to unicode?  Why not just say
> "When printed, a string will be interpreted as if it were UTF-8" and
> be done with it?

Manipulating code points is a lot more convenient than manipulating UTF-8.
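
A small sketch of what "more convenient" can mean in practice, again assuming Python 3 syntax and illustrative names: slicing a string works on whole code points, while slicing the UTF-8 encoding at the same index can cut a multi-byte sequence in half.

```python
text = "L\u00F6wis"

# Slicing the string keeps whole code points.
first_two = text[:2]             # 'L' followed by U+00F6

# Slicing the UTF-8 bytes at the same index splits the
# two-byte encoding of U+00F6 (0xC3 0xB6) down the middle.
encoded = text.encode("utf-8")
cut = encoded[:2]                # b'L\xc3'

try:
    cut.decode("utf-8")
    broken = False
except UnicodeDecodeError:
    # The truncated sequence is no longer valid UTF-8.
    broken = True

print(repr(first_two), cut, broken)
```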

> > We should not hide that Python's unicode string object can
> > store each sequence of code points equally well, and that when viewed
> > as a sequence they are different: the first has len() == 5, the second
> > has len() == 6!
>
> For a bytes object, that is true.  For unicode text, they shouldn't be
> different -- at least not by the time a user can see it (or measure
> it).

Have you ever even used the unicode string type in Python 2?

> > I might be writing either literal with the expectation to get exactly that
> > sequence of code points,
>
> Then you are assuming non-conformance with unicode, which requires you
> not to depend on that distinction.  You should have used bytes, rather
> than text.

Again, bytes != code points.

> http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf (Conformance)
>
> C9 A process shall not assume that the interpretations of two
> canonical-equivalent character sequences are distinct.

That is surely contained inside all sorts of weasel words that allow
us to define a "normalized equivalence" function that works that way,
and leave the "==" operator for arrays of code points alone.
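
As a rough sketch of such a function (assuming Python 3 syntax and the standard unicodedata module; the name canonically_equivalent is hypothetical, not an existing API), canonical equivalence can be tested by normalizing both sides before comparing, while "==" keeps comparing code points:

```python
import unicodedata

def canonically_equivalent(a, b):
    """Hypothetical helper: compare two strings after NFD normalization.

    Two strings are canonically equivalent exactly when their
    canonical decompositions (NFD) are identical.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

composed = "L\u00F6wis"
decomposed = "Lo\u0308wis"

# "==" still compares the raw code point sequences.
raw_equal = composed == decomposed

# The explicit helper applies the Unicode notion of equivalence.
equivalent = canonically_equivalent(composed, decomposed)

print(raw_equal, equivalent)
```

Normalizing to NFC instead of NFD would give the same answer here; the point is only that the caller opts in to the equivalence explicitly.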

> > Note that I'm not arguing against normalization of *identifiers*. I
> > see that as a necessity. I also see that there will be border cases
> > where getattr(x, 'XXX') and x.XXX are not equivalent for some values
> > of XXX where the normalized form is a different sequence of code
> > points. But I don't believe the solution should be to normalize all
> > string literals.
>
> For strings created by an extension module, that would be valid.  But
> python source code is human-readable text, and should be treated that
> way.  Either follow the unicode rules (at least for strings), or don't
> call them unicode.

Again, did you realize that the example was about \u escapes?
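
The border case mentioned above, where getattr(x, 'XXX') and x.XXX are not equivalent, can be sketched as follows. This assumes the behaviour of Python 3 as released, where identifiers in source code are normalized to NFKC but strings passed to setattr() and getattr() are not; the class and variable names are illustrative only.

```python
import unicodedata

class Namespace:
    pass

x = Namespace()

name_decomposed = "o\u0308"                                      # 'o' + U+0308
name_composed = unicodedata.normalize("NFKC", name_decomposed)   # U+00F6

# setattr() takes the string as given: no normalization happens.
setattr(x, name_decomposed, 1)

# Source text goes through the parser, which normalizes the identifier,
# so this assignment targets the composed name instead.
exec("x." + name_decomposed + " = 2", {"x": x})

# The two spellings now refer to two different attributes.
via_decomposed = getattr(x, name_decomposed)
via_composed = getattr(x, name_composed)

print(via_decomposed, via_composed)
```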

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

