[Python-3000] String comparison

Bill Janssen janssen at parc.com
Thu Jun 7 03:57:40 CEST 2007


> So let me explain it. I see two different sequences of code points:
> 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
> 'w', 'i', 's' on the other. Never mind that Unicode has semantics that
> claim they are equivalent. They are two different sequences of code
> points.

If they were sequences of integers, or sequences of bytes, I'd agree
with you.  But they are explicitly sequences of characters, not
sequences of code points.  There should be one internal normalized form
for strings.

> We should not hide that Python's unicode string object can
> store each sequence of code points equally well, and that when viewed
> as a sequence they are different: the first has len() == 5, the second
> has len() == 6!

We should definitely not expose that difference!
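
For concreteness, here is a minimal sketch of the two literals under
discussion, as they behave when str is a sequence of code points.  The
variable names are illustrative, and the choice of NFC as the normalized
form is an assumption; nothing in this thread picks a particular form.

```python
import unicodedata

# Assumed illustration: the two spellings from the quoted message.
composed = "L\u00F6wis"      # o-with-diaeresis as one precomposed code point
decomposed = "Lo\u0308wis"   # 'o' followed by U+0308 COMBINING DIAERESIS

# Viewed as code point sequences, the two differ in length and content.
lengths = (len(composed), len(decomposed))
equal_raw = composed == decomposed

# After normalizing both to one form (NFC assumed here), they compare equal.
equal_nfc = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

print(lengths, equal_raw, equal_nfc)
```

With a single internal normalized form, only the last comparison would
be visible to the programmer.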

> When read from a file they are different.

A file is in UTF-8, or UTF-2, or whatever -- it contains a string
coerced to a sequence of bits.  Whatever reads that file should in
fact either preserve that sequence of bytes (in which case it's not a
string), or coerce it to a Unicode string, in which case the file
representation is immaterial and the Python normalized form is used
internally.
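
A rough sketch of that distinction follows.  The helper read_as_text is
hypothetical, and UTF-8 and NFC are assumed choices for the encoding and
the normalized form.

```python
import unicodedata

# Assumed illustration: file contents for the two spellings, as raw bytes.
raw_composed = "L\u00F6wis".encode("utf-8")
raw_decomposed = "Lo\u0308wis".encode("utf-8")

def read_as_text(data, encoding="utf-8", form="NFC"):
    """Hypothetical reader: decode bytes, then normalize to one form."""
    return unicodedata.normalize(form, data.decode(encoding))

# Kept as bytes, the two files are different byte sequences.
bytes_differ = raw_composed != raw_decomposed

# Coerced to strings through the normalizing reader, they are the same text.
text_same = read_as_text(raw_composed) == read_as_text(raw_decomposed)

print(bytes_differ, text_same)
```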

> I might be
> writing either literal with the expectation to get exactly that
> sequence of code points, in order to use it as a test case or as input
> for another program that requires specific input.

In that case you should write it as a sequence of integers, because
that's what you're dealing with.
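
As a sketch of what I mean, assuming the test case really needs exact
code points (the list names here are illustrative):

```python
# Assumed illustration: the two test inputs written as explicit integers.
composed_points = [0x4C, 0xF6, 0x77, 0x69, 0x73]            # L, U+00F6, w, i, s
decomposed_points = [0x4C, 0x6F, 0x308, 0x77, 0x69, 0x73]   # L, o, U+0308, w, i, s

# As integer sequences they are plainly different, with no Unicode
# semantics involved in the comparison.
differ = composed_points != decomposed_points

# If another program needs text, build it from the integers at the boundary.
as_text = "".join(chr(cp) for cp in decomposed_points)

print(differ, len(composed_points), len(decomposed_points))
```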

> There's a simpler solution. The unicode (or str, in Py3k) data type
> represents a sequence of code points, not a sequence of characters.
> This has always been the case, and will continue to be the case.

Bad idea, IMO.

Bill

