[Python-Dev] PEP 393 Summer of Code Project

Thu Aug 25 09:36:06 CEST 2011

>>    Strings contain Unicode code units, which for most purposes can be
>>    treated as Unicode characters.  However, even as "simple" an
>>    operation as "s1[0] == s2[0]" cannot be relied upon to give
>>    Unicode-conforming results.
>>
>> The second sentence remains true under PEP 393.
> 
> Really? If strings contain code units, that expression compares code
> units. What is non-conforming about comparing two code points? They
> are just integers.
> 
> Seriously, what does Unicode-conforming mean here?

I think he's referring to combining characters and normal forms.
Section 2.12 of the Unicode Standard starts with

"In cases involving two or more sequences considered to be equivalent,
the Unicode Standard does not prescribe one particular sequence as being
the correct one; instead, each sequence is merely equivalent to the
others"

That could be read to imply that the == operator should determine
whether two strings are equivalent. However, the Unicode standard
clearly leaves API design to the programming environment, and has
the notion of conformance only for processes. So saying that Python
is or is not Unicode-conforming is, strictly speaking, meaningless.

The closest conformance requirement in that respect is C6:

"A process shall not assume that the interpretations of two
canonical-equivalent character sequences are distinct."

However, that explicitly does *not* support the conformance statement
that Stephen made. The standard elaborates:

"Ideally, an implementation would always interpret two
canonical-equivalent character sequences identically. There are
practical circumstances under which implementations may reasonably
distinguish them."

So practicality beats purity even in Unicode conformance: the
== operator of Python can reasonably treat equivalent strings
as unequal (and there is a good reason for that, indeed). Processes
should not expect that other applications make the same distinction,
so they need to cope if it matters to them. There are different ways
to do that:
- normalize all strings on input, and then use ==
- use a different comparison operation that always normalizes
  its input first
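
As a rough illustration of both options, here is a minimal sketch using
the standard unicodedata module. The helper name canonical_equal and the
choice of NFC as the default form are assumptions made for this example
only; nothing like it is part of the str API.

```python
import unicodedata

composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# Plain == compares code points, so these canonical-equivalent
# strings are reported as unequal.
plain_result = composed == decomposed

# Option 1: normalize on input, then use == on the normalized strings.
norm_a = unicodedata.normalize("NFC", composed)
norm_b = unicodedata.normalize("NFC", decomposed)
normalized_result = norm_a == norm_b


# Option 2: a comparison that always normalizes its input first.
# canonical_equal is a hypothetical helper, not a library function.
def canonical_equal(s1, s2, form="NFC"):
    return unicodedata.normalize(form, s1) == unicodedata.normalize(form, s2)
```

The first option pays the normalization cost once per string; the second
pays it on every comparison, but works on strings of unknown origin.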

> This I agree with (though if you were referring to me with
> "leadership" I consider myself woefully underinformed about Unicode
> subtleties). I also suspect that Unicode "conformance" (however
> defined) is more part of a political battle than an actual necessity.

Fortunately, it's much better than that. Unicode has had very clear
conformance requirements for a long time, and they aren't hard
to meet.

Wrt. C6, Python could certainly improve, e.g. by caching whether
a string has been determined to be in normal form, so that applications
can more reasonably apply normalization to all strings they ever
want to compare.
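
To make the caching idea concrete, here is an application-level sketch.
It is only an analogy: the suggestion above is about caching inside the
interpreter's string object, whereas this wraps a str in a hypothetical
NormalizedStr class (a name invented for this example) and assumes NFC
as the form of interest.

```python
import unicodedata


class NormalizedStr:
    """Hypothetical wrapper that computes its NFC form at most once."""

    __slots__ = ("raw", "_nfc")

    def __init__(self, raw):
        self.raw = raw
        self._nfc = None  # filled in lazily on first comparison or hash

    @property
    def nfc(self):
        if self._nfc is None:
            self._nfc = unicodedata.normalize("NFC", self.raw)
        return self._nfc

    def __eq__(self, other):
        # Equality is defined on the cached normalized forms.
        if isinstance(other, NormalizedStr):
            return self.nfc == other.nfc
        return NotImplemented

    def __hash__(self):
        # Must agree with __eq__, so hash the normalized form as well.
        return hash(self.nfc)
```

With such a wrapper, repeated comparisons of the same string do not
repeat the normalization work, which is the effect a built-in cache
would be meant to achieve.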

Regards,
Martin