[Python-Dev] PEP 393 Summer of Code Project
Stephen J. Turnbull
stephen at xemacs.org
Thu Aug 25 13:58:30 CEST 2011
"Martin v. Löwis" writes:
> Am 25.08.2011 11:39, schrieb Stephen J. Turnbull:
> > "Martin v. Löwis" writes:
> >
> > > No, that's explicitly *not* what C6 says. Instead, it says that a
> > > process that treats s1 and s2 differently shall not assume that others
> > > will do the same, i.e. that it is ok to treat them the same even though
> > > they have different code points. Treating them differently is also
> > > conforming.
> >
> > Then what requirement does C6 impose, in your opinion?
>
> In IETF terminology, it's a weak SHOULD requirement. Unless there are
> reasons not to, equivalent strings should not be treated differently.
> It's a weak requirement because the reasons not to treat them as
> equivalent are widespread.
There are no "weak SHOULDs" and no "wide-spread reasons" in RFC 2119.
RFC 2119 allows ignoring a SHOULD only in "particular circumstances",
and only after the "full implications" have been understood and
"carefully weighed". IMHO the Unicode Standard intends a full RFC 2119
"SHOULD" here.
> Yes, but that's the operating system's choice first of all. Some
> operating systems do allow file names in a single directory that
> are equivalent yet use different code points. Python then needs to
> support this operating system, despite the permission of the
> Unicode standard to ignore the difference.
Sure, and that's one of several such reasons why I think the PEP's
implementation of unicode strings as arrays of code points is an optimal
balance. But the Unicode standard does not "permit" ignoring the
difference here, except in the sense that *the Unicode standard
doesn't apply at all* and therefore doesn't forbid it. The OSes in
question are not conforming processes, and presumably don't claim to
be.
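For concreteness, here is a minimal sketch in plain Python 3 (nothing
PEP-393-specific) of two file names that are canonically equivalent
but use different code points; whether they name the same file is the
file system's decision, not Unicode's:

    import unicodedata

    nfc_name = "caf\u00e9"     # 'caf' + U+00E9, precomposed e-acute
    nfd_name = "cafe\u0301"    # 'cafe' + U+0301 COMBINING ACUTE ACCENT

    print(nfc_name == nfd_name)                                 # False: different code points
    print(unicodedata.normalize("NFC", nfd_name) == nfc_name)   # True: canonically equivalent

    # A file system that stores names byte-for-byte (ext4, say) will happily
    # create two distinct files from these two names; HFS+ normalizes names
    # itself, so there they refer to the same file.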
Because most of the processes Python interacts with won't be
conforming processes (not even the majority of textual applications,
for a while), Python does not need to be, and *should not* be, a
conforming Unicode process for most of what it does. Not even for
much of its text processing.
Also, to the extent that Python is a general-purpose language, I see
nothing wrong and lots of good in having a non-conformant code point
array type as the platform for implementing conforming Unicode
library(ies).
But this is not user/developer-friendly at all:
> Wrt. normalization, I think all that's needed is already there.
> Applications just need to normalize all strings to a normal form of
> their liking, and be done. That's easier than using a separate library
> throughout the code base (let alone using yet another string type).
But many users have never heard of normalization. And that's *just*
normalization. There is a whole raft of other requirements for
conformance (collation, case, etc.).
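A quick sketch of why "just normalize" undersells the problem (the
str.casefold() call here is itself only arriving in 3.3; locale-aware
collation needs still more machinery, e.g. locale.strcoll or PyICU):

    import unicodedata

    def nfc(s):
        # The one step the quoted advice asks every application to remember.
        return unicodedata.normalize("NFC", s)

    s1 = "Stra\u00dfe"   # 'Straße'
    s2 = "STRASSE"

    print(nfc(s1) == nfc(s2))                         # False: normalization alone isn't enough
    print(nfc(s1).casefold() == nfc(s2).casefold())   # True: caseless matching needs full case folding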
The point is that with such a library and string type, various aspects
of conformance to Unicode, as well as conformance to associated
standards (e.g., the dreaded UTS #18 ;-) can be added to the library
over time, and most users (those who don't need to squeeze every ounce
of performance out of Python) can be blissfully unaware of what, if
anything, they're conforming to. Just upgrade the library to get the
best Unicode support (in terms of conformance) that Python has to
offer.
But for the reasons you (and Guido and Nick and ...) give, it's not
reasonable to put all that into core Python, not anytime soon. Not to
mention that as a work-in-progress, it can hardly be considered stable
enough for the stdlib.
That is what Terry Reedy is getting at, AIUI. "Batteries included"
should mean as much Unicode conformance as we can reasonably provide
should be *conveniently* available. The ideal (given the caveat about
efficiency) would be *one* import statement and a ConformingUnicode type
that acts "just like a string" in all ways, except that (1) it indexes
and counts by characters (preferably "grapheme clusters" :-), (2) does
collation, regexps, and the like conformant to the Unicode standard,
and (3) may be quite inefficient from the point of view of bit-
shoveling net applications and the like.
Of course most of (2) is going to take quite a while, but (1) and (3)
should not be that hard to accomplish (especially (3) ;-).
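To make (1) concrete, here is a deliberately crude sketch; the name
ConformingUnicode and the cluster rule (a base character plus any
trailing combining marks) are simplifications for illustration, where
a real implementation would follow UAX #29:

    import unicodedata

    class ConformingUnicode:
        """Toy wrapper that indexes and counts by (approximate) grapheme cluster."""

        def __init__(self, s):
            self._s = unicodedata.normalize("NFC", s)
            self._clusters = self._split(self._s)

        @staticmethod
        def _split(s):
            clusters = []
            for ch in s:
                # Crude stand-in for UAX #29: glue combining marks (category M*)
                # onto the preceding base character.
                if clusters and unicodedata.category(ch).startswith("M"):
                    clusters[-1] += ch
                else:
                    clusters.append(ch)
            return clusters

        def __len__(self):
            return len(self._clusters)

        def __getitem__(self, i):
            return self._clusters[i]

    v = ConformingUnicode("x\u20d7y")   # 'x' + COMBINING RIGHT ARROW ABOVE + 'y'
    print(len(v), v[0])                 # 2 'x⃗'  -- len() of the raw str is 3

Collation and regexps (point (2)) would hang off the same type; that is
where the real work lies.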
> > I'm simply saying that the current implementation of strings, as
> > improved by PEP 393, can not be said to be conforming.
>
> I continue to disagree. The Unicode standard deliberately allows
> Python's behavior as conforming.
That's up to you. I doubt very many users or application developers
will see it that way, though. I think they would prefer that we be
conservative about what we call "conformant", and tell them precisely
what they need to do to get what they consider conformant behavior
from Python. That's easier if we share definitions of conformant with
them. And surely there would be great joy on the battlements if there
were a one-import way to spell "all the Unicode conformance you can
give me, please".
The problem with your legalistic approach, as I see it, is that if our
definition is looser than the users', all their surprises will be
unpleasant. That's not good.