[Python-Dev] PEP 393 Summer of Code Project
Stephen J. Turnbull
stephen at xemacs.org
Thu Aug 25 06:12:17 CEST 2011
Guido van Rossum writes:
> On Wed, Aug 24, 2011 at 5:31 PM, Stephen J. Turnbull
> <turnbull at sk.tsukuba.ac.jp> wrote:
> > Strings contain Unicode code units, which for most purposes can be
> > treated as Unicode characters. However, even as "simple" an
> > operation as "s1[0] == s2[0]" cannot be relied upon to give
> > Unicode-conforming results.
> >
> > The second sentence remains true under PEP 393.
>
> Really? If strings contain code units, that expression compares code
> units.
That's true out of context, but the context is "which for most
purposes can be treated as Unicode characters", and that is what Terry
is concerned with as well.
> What is non-conforming about comparing two code points?
Unicode conformance means treating characters correctly.  In
particular, s1 and s2 might be NFC and NFD forms of the same string,
with a combining character at s2[1]; or s1[1] and s2[1] might be a
non-combining character and a combining character, respectively.
> Seriously, what does Unicode-conforming mean here?
Chapter 3, all verses.  Here, specifically C6, p. 60.  One would have
to define the process executing "s1[0] == s2[0]" to be sure that
non-conformance actually occurs even in the cases cited in the
previous paragraph, but one example of a process where it is
non-conforming (without additional code to check for trailing
combining characters) is comparison of Vietnamese filenames generated
on a Mac with those generated on a Linux host.
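To make the pitfall concrete, here is a minimal sketch (assuming a
Python 3 interpreter and the stdlib unicodedata module; the sample
character is just for illustration):

    import unicodedata

    # U+1EC7 LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW,
    # which occurs in Vietnamese.  NFC is one code point; NFD is
    # three: 'e' + U+0323 COMBINING DOT BELOW + U+0302 COMBINING
    # CIRCUMFLEX ACCENT.
    s1 = unicodedata.normalize("NFC", "\u1ec7")
    s2 = unicodedata.normalize("NFD", "\u1ec7")

    print(len(s1), len(s2))    # 1 3
    print(s1 == s2)            # False: same abstract text, unequal strings
    print(s1[0] == s2[0])      # False: 'ệ' vs. plain 'e'

    # A conforming comparison normalizes both sides first:
    print(unicodedata.normalize("NFC", s1) ==
          unicodedata.normalize("NFC", s2))   # True

HFS+ stores filenames in a decomposed form, so the s2 case is what
the OS hands back on a Mac, while a Linux host typically returns
whatever bytes were originally written.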
> > No, you're not. You are claiming an isomorphism, which Unicode goes
> > to great trouble to avoid.
>
> I don't know that we will be able to educate our users to the point
> where they will use code unit, code point, character, glyph, character
> set, encoding, and other technical terms correctly.
Sure. I got it wrong myself earlier.
I think that the right thing to do is to provide a conformant
implementation of Unicode text in the stdlib (a long-run goal; see
below), and call that "Unicode", while we call strings "strings".
> Now I am happy to admit that for many Unicode issues the level at
> which we have currently defined things (code units, I think -- the
> thingies that encodings are made of) is confusing, and it would be
> better to switch to the others (code points, I think).
Yes, and AFAICT (I'm better at reading standards than I am at reading
the Python implementation) PEP 393 allows that.
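For concreteness, the distinction in a few lines (the narrow-build
behavior is what a UTF-16 build of 3.2 gives today; the other is what
PEP 393 specifies):

    # One code point outside the BMP: U+1D11E MUSICAL SYMBOL G CLEF.
    s = "\U0001D11E"

    # On a narrow (UTF-16) build, indexing works on code units:
    #   len(s)  -> 2             (a surrogate pair)
    #   s[0]    -> '\ud834'      (a lone surrogate, not a character)
    # Under PEP 393, indexing works on code points:
    #   len(s)  -> 1
    #   s[0]    -> '\U0001d11e'
    print(len(s), repr(s[0]))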
> But characters are right out.
+1
> It is not so easy to change expectations about O(1) vs. O(N) behavior
> of indexing however. IMO we shouldn't try and hence we're stuck with
> operations defined in terms of code thingies instead of (mostly
> mythical) characters.
Well, O(N) is not really the question.  It's really O(log N), as Terry
says.  Is that out, too?  I can verify that it's workable in practice
over the long term.  In my experience with Emacs, even with 250 MB
files, O(log N) indexing mostly gives acceptable performance in an
interactive editor, as well as in many scripted textual applications.
The problems that I see are
(1) It's very easy to write algorithms that would be O(N) for a true
    array, but become O(N log N) or worse (and the coefficient on the
    O(log N) algorithm is much higher to start with).  I suspect this
    alone would kill the idea.
(2) Maintenance is fragile; it's easy to break the necessary caches
    with feature additions and bug fixes.  (However, I don't think
    this would be as big a problem for Python as it has been for
    XEmacs, given Python's more disciplined process.)
You might think space for the caches would be a problem, but that has
turned out not to be the case for Emacsen.
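For the record, the kind of cache I mean looks something like the
following sketch (Python 3; this is *not* the PEP 393 design, and the
class name and stride are made up for illustration): text kept in a
variable-width encoding, with periodic checkpoints mapping code-point
indices to byte offsets, so indexing costs a bisect plus a bounded
scan.

    import bisect

    class Utf8Buffer:
        # Sketch only: text stored as UTF-8 bytes, plus a checkpoint
        # table recording the byte offset of every STRIDE-th code
        # point.  Indexing bisects the table (O(log N)), then scans
        # at most STRIDE code points within the buffer.
        STRIDE = 128

        def __init__(self, text):
            self._bytes = text.encode("utf-8")
            self._cp = []     # code-point index at each checkpoint
            self._off = []    # corresponding byte offset
            cp = off = 0
            for ch in text:
                if cp % self.STRIDE == 0:
                    self._cp.append(cp)
                    self._off.append(off)
                off += len(ch.encode("utf-8"))
                cp += 1
            self._len = cp

        def __getitem__(self, i):
            if not 0 <= i < self._len:
                raise IndexError(i)
            k = bisect.bisect_right(self._cp, i) - 1
            cp, off = self._cp[k], self._off[k]
            b = self._bytes
            while cp < i:                  # walk to the i-th code point
                off += 1
                while off < len(b) and b[off] & 0xC0 == 0x80:
                    off += 1               # skip continuation bytes
                cp += 1
            end = off + 1
            while end < len(b) and b[end] & 0xC0 == 0x80:
                end += 1
            return b[off:end].decode("utf-8")

The fragility I mean in (2) is that every mutating operation has to
keep the checkpoint table consistent; get that wrong anywhere and
indexing silently returns the wrong character.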
> Let's take small steps. Do the evolutionary thing. Let's get things
> right so users won't have to worry about code points vs. code units
> any more. A conforming library for all things at the character level
> can be developed later, once we understand things better at that level
> (again, most developers don't even understand most of the subtleties,
> so I claim we're not ready).
I don't think anybody does. That's one reason there's a new version
of Unicode every few years.
> This I agree with (though if you were referring to me with
> "leadership" I consider myself woefully underinformed about Unicode
> subtleties).
<wink/> MvL and MAL are not underinformed, however, and there are
plenty of others who make contributions -- in an orderly fashion.
> I also suspect that Unicode "conformance" (however defined) is more
> part of a political battle than an actual necessity. I'd much
> rather have us fix Tom Christiansen's specific bugs than chase the
> elusive "standard conforming".
Well, I would advocate specifying which parts of the standard we
target and which we don't (for any given version).  The goal of full
"Chapter 3" conformance should be left to a library on PyPI for the
nonce, IMO.  I agree that fixing specific bugs should take precedence
over "conformance chasing," but the implementation should conform to
the appropriate parts of the standard.
> (Hey, I feel a QOTW coming. "Standards? We don't need no stinkin'
> standards." http://en.wikipedia.org/wiki/Stinking_badges :-)
RMS beat you to that. Not good company to be in, in this case: he
specifically disclaims the goal of portability to non-GNU-System
systems.