[Python-Dev] PEP 393 Summer of Code Project

Guido van Rossum guido at python.org
Thu Aug 25 04:29:39 CEST 2011


On Wed, Aug 24, 2011 at 5:31 PM, Stephen J. Turnbull
<turnbull at sk.tsukuba.ac.jp> wrote:
> Terry Reedy writes:
>
>  > Please suggest a re-wording then, as it is a bug for doc and behavior to
>  > disagree.
>
>    Strings contain Unicode code units, which for most purposes can be
>    treated as Unicode characters.  However, even as "simple" an
>    operation as "s1[0] == s2[0]" cannot be relied upon to give
>    Unicode-conforming results.
>
> The second sentence remains true under PEP 393.

Really? If strings contain code units, that expression compares code
units. What is non-conforming about comparing two code points? They
are just integers.

Seriously, what does Unicode-conforming mean here? It would be better
to specify chapter and verse (e.g. is it a specific thing defined by
the dreaded TR18?)
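
(To make this concrete -- a toy sketch of what I take the objection to
be, not anything from the PEP: on a narrow build an astral character is
stored as two surrogate code units, so indexing hands you a lone
surrogate rather than the character.

    # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP
    s1 = "\U0001D11E"
    # The same character spelled out as a UTF-16 surrogate pair
    s2 = "\ud834\udd1e"

    # Narrow build: len(s1) == 2 and s1[0] == s2[0] compares surrogates
    # (True).  Wide build / PEP 393: len(s1) == 1 and the strings differ.
    print(len(s1), s1 == s2)

Whether that is "non-conforming" is exactly what I'm asking about.)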

>  > >   >  For the purpose of my sentence, the same thing in that code points
>  > >   >  correspond to characters,
>  > >
>  > > Not in Unicode, they do not.  By definition, a small number of code
>  > > points (eg, U+FFFF) *never* did and *never* will correspond to
>  > > characters.
>  >
>  > On computers, characters are represented by code points. What about the
>  > other way around? http://www.unicode.org/glossary/#C says
>  > code point:
>  > 1) i in range(0x110000) <broad definition>
>  > 2) "A value, or position, for a character" <narrow definition>
>  > (To muddy the waters more, 'character' has multiple definitions also.)
>  > You are using 1), I am using 2) ;-(.
>
> No, you're not.  You are claiming an isomorphism, which Unicode goes
> to great trouble to avoid.

I don't know that we will be able to educate our users to the point
where they will use code unit, code point, character, glyph, character
set, encoding, and other technical terms correctly. TBH, even though I
composed a reply in this thread less than two hours ago, I've already
forgotten which is a code point and which is a code unit.
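
(For the record, the distinction as I understand it right now: a code
point is the integer the standard assigns to a "character", and a code
unit is the fixed-width chunk a particular encoding slices those
integers into.  A quick refresher, my own sketch:

    ch = "\U0001F40D"              # one code point, U+1F40D
    print(hex(ord(ch)))            # 0x1f40d -- the code point
    print(ch.encode("utf-16-le"))  # 4 bytes == two 16-bit code units
    print(ch.encode("utf-8"))      # four 8-bit code units

So UTF-16 needs two code units -- a surrogate pair -- for this single
code point, and UTF-8 needs four.)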

>  > I think you have it backwards. I see the current situation as the purity
>  > of the C code beating the practicality for the user of getting right
>  > answers.
>
> Sophistry.  "Always getting the right answer" is purity.

Eh? In most other areas Python is pretty careful not to promise to
"always get the right answer" since what is right is entirely in the
user's mind. We often go to great lengths to define how things work
so as to set the right expectations. For example, variables in Python
work differently than in most other languages.

Now I am happy to admit that for many Unicode issues the level at
which we have currently defined things (code units, I think -- the
thingies that encodings are made of) is confusing, and it would be
better to switch to the others (code points, I think). But characters
are right out.

>  > > The thing is, that 90% of applications are not really going to care
>  > > about full conformance to the Unicode standard.
>  >
>  > I remember when Intel argued that 99% of applications were not going to
>  > be affected when the math coprocessor in its then new chips occasionally
>  > gave 'non-standard' answers with certain divisors.
>
> In the case of Intel, the people who demanded standard answers did so
> for efficiency reasons -- they needed the FPU to DTRT because
> implementing FP in software was always going to be too slow.  CPython,
> IMO, can afford to trade off because the implementation will
> necessarily be in software, and can be added later as a Python or C module.

It is not so easy to change expectations about O(1) vs. O(N) behavior
of indexing, however. IMO we shouldn't try, and hence we're stuck with
operations defined in terms of code thingies instead of (mostly
mythical) characters.
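
(A toy illustration of the trade-off, not a proposal: with a
variable-width representation such as UTF-8 you cannot jump straight to
the i-th code point, you have to walk the bytes from the start, which
is exactly the O(N) behavior we don't want to impose on indexing.

    def utf8_index(data, i):
        """Return the i-th code point of UTF-8 bytes -- O(N), not O(1)."""
        count = -1
        for pos, byte in enumerate(data):
            # Bytes of the form 0b10xxxxxx are continuation bytes;
            # anything else starts a new code point.
            if byte & 0xC0 != 0x80:
                count += 1
                if count == i:
                    return ord(data[pos:].decode("utf-8")[0])
        raise IndexError(i)

    print(hex(utf8_index("naïve £1".encode("utf-8"), 2)))  # 0xef ('ï')

A fixed-width representation per string -- which is what PEP 393 keeps
-- makes s[i] a constant-time array access instead.)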

>  > I believe my scheme could be extended to solve [conformance for
>  > composing characters] also. It would require more pre-processing
>  > and more knowledge than I currently have of normalization. I have
>  > the impression that the grapheme problem goes further than just
>  > normalization.
>
> Yes and yes.  But now you're talking about database lookups for every
> character (to determine if it's a composing character).  Efficiency of
> a generic implementation isn't going to happen.

Let's take small steps. Do the evolutionary thing. Let's get things
right so users won't have to worry about code points vs. code units
any more. A conforming library for all things at the character level
can be developed later, once we understand things better at that level
(again, most developers don't even understand most of the subtleties,
so I claim we're not ready).
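
(An example of the character-level subtleties I mean, using nothing
beyond today's stdlib: two strings that any user would call the same
"character" compare unequal at the code point level, and a library has
to know about normalization to say otherwise.

    import unicodedata

    composed = "\u00e9"     # 'é' as a single code point
    decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

    print(composed == decomposed)  # False: different code point sequences
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True

And normalization is only the start; grapheme clusters, collation and
so on go further still.)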

> Anyway, in Martin's rephrasing of my (imperfect) memory of Guido's
> pronouncement, "indexing is going to be O(1)".

I still think that. It would be too big of a cultural upheaval to change it.

>  And Nick's point about
> non-uniform arrays is telling.  I have 20 years of experience with an
> implementation of text as a non-uniform array which presents an array
> API, and *everything* needs to be special-cased for efficiency, and
> *any* small change can have show-stopping performance implications.
>
> Python can probably do better than Emacs has done due to much better
> leadership in this area, but I still think it's better to make full
> conformance optional.

This I agree with (though if you were referring to me with
"leadership" I consider myself woefully underinformed about Unicode
subtleties). I also suspect that Unicode "conformance" (however
defined) is more part of a political battle than an actual necessity.
I'd much rather have us fix Tom Christiansen's specific bugs than
chase the elusive "standard conforming".

(Hey, I feel a QOTW coming. "Standards? We don't need no stinkin'
standards." http://en.wikipedia.org/wiki/Stinking_badges :-)

-- 
--Guido van Rossum (python.org/~guido)

