[Python-Dev] PEP 393 Summer of Code Project

Stephen J. Turnbull stephen at xemacs.org
Wed Aug 24 18:34:17 CEST 2011


Terry Reedy writes:

 > Excuse me for believing the fine 3.2 manual that says
 > "Strings contain Unicode characters."

The manual is wrong, then, subject to a pronouncement to the contrary,
of course.  I was on your side of the fence when this was discussed,
pre-release.  I was wrong then.  My bet is that we are still wrong,
now.

 > For the purpose of my sentence, the same thing in that code points 
 > correspond to characters,

Not in Unicode, they do not.  By definition, a small number of code
points (e.g., U+FFFF) *never* did and *never* will correspond to
characters.  Since about Unicode 3.0, the same is true of surrogate
code points.  Some restrictions have been placed on what can be done
with composed characters, so even with the PEP (which gives us code
point arrays) we do not really get arrays of Unicode characters that
fully conform to the model.
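For concreteness, a small sketch of the first point, as observable on any
modern CPython 3 (the category names come from the stdlib `unicodedata`
module):

```python
import unicodedata

# U+FFFF is a "noncharacter": permanently reserved and guaranteed
# never to be assigned a character.  Its general category is Cn.
assert unicodedata.category('\uffff') == 'Cn'

# Surrogate code points (U+D800..U+DFFF) have category Cs; they exist
# only as UTF-16 encoding machinery, not as characters.
assert unicodedata.category('\ud800') == 'Cs'

# Accordingly, Python refuses to encode a lone surrogate as UTF-8:
try:
    '\ud800'.encode('utf-8')
except UnicodeEncodeError:
    print('a lone surrogate is not encodable')
```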

 > strings are NOT code point sequences. They are 2-byte code *unit* 
 > sequences.

I stand corrected on Unicode terminology.  "Code unit" is what I meant,
and what I understand Guido to have defined unicode objects as arrays of.

 > Any narrow build string with even 1 non-BMP char violates the
 > standard.

Yup.  That's by design.
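On a narrow build that meant, e.g., len('\U0001F600') == 2: the string
held a surrogate pair of code *units*, not one code point.  The two
16-bit units are still visible on any build in the encoded form; a
sketch:

```python
s = '\U0001F600'                       # a non-BMP code point
utf16 = s.encode('utf-16-be')
assert len(utf16) == 4                 # two 16-bit code units

hi = int.from_bytes(utf16[:2], 'big')  # lead (high) surrogate
lo = int.from_bytes(utf16[2:], 'big')  # trail (low) surrogate
assert 0xD800 <= hi <= 0xDBFF
assert 0xDC00 <= lo <= 0xDFFF
```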

 > > Guido has made that absolutely clear on a number
 > > of occasions.
 > 
 > It is not clear what you mean, but recently on python-ideas he has 
 > reiterated that he intends bytes and strings to be conceptually 
 > different.

Sure.  Nevertheless, practicality beat purity long ago, and that
decision has never been rescinded AFAIK.
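For what it's worth, that conceptual split is directly visible at the
indexing level in Python 3; a minimal illustration:

```python
# Bytes are arrays of small integers; strings are arrays of code points.
b = 'abc'.encode('ascii')
assert b[0] == 97           # indexing bytes yields an int
assert 'abc'[0] == 'a'      # indexing str yields a length-1 str
```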

 > Bytes are computer-oriented binary arrays; strings are 
 > supposedly human-oriented character/codepoint arrays.

And indeed they are, in UCS-4 builds.  But they are *not* in Unicode!
Unicode violates the array model.  Specifically, in handling composing
characters, and in bidi, where arbitrary slicing of direction control
characters will result in garbled display.
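A small sketch of the bidi point, using the RLO/PDF control pair: any
code-point slice can strand the override without its terminator.

```python
# RIGHT-TO-LEFT OVERRIDE (U+202E) ... POP DIRECTIONAL FORMATTING (U+202C)
s = 'abc\u202Edef\u202C'
frag = s[:5]                 # a slice that cuts between the paired controls
assert '\u202e' in frag      # the override survives...
assert '\u202c' not in frag  # ...its matching pop does not
```

Any later display of `frag` will apply the unterminated override to
whatever text follows it.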

The thing is that 90% of applications are not really going to care
about full conformance to the Unicode standard.  Of the remaining 10%,
90% are not going to need both huge strings *and* ABI interoperability
with C modules compiled for UCS-2, so UCS-4 is satisfactory.  Of the
remaining 1% of all applications, those that deal with huge strings
*and* need full Unicode conformance, well, they need efficiency too
almost by definition.  They probably are going to want something more
efficient than either the UTF-16 or the UTF-32 representation can
provide, and therefore will need trickier, possibly app-specific,
algorithms that probably do not belong in an initial implementation.

 > > And the reasons have very little to do with lack of
 > > non-BMP characters to trip up the implementation.  Changing those
 > > semantics should have been done before the release of Python 3.
 > 
 > The documentation was changed at least a bit for 3.0, and anyway, as 
 > indicated above, it is easy (especially for new users) to read the docs 
 > in a way that makes the current behavior buggy. I agree that the 
 > implementation should have been changed already.

I don't.  I suspect Guido does not, even today.

 > Currently, the meaning of Python code differs on narrow versus wide
 > build, and in a way that few users would expect or want.

Let them become developers, then, and show us how to do it better.

 > PEP 393 abolishes narrow builds as we now know them and changes
 > semantics. I was answering a complaint about that change. If you do
 > not like the PEP, fine.

No, I do like the PEP.  However, it is only a step, a rather
conservative one in some ways, toward conformance to the Unicode
character model.  In particular, it does nothing to resolve the fact
that len() will give different answers for character count depending
on normalization, and that slicing and indexing will allow you to cut
characters in half (even in NFC, since not all composed characters
have fully composed forms).
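Both points are easy to demonstrate with the stdlib `unicodedata`
module ('q' + COMBINING DOT ABOVE is one of the combinations that has
no precomposed form); a sketch:

```python
import unicodedata

# The same user-perceived character has a different len() depending on
# normalization form:
nfc = unicodedata.normalize('NFC', 'e\u0301')  # composes to U+00E9
nfd = unicodedata.normalize('NFD', '\u00e9')   # decomposes to 'e' + U+0301
assert nfc == '\u00e9' and len(nfc) == 1
assert nfd == 'e\u0301' and len(nfd) == 2

# And some combining sequences have no precomposed form, so even NFC
# text can be cut in half by indexing:
q_dot = unicodedata.normalize('NFC', 'q\u0307')
assert len(q_dot) == 2
assert q_dot[:1] == 'q'      # the slice severs the combining mark
```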

 > > It is not clear to me that it is a good idea to try to decide on "the"
 > > correct implementation of Unicode strings in Python even today.
 > 
 > If the implementation is invisible to the Python user, as I believe it 
 > should be without special introspection, and mostly invisible in the 
 > C-API except for those who intentionally poke into the details, then the 
 > implementation can be changed as the consensus on best implementation 
 > changes.

A naive implementation of UTF-16 will be quite visible in terms of
performance, I suspect, and performance-oriented applications will "go
behind the API's back" to get it.  We're already seeing that in the
people who insist that bytes are characters too, and string APIs
should work on them just as they do on (Unicode) strings.
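And indeed Python 3's bytes type already grew ASCII-only variants of
the str methods; a small illustration:

```python
# str-like APIs on bytes, but defined only over the ASCII range:
assert b'abc'.upper() == b'ABC'
assert b'a,b'.split(b',') == [b'a', b'b']

# Non-ASCII bytes are left untouched -- .upper() knows nothing of
# the characters the bytes might encode:
e_acute = '\u00e9'.encode('utf-8')     # b'\xc3\xa9'
assert e_acute.upper() == e_acute
```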

 > > It's true that Python is going to need good libraries to provide
 > > correct handling of Unicode strings (as opposed to unicode objects).
 > 
 > Given that 3.0 unicode (string) objects are defined as Unicode character 
 > strings, I do not see the opposition.

I think they're not, I think they're defined as Unicode code unit
arrays, and that the documentation is in error.  If the documentation
is correct, then Python 3.0 was released about 5 years too early,
because correct handling of those objects as arrays of Unicode
characters has never been implemented or even discussed in terms of
proposed code that I know of.

Martin has long claimed that the fact that I/O is done in terms of
UTF-16 means that the internal representation is UTF-16, so I could be
wrong.  But when issues of slicing, len() values and so on have come
up in the past, Guido has always said "no, there will be no change in
semantics of builtins here".


