[Python-Dev] PEP 393 Summer of Code Project

Stephen J. Turnbull turnbull at sk.tsukuba.ac.jp
Wed Aug 24 10:22:37 CEST 2011


Terry Reedy writes:

 > The current UCS2 Unicode string implementation, by design, quickly gives 
 > WRONG answers for len(), iteration, indexing, and slicing if a string 
 > contains any non-BMP (surrogate pair) Unicode characters. That may have 
 > been excusable when there essentially were no such extended chars, and 
 > the few there were were almost never used.

Well, no, it gives the right answer according to the design.  unicode
objects do not contain character strings.  By design, they contain
code point strings.  Guido has made that absolutely clear on a number
of occasions.  And the reasons have very little to do with lack of
non-BMP characters to trip up the implementation.  Changing those
semantics should have been done before the release of Python 3.
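
For concreteness, here is the behavior in question, on a narrow
(UCS-2) build such as the default Windows build of 3.2 (on a wide,
UCS-4 build all three operations see a single code point):

    # One non-BMP character: U+1D11E MUSICAL SYMBOL G CLEF.
    s = "\U0001D11E"

    # On a narrow build the character is stored as a surrogate pair,
    # so len(), indexing, and slicing all see two items.
    print(len(s))        # narrow: 2           wide: 1
    print(ascii(s[0]))   # narrow: '\ud834'    wide: '\U0001d11e'
    print(ascii(s[:1]))  # slicing likewise splits the pair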

It is not clear to me that it is a good idea to try to decide on "the"
correct implementation of Unicode strings in Python even today.  There
are a number of approaches that I can think of.

1.  The "too bad if you can't take a joke" approach: do nothing and
    recommend UTF-32 to those who want len() to DTRT.
2.  The "slope is slippery" approach: Implement UTF-16 objects as
    built-ins, and then try to fend off requests for correct treatment
    of unnormalized composed characters, normalization, compatibility
    substitutions, bidi, etc etc.
3.  The "are we not hackers?" approach: Implement a transform that
    maps characters that are not represented by a single code point
    into Unicode private space, and then see if anybody really needs
    more than 6400 non-BMP characters.  (Note that this would
    generalize to composed characters that don't have a one-code-point
    NFC form and similar non-standardized cases that nonstandard users
    might want handled.)
4.  The "42" approach: sadly, I can't think deeply enough to explain it.

There are probably others.
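
Here is roughly what I have in mind for approach 3.  This is only a
sketch; the names are made up, nothing in the stdlib works this way,
and where the mapping table lives is waved away:

    # Sketch of approach 3: fold characters that need more than one
    # code unit into the BMP Private Use Area U+E000..U+F8FF, which
    # holds exactly 6400 code points.  (Hypothetical helper names.)
    PUA_START, PUA_END = 0xE000, 0xF8FF

    _forward = {}   # original character -> PUA stand-in
    _reverse = {}   # PUA stand-in -> original character

    def _characters(text):
        # Yield whole characters, re-joining surrogate pairs so the
        # sketch also works on a narrow (UCS-2) build.
        i = 0
        while i < len(text):
            ch = text[i]
            if (0xD800 <= ord(ch) <= 0xDBFF and i + 1 < len(text)
                    and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF):
                yield ch + text[i + 1]
                i += 2
            else:
                yield ch
                i += 1

    def fold(text):
        """Replace each character that is not a single BMP code
        point with a stand-in from the Private Use Area."""
        out = []
        for ch in _characters(text):
            if len(ch) > 1 or ord(ch) > 0xFFFF:
                if ch not in _forward:
                    if len(_forward) >= PUA_END - PUA_START + 1:
                        raise ValueError("more than 6400 stand-ins needed")
                    stand_in = chr(PUA_START + len(_forward))
                    _forward[ch] = stand_in
                    _reverse[stand_in] = ch
                out.append(_forward[ch])
            else:
                out.append(ch)
        return "".join(out)

    def unfold(text):
        """Undo fold(), restoring the original characters."""
        return "".join(_reverse.get(ch, ch) for ch in text)

After fold(), len(), indexing, and slicing count each folded
character exactly once, even on a narrow build, at the price of
hiding the real code points until unfold() is applied.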

It's true that Python is going to need good libraries to provide
correct handling of Unicode strings (as opposed to unicode objects).
But given the wide variety of implementations I can imagine, it's
not clear to me that there will be one best implementation, let
alone which ones are good and Pythonic and which are not.
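
To give an idea of what I mean by "correct handling": even something
as simple as counting user-perceived characters has to come from a
library, because the string type will never do it for you.  A
deliberately crude sketch (char_len is a made-up name; real
segmentation into grapheme clusters is specified by UAX #29 and is
considerably more involved):

    import unicodedata

    def char_len(text):
        # Approximate a character count by not counting combining
        # marks; a crude stand-in for grapheme cluster counting.
        return sum(1 for ch in text if not unicodedata.combining(ch))

    print(len("e\u0301"))       # 2 code points: "e" + COMBINING ACUTE ACCENT
    print(char_len("e\u0301"))  # 1 user-perceived character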


