[Python-Dev] PEP 393 Summer of Code Project
Stephen J. Turnbull
turnbull at sk.tsukuba.ac.jp
Wed Aug 24 10:22:37 CEST 2011
Terry Reedy writes:
> The current UCS2 Unicode string implementation, by design, quickly gives
> WRONG answers for len(), iteration, indexing, and slicing if a string
> contains any non-BMP (surrogate pair) Unicode characters. That may have
> been excusable when there essentially were no such extended chars, and
> the few there were were almost never used.
Well, no, it gives the right answer according to the design. unicode
objects do not contain character strings. By design, they contain
code point strings. Guido has made that absolutely clear on a number
of occasions. And the reasons have very little to do with lack of
non-BMP characters to trip up the implementation. Changing those
semantics should have been done before the release of Python 3.
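A minimal sketch of the distinction at issue, assuming Python 3.3 or later, where len() counts code points. The count a narrow (UCS2) build would report is emulated here by counting UTF-16 code units; the helper name utf16_code_units is hypothetical and exists only for illustration.

```python
def utf16_code_units(s):
    """Count UTF-16 code units, which is what len() reported on a narrow build."""
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" adds no byte order mark.
    return len(s.encode("utf-16-le")) // 2


# 'a' followed by MUSICAL SYMBOL G CLEF (U+1D11E), which lies outside the BMP.
s = "a\U0001D11E"

code_points = len(s)               # one per code point on Python 3.3+
code_units = utf16_code_units(s)   # the non-BMP character needs a surrogate pair

print(code_points, code_units)
```

The two counts differ by exactly one for each non-BMP character in the string, which is the discrepancy Terry describes for len(), indexing, and slicing.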
It is not clear to me that it is a good idea to try to decide on "the"
correct implementation of Unicode strings in Python even today. There
are a number of approaches that I can think of.
1. The "too bad if you can't take a joke" approach: do nothing and
   recommend UTF-32 to those who want len() to DTRT.
2. The "slope is slippery" approach: Implement UTF-16 objects as
   built-ins, and then try to fend off requests for correct treatment
   of unnormalized composed characters, normalization, compatibility
   substitutions, bidi, etc etc.
3. The "are we not hackers?" approach: Implement a transform that
   maps characters that are not represented by a single code point
   into Unicode private space, and then see if anybody really needs
   more than 6400 non-BMP characters. (Note that this would
   generalize to composed characters that don't have a one-code-point
   NFC form and similar non-standardized cases that nonstandard users
   might want handled.)
4. The "42" approach: sadly, I can't think deeply enough to explain it.
There are probably others.
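Approach 3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed implementation: it assumes Python 3.3 or later (so iterating a str yields code points), it allocates slots on demand from the BMP Private Use Area (U+E000 through U+F8FF, which is where the figure of 6400 comes from), and it assumes the input does not already use private-use code points. The class name PrivateUseMapper and its methods are hypothetical.

```python
PUA_START, PUA_END = 0xE000, 0xF8FF  # BMP Private Use Area: 6400 code points


class PrivateUseMapper:
    """Map non-BMP characters to private-use code points and back.

    Assumption: the text being encoded contains no private-use code
    points of its own; otherwise decode() could not tell them apart.
    """

    def __init__(self):
        self._to_pua = {}    # non-BMP character -> private-use character
        self._from_pua = {}  # private-use character -> non-BMP character

    def _assign(self, ch):
        # Allocate the next free private-use slot the first time ch is seen.
        if ch not in self._to_pua:
            code = PUA_START + len(self._to_pua)
            if code > PUA_END:
                raise OverflowError("more than 6400 distinct non-BMP characters")
            pua = chr(code)
            self._to_pua[ch] = pua
            self._from_pua[pua] = ch
        return self._to_pua[ch]

    def encode(self, text):
        # Every character in the result fits in a single 16-bit unit.
        return "".join(self._assign(ch) if ord(ch) > 0xFFFF else ch
                       for ch in text)

    def decode(self, text):
        # Restore the original characters; unmapped characters pass through.
        return "".join(self._from_pua.get(ch, ch) for ch in text)


mapper = PrivateUseMapper()
original = "a\U0001D11Eb\U0001D11E"
encoded = mapper.encode(original)
restored = mapper.decode(encoded)
```

With such a mapping, len(), indexing, and slicing on the encoded string behave per character even in a 16-bit representation, at the cost of carrying the mapping table alongside the data. The same table could in principle hold multi-code-point sequences, as the parenthetical in item 3 suggests, but that is not shown here.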
It's true that Python is going to need good libraries to provide
correct handling of Unicode strings (as opposed to unicode objects).
But it's not clear to me, given the wide variety of implementations I
can imagine, that there will be one best implementation, let alone
which ones are good and Pythonic and which are not.