[I18n-sig] Unicode surrogates: just say no!

Paul Prescod paulp@ActiveState.com
Tue, 26 Jun 2001 13:31:08 -0700

Guido van Rossum wrote:
> I expect that not all Unicode users will be ready to embrace UCS-4.  I
> don't want to hear people say "I don't want to upgrade to Python 2.2
> because it wastes 4 bytes per Unicode character, but all I ever do is
> bandy around basic plane characters."  Given that there's currently
> very limited need for characters outside the basic plane, I want to be
> able to say that Python 2.2 is UCS-4 ready, but not that it always
> uses it.

I'm not dead-set against this, but I want to point out that binary
distributors are probably not going to bother shipping two different
binaries. So the silent majority of Python users who download
precompiled binaries are going to have a "flag day" where Python changes
its default behavior.

Given infinite resources, I'd rather see "best of both worlds"
implementations such as a flag on the Unicode object that chooses its
internal representation (i.e., a speed tweak for the knowledgeable) or
objects that "fall back" from ASCII to UCS-2 to UCS-4 depending on the
input data. Or even a unicode32() data type that was interoperable with
unicode16 (and the default could change from one to the other someday).
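
To make the "fall back" idea concrete, here is a rough sketch of how an
object might pick the narrowest fixed-width storage that fits its data.
It is written in present-day Python for illustration only; the names
narrowest_width, pack and unpack are hypothetical and do not correspond
to anything in the interpreter, where the real work would be in C.

```python
def narrowest_width(text):
    """Return the bytes per character needed to store text at fixed width."""
    highest = max(map(ord, text), default=0)
    if highest < 0x80:
        return 1  # pure ASCII
    if highest < 0x10000:
        return 2  # fits in UCS-2 (basic plane only)
    return 4      # needs UCS-4


def pack(text):
    """Store text as (width, bytes), using the narrowest width that fits."""
    width = narrowest_width(text)
    data = b"".join(ord(ch).to_bytes(width, "little") for ch in text)
    return width, data


def unpack(width, data):
    """Rebuild the string from a (width, bytes) pair produced by pack()."""
    return "".join(
        chr(int.from_bytes(data[i:i + width], "little"))
        for i in range(0, len(data), width)
    )


if __name__ == "__main__":
    for sample in ("abc", "\u20ac100", "\U00010400"):
        width, data = pack(sample)
        print(repr(sample), width, len(data))
```

The caller never sees the width; only the storage cost changes, which is
the point of the proposal. The price is that every operation combining
two strings has to cope with operands of different widths.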

I accept that in a world of finite resources there may be nobody
interested enough to put in that effort, but I'd rather see the option
excluded on that basis than just because the code becomes more
complex. The code complexity would be worth it if it prevents a minor
fork in Python and varying behavior on different Pythons.