[I18n-sig] How does Python Unicode treat surrogates?

Guido van Rossum guido@digicool.com
Mon, 25 Jun 2001 13:42:29 -0400

> So what has been implemented is UCS-2, not UTF-16, and certainly not
> Unicode. Better to document u"" string literals as UCS-2, and not
> Unicode.

I'm sorry, but I don't see why it's UCS-2 any more or less than
UTF-16.  That's like arguing whether 8-bit strings contain ASCII or
UTF-8.  That's up to the application; the data type can be used for
either.
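
A minimal sketch of this point, assuming a modern Python 3 interpreter
(where str is indexed by code point, so the 16-bit values are modelled
explicitly with struct). The names raw, units and decoded are
illustrative only. The same two 16-bit values can be read as two
separate units or, under a UTF-16 interpretation, as one character:

```python
import struct

# U+10400 lies outside the 16-bit range; its UTF-16 form is a surrogate pair.
raw = "\U00010400".encode("utf-16-be")

# View 1: plain 16-bit values, with no interpretation applied.
units = struct.unpack(">%dH" % (len(raw) // 2), raw)

# View 2: the same bytes interpreted as UTF-16 give one code point.
decoded = raw.decode("utf-16-be")

print([hex(u) for u in units])   # two 16-bit values
print(len(decoded), hex(ord(decoded)))
```

The storage is identical in both views; only the interpretation applied
by the application differs.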

> > It may change *eventually* -- when we switch to UCS-4 for the internal
> > representation.  Until then, the API will deal in 16-bit values that
> > may or may not be "characters".
> You don't need to switch to UCS-4 internally to implement what I'm
> suggesting.

But unless I misunderstand what it *is* that you are suggesting, the
O(1) indexing property can't be retained with your suggestion, and
that's out of the question.
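
To illustrate the cost, here is a rough sketch, assuming strings are
stored as a flat sequence of 16-bit code units. The helpers code_units
and code_point_at are hypothetical and not part of any Python API.
Indexing by code unit is a direct lookup, while indexing by code point
has to scan from the start, because each surrogate pair occupies two
units:

```python
def code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def code_point_at(units, index):
    """Return the index-th code point; this is a linear scan, not O(1)."""
    pos = 0
    count = 0
    while pos < len(units):
        unit = units[pos]
        cp, width = unit, 1
        # A high surrogate followed by a low surrogate forms one code point.
        if (0xD800 <= unit <= 0xDBFF and pos + 1 < len(units)
                and 0xDC00 <= units[pos + 1] <= 0xDFFF):
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[pos + 1] - 0xDC00)
            width = 2
        if count == index:
            return cp
        pos += width
        count += 1
    raise IndexError(index)


units = code_units("a\U00010400b")
print(hex(units[1]))                  # O(1): a lone high surrogate
print(hex(code_point_at(units, 1)))   # O(n): the full code point
```

An implementation could cache offsets to speed this up, but that is
extra machinery beyond a plain array of fixed-width values.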

> > I'd say that ideally the choice to have a 2 or 4 byte internal
> > representation (or no Unicode support at all, for some platforms like
> > PalmOS!) should be a configuration choice.
> I don't think it should be a configuration choice. That leads to
> incompatibilities between people's scripts. It's bad enough already
> with some things working with threaded versions of python and some not
> (e.g., Zope requires threading, but mod_python doesn't work if it's
> turned on).

That turned out to be a myth, actually.  mod_python works fine with
threads on most platforms.

Anyway, code that specifically doesn't work when a particular feature
is turned *on* is rare.  Code that *requires* a specific feature is
common, of course, and I would think that Python's Unicode type is
useful as it is for applications that don't need the newer planes.
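
As a small sketch of what "don't need the newer planes" means in
practice, an application could check whether its text stays within the
16-bit range. The function name needs_supplementary_planes is
hypothetical, and the example assumes a modern Python 3 interpreter:

```python
def needs_supplementary_planes(text):
    """Return True if any character lies above U+FFFF."""
    return any(ord(ch) > 0xFFFF for ch in text)


print(needs_supplementary_planes("Gr\u00fc\u00dfe"))     # text within the 16-bit range
print(needs_supplementary_planes("\U00010400"))          # a character above U+FFFF
```

Text for which this returns False fits in 16-bit values with one value
per character.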

> BTW, Palm recently joined the Unicode Consortium, and Symbian has
> Unicode support.
> > Right now the implementation doesn't allow that choice at all, which
> > should be remedied -- maybe you can help by submitting patches?
> Touché.


--Guido van Rossum (home page: http://www.python.org/~guido/)