[I18n-sig] How does Python Unicode treat surrogates?

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Tue, 26 Jun 2001 02:18:55 +0200


> Fredrik Lundh writes:
> > I'm sceptical -- I see very little reason to maintain that distinction.
> > let's use either UCS-2 or UCS-4 for the internal storage, stick to the
> > "character strings are character sequences" concept, and keep the
> > UTF-16 surrogate issue where it belongs: in the codecs.
> 
> How then is u"\U00200000" represented internally if you use UCS-2 as
> the internal storage representation?

I think the obvious answer is: it is not supported. You get an
exception when you try to convert a UTF-8 or UTF-16 string that
contains such a character, and it is an error to pass a surrogate to
unichr or to write one in a \u literal.
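
To make the "not supported" behaviour concrete, here is a minimal
sketch in present-day Python. The names check_ucs2 and
decode_utf8_strict_ucs2 are hypothetical helpers invented for this
illustration; they are not part of Python, and the sketch only models
the proposed rule (reject anything above U+FFFF and any surrogate)
rather than any actual codec implementation.

```python
# Hypothetical model of a strict UCS-2 build: every stored unit must be
# a BMP code point that is not a surrogate. None of this is a real API.
BMP_MAX = 0xFFFF
SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF


def check_ucs2(code_point):
    """Return code_point if strict UCS-2 could store it, else raise ValueError."""
    if not 0 <= code_point <= BMP_MAX:
        raise ValueError("U+%04X is outside the BMP; not supported" % code_point)
    if SURROGATE_FIRST <= code_point <= SURROGATE_LAST:
        raise ValueError("U+%04X is a surrogate, not a character" % code_point)
    return code_point


def decode_utf8_strict_ucs2(data):
    """Decode UTF-8 bytes into a list of 16-bit units, or raise ValueError.

    Assumption: the standard "utf-8" codec does the byte-level decoding;
    this function only adds the UCS-2 range check on top of it.
    """
    text = data.decode("utf-8")
    return [check_ucs2(ord(ch)) for ch in text]
```

Under these assumptions, decode_utf8_strict_ucs2(b"abc") succeeds,
while input containing a character above U+FFFF raises ValueError, as
does check_ucs2(0xD800), which stands in for unichr or a \u literal
being handed a surrogate.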

That would simplify a lot, IMO; the only additional work would be
supporting a 32-bit Py_UNICODE.

Of course, that would have to be done as a per-platform choice, to
avoid binary-incompatible extension modules.

Regards,
Martin