[I18n-sig] How does Python Unicode treat surrogates?

Fredrik Lundh fredrik@pythonware.com
Tue, 26 Jun 2001 09:05:07 +0200


mvl wrote:

> With Fredrik's solution, you'd have to rebuild your Python interpreter
> with a 32-bit Unicode type to represent the characters. With that
> option, we'd delegate the decision to administrators and Python
> distributors. If their users demand support for the additional
> characters, they will need to consider wasting space.

my suggestion is to prepare the Unicode subsystem for
sizeof(Py_UNICODE) >= 4 *today*, and make the switch
to UCS-4 when the time is right [1].

UTF-16 is an encoding format, not a storage format, so as
long as sizeof(Py_UNICODE) is 2, there will be no support for
surrogates beyond what's already in there [2].
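
to make the encoding-vs-storage distinction concrete, here is a
rough sketch in present-day Python 3 (not the 2001 interpreter
discussed in this thread). the helper names utf16_units and
ucs4_units are made up for illustration; they only count how many
fixed-width code units each form needs for the same text:

```python
# sketch only: utf16_units / ucs4_units are hypothetical helpers,
# written for Python 3, not part of any Python API.

def utf16_units(text):
    """Count the 16-bit code units needed to hold text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

def ucs4_units(text):
    """Count the 32-bit code units needed to hold text as UCS-4."""
    return len(text.encode("utf-32-le")) // 4

bmp = "\u20ac"         # EURO SIGN, inside the Basic Multilingual Plane
astral = "\U0010ffff"  # highest code point, outside the BMP

print(utf16_units(bmp), ucs4_units(bmp))        # 1 1
print(utf16_units(astral), ucs4_units(astral))  # 2 1
```

with a 16-bit unit, the non-BMP character occupies two slots (a
surrogate pair), so indexing and length are in code units; with a
32-bit unit, one slot per character, at twice the space for BMP
text.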

</F>

1) imho, that time is "as soon as the unicode subsystem
is ready".

2) the \U escape, plus some codecs, already support it:

>>> u"\U0010ffff"
u'\uDBFF\uDFFF'
>>> unicode("\xf4\x8f\xbf\xbf", "utf-8")
u'\uDBFF\uDFFF'
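
the pair shown above follows from the standard UTF-16 surrogate
arithmetic. a minimal sketch in present-day Python 3, assuming a
made-up helper name to_surrogates (it is not a Python API):

```python
# sketch only: to_surrogates is a hypothetical helper for illustration.

def to_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary planes")
    offset = cp - 0x10000             # 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

high, low = to_surrogates(0x10FFFF)
print("%04X %04X" % (high, low))      # DBFF DFFF

# the UTF-8 bytes from the second example decode to the same code point
decoded = ord(b"\xf4\x8f\xbf\xbf".decode("utf-8"))
print(hex(decoded))                   # 0x10ffff
```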