[I18n-sig] How does Python Unicode treat surrogates?

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Tue, 26 Jun 2001 01:47:18 +0200


> If this is the only thing that keeps us from having a configuration
> OPTION to make Py_UNICODE 32-bit wide, I'd say let's fix it.

I think there are numerous places which assume sizeof(Py_UNICODE)==2,
including, but not limited to, sre.

> But UTF-16 vs. UCS-4 is not an implementation detail!
> 
> If we store 4 bytes per character, we should treat surrogates
> differently.  I don't know where those would be converted -- probably
> in the UTF-16 to UCS-4 codec.

Indeed, they would never appear in a 32-bit Unicode string.
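
As a rough sketch of that conversion step (the function name
decode_utf16_units is hypothetical, not anything in the Python source;
it assumes only the standard UTF-16 surrogate ranges), a codec might
combine the pairs along these lines:

```python
def decode_utf16_units(units):
    """Combine 16-bit UTF-16 code units into full code points.

    Hypothetical helper for illustration only; not a CPython API.
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # High surrogate followed by low surrogate: one code point
            # above U+FFFF.
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # A lone surrogate is not valid UTF-16; rejecting it is one
            # possible policy, assumed here.
            raise ValueError("unpaired surrogate at index %d" % i)
        else:
            # Ordinary BMP code unit, copied through unchanged.
            out.append(u)
            i += 1
    return out
```

With this, the units 0xD83D, 0xDE00 come out as the single value
0x1F600, so the resulting 32-bit string contains no surrogates.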

> > This is different: ISO 10646 is a competing standard, not just a 
> > different encoding.
> 
> Oh.  I didn't know.  How does it differ from Unicode?  What's the user
> acceptance?

To my knowledge, it differs only in minor points, and those differences
are caused only by different release dates (at one time Unicode is
behind, at another time the ISO standard is).

End users typically view it as Unicode, whereas standards bodies and
agencies typically view it as ISO 10646 (e.g. C, C++, and POSIX all
refer to ISO 10646, whereas Microsoft refers to Unicode).

Regards,
Martin