[I18n-sig] How does Python Unicode treat surrogates?

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Sat, 23 Jun 2001 10:26:26 +0200


[Uche]
> Sure.  I admit it's hearsay, but I thought I'd read that because Java
> Unicode is or was underspecified, that there was the possibility of
> transposition of the high-surrogate with the low-surrogate character
> between Java implementations or platforms.

I've tried to find out what problem that could be. So far, I found:

http://developer.java.sun.com/developer/bugParade/bugs/4344266.html

Here, they complain that the codecs don't properly handle surrogate
pairs that straddle invocations of convert, and don't properly check
for incorrect surrogate pairs. There is a bug report on SourceForge
saying that Python has similar problems.
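
To illustrate what "straddling invocations" means, here is a minimal
sketch. It assumes a current Python 3 interpreter and its
codecs.getincrementaldecoder API, not the Python discussed in this
thread. The input is cut between the high and the low surrogate, and
a stateful decoder has to carry the first half over to the next call.

```python
import codecs

# U+10000 is encoded in UTF-16 as the surrogate pair D800 DC00.
text = "a\U00010000b"
data = text.encode("utf-16-le")  # 2 + 4 + 2 = 8 bytes

# Cut the input between the high and the low surrogate.
first, second = data[:4], data[4:]

# A stateful decoder keeps the dangling high surrogate until the
# low surrogate arrives in the next call.
decoder = codecs.getincrementaldecoder("utf-16-le")()
out = decoder.decode(first)
out += decoder.decode(second, final=True)

# With no further input coming, the same dangling high surrogate
# is an error rather than something to wait for.
try:
    data[:4].decode("utf-16-le")
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
```

A converter that is stateless across calls would have to either reject
the first chunk or emit a lone surrogate, which is roughly the class of
problem the bug report describes.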

http://developer.java.sun.com/developer/bugParade/bugs/4328816.html

summarizes problems with surrogates in UTF-8 that have been fixed;
again, similar problems are probably present in Python.
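
As a rough sketch of the UTF-8 side of this, assuming the strict
"utf-8" codec of a current Python 3 (the helper names encodes_ok and
decodes_ok are made up for the example): a character outside the BMP
has to become one four-byte sequence, and surrogates encoded one by
one as three-byte sequences are not valid UTF-8.

```python
def encodes_ok(s):
    """Return True if s can be encoded with the strict UTF-8 codec."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

def decodes_ok(b):
    """Return True if b can be decoded with the strict UTF-8 codec."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

lone_high = "\ud800"      # a high surrogate with no partner
astral = "\U00010000"     # first code point above the BMP

# Correct UTF-8 for U+10000 is a single four-byte sequence.
four_byte = astral.encode("utf-8")

# D800 and DC00 each encoded as a separate three-byte sequence.
six_byte = b"\xed\xa0\x80\xed\xb0\x80"
```

Here encodes_ok(lone_high) and decodes_ok(six_byte) are both False,
while the four-byte form round-trips.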

There were also a few bug reports about surrogates working differently
depending on locale (fail in zh_CN, pass in C), and type of virtual
machine (fail in classic, pass in hotspot).

I could not find any report on a bug where surrogates are output in
incorrect order.

[Guido]
> On the XML sig the following exchange happened.  I don't know enough
> about the issues to investigate, but I'm sure that someone here can
> provide insight?  It seems to boil down to whether or not surrogates
> may get transposed when between platforms.

I very much doubt this could ever happen.
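
A small sketch of why, assuming a current Python 3 and using two
helper functions invented for the example (utf16_units and
surrogate_pair): the byte order of a platform changes the order of the
two bytes inside each 16-bit code unit, but not the order of the code
units themselves, so the high surrogate still comes first.

```python
import struct

def utf16_units(text, byteorder):
    """Return the 16-bit code units of text; byteorder is 'le' or 'be'."""
    data = text.encode("utf-16-" + byteorder)
    prefix = "<" if byteorder == "le" else ">"
    return struct.unpack(prefix + "H" * (len(data) // 2), data)

def surrogate_pair(cp):
    """Compute (high, low) surrogates for a code point above U+FFFF."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

cp = 0x1D11E  # MUSICAL SYMBOL G CLEF
le_bytes = chr(cp).encode("utf-16-le")
be_bytes = chr(cp).encode("utf-16-be")

# The raw bytes differ between the two byte orders ...
bytes_differ = le_bytes != be_bytes

# ... but the sequence of code units is the same: high, then low.
le_units = utf16_units(chr(cp), "le")
be_units = utf16_units(chr(cp), "be")
```

Reading little-endian data as big-endian would garble each code unit
rather than swap the two surrogates.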

Regards,
Martin