[I18n-sig] How does Python Unicode treat surrogates?
Martin v. Loewis
martin@loewis.home.cs.tu-berlin.de
Sat, 23 Jun 2001 10:26:26 +0200
[Uche]
> Sure. I admit it's hearsay, but I thought I'd read that because Java
> Unicode is or was underspecified, that there was the possibility of
> transposition of the high-surrogate with the low-surrogate character
> between Java implementations or platforms.
I've tried to find out what problem that could be. So far, I found
http://developer.java.sun.com/developer/bugParade/bugs/4344266.html
Here, they complain that the codecs don't properly check for
surrogates that straddle invocations of convert, or get incorrect
surrogate pairs. There is a bug report on SourceForge saying that
Python has similar problems.
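To illustrate what "straddling invocations" means, here is a minimal sketch. It assumes a present-day Python 3 interpreter and its incremental decoder API, which is not the codec machinery under discussion in this thread; the variable names are only illustrative. The input is cut between the two surrogates of one pair, so a correct stateful converter has to hold back the first half until the second arrives.

```python
import codecs

# U+10000 in UTF-16-LE is the surrogate pair D800 DC00.
data = "\U00010000".encode("utf-16-le")   # b'\x00\xd8\x00\xdc'

# Cut the input between the high and the low surrogate, as if the
# pair had arrived in two separate calls to the converter.
first, second = data[:2], data[2:]

# A stateful (incremental) decoder buffers the dangling high surrogate.
decoder = codecs.getincrementaldecoder("utf-16-le")()
out1 = decoder.decode(first)               # nothing emitted yet
out2 = decoder.decode(second, final=True)  # pair completed here

# A stateless decode of the first chunk alone cannot succeed.
try:
    first.decode("utf-16-le")
    stateless_ok = True
except UnicodeDecodeError:
    stateless_ok = False

print(repr(out1), repr(out2), stateless_ok)
```

A converter that treats each call as independent would either reject the first chunk or emit an unpaired surrogate, which is the kind of failure the Java report describes.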
http://developer.java.sun.com/developer/bugParade/bugs/4328816.html
summarizes problems with surrogates in UTF-8 that have been fixed.
Again, similar problems are probably present in Python.
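As a rough illustration of the UTF-8 side of this, the sketch below shows the usual rule: a character outside the BMP is encoded as one four-byte sequence, not as two three-byte sequences, one per surrogate. It assumes a present-day Python 3 interpreter, whose codec is strict about this; it says nothing about how the Python or Java codecs of the time behaved.

```python
# U+10000 must be encoded in UTF-8 as a single four-byte sequence.
supplementary = "\U00010000"
utf8 = supplementary.encode("utf-8")       # b'\xf0\x90\x80\x80'

# The same character written as two three-byte sequences, one per
# surrogate (D800, then DC00). This is not well-formed UTF-8.
per_surrogate = b"\xed\xa0\x80\xed\xb0\x80"
try:
    per_surrogate.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False

# An unpaired surrogate cannot be encoded to UTF-8 either.
try:
    "\ud800".encode("utf-8")
    lone_ok = True
except UnicodeEncodeError:
    lone_ok = False

print(utf8, accepted, lone_ok)
```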
There were also a few bug reports about surrogates working differently
depending on locale (fail in zh_CN, pass in C), and type of virtual
machine (fail in classic, pass in hotspot).
I could not find any report on a bug where surrogates are output in
incorrect order.
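For reference, the order is fixed by the UTF-16 encoding itself: the high surrogate (D800..DBFF) always comes first, the low surrogate (DC00..DFFF) second. Only the byte order inside each 16-bit unit depends on endianness. The sketch below works this out for U+1D11E; the helper name to_surrogates is made up for this example, and the code assumes a present-day Python 3 interpreter.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into (high, low) surrogates.

    Hypothetical helper, written only for this illustration.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1D11E)
print(hex(high), hex(low))

# Endianness swaps the bytes within each 16-bit unit, but the high
# surrogate still precedes the low surrogate in both encodings.
big = "\U0001D11E".encode("utf-16-be")
little = "\U0001D11E".encode("utf-16-le")
print(big.hex(), little.hex())
```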
[Guido]
> On the XML sig the following exchange happened. I don't know enough
> about the issues to investigate, but I'm sure that someone here can
> provide insight? It seems to boil down to whether or not surrogates
> may get transposed when between platforms.
I very much doubt this could ever happen.
Regards,
Martin