[I18n-sig] Re: Unicode surrogates: just say no!
Gaute B Strokkenes
gs234@cam.ac.uk
27 Jun 2001 00:15:26 +0100
On Tue, 26 Jun 2001, guido@digicool.com wrote:
> Let me use this as an excuse to start a discussion on how far we
> should go in ruling out illegal code points.
>
> I think that *codecs* would be wise to be picky about illegal code
> points (except for the special UTF-16-pass-through option).
>
> But I think that the *datatype implementation* should allow storage
> units to take every possible value, whether or not it's illegal
> according to Unicode, either in isolation or in context. It's much
> easier to implement that way, and I believe that the checks ought to
> be in other tools.
I think that it is a good idea to allow users to stick any scalar
value that will fit into the internal representation into a Python
Unicode string, and that unichr(n) for n > 0xFFFF should return a
Unicode string with len(unichr(n)) == 2 (a surrogate pair) when UCS-2
is being used. There are a few issues that need to be considered,
however:
1) Sort order. Unicode strings should sort in Unicode lexicographical
order. With UCS-4 this is easy; just compare the Py_UNICODE values
one by one like C does with strcmp(). With UTF-16 this is more
complicated when surrogates get involved. Basically, you go
through the strings being compared until you find the first
difference. If both values at this point are ordinary BMP characters
or both are surrogates, just compare them as usual. However, if
one is in the BMP and the other is a surrogate, you need to make
sure that the string with the surrogate in it sorts after the one
with the BMP character. Straight comparison won't work since there
are characters in the BMP with numerical values greater than those
of surrogates.
I believe that this is the right thing to do when Py_UNICODE is
UCS-2 since the added complexity is only O(1) per string comparison
and is very easy to implement. This will ensure that
cmp(unichr(0xFFFD), unichr(0x10ABCD)) will work consistently and
correctly for both UCS-2 and UCS-4.
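The comparison rule in 1) can be sketched roughly as follows. This is
only an illustration, not the actual Py_UNICODE implementation: it is
written in Python 3 (where unichr() is spelled chr()), it works on
lists of 16-bit values, and the names is_surrogate,
compare_code_point_order and to_utf16_units are made up for the
example. It assumes well-formed UTF-16 input.

```python
def is_surrogate(unit):
    """True if a 16-bit value lies in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF


def compare_code_point_order(a, b):
    """Compare two sequences of 16-bit values in code point order.

    Returns -1, 0 or 1, like cmp(). Hypothetical helper for illustration.
    """
    for x, y in zip(a, b):
        if x != y:
            # First difference found. If exactly one side is a surrogate,
            # that side encodes an astral character and must sort last,
            # even though BMP values E000..FFFF are numerically larger.
            if is_surrogate(x) != is_surrogate(y):
                return 1 if is_surrogate(x) else -1
            # Both plain BMP or both surrogates: compare as usual.
            return -1 if x < y else 1
    # One sequence is a prefix of the other: the shorter one sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


def to_utf16_units(s):
    """Encode a str as a list of 16-bit UTF-16 values (illustration only)."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


bmp = to_utf16_units("\uFFFD")            # one value: 0xFFFD
astral = to_utf16_units(chr(0x10ABCD))    # two values: a surrogate pair
result = compare_code_point_order(bmp, astral)
```

The extra work happens once, at the first differing position, which
is why the added cost is constant per comparison.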
2) There is an incompatibility between the two approaches since
unichr(high surrogate) + unichr(low surrogate) will magically be
the same as unichr(the appropriate astral code point) when UCS-2 is
used. With UCS-4 they will not; it will result in a string with
two values that have no well-defined meaning.
I don't think this is a show-stopper, but people will need to be
made aware.
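To make the incompatibility in 2) concrete, here is a small sketch of
the surrogate arithmetic, written in Python 3. The helper names
split_astral and join_surrogates are invented for the example. Python
3 strings index by code point, so they are used here as a stand-in for
the UCS-4 behaviour; that substitution is an assumption of this
sketch, not something taken from the 2001 code base.

```python
def split_astral(cp):
    """Split a code point above 0xFFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not an astral code point: %#x" % cp)
    offset = cp - 0x10000                    # 20-bit value
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return high, low


def join_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = split_astral(0x10ABCD)
roundtrip = join_surrogates(high, low)

# With UCS-4 style storage the two surrogates stay separate values:
pair = chr(high) + chr(low)
same_as_astral = (pair == chr(0x10ABCD))
```

With UCS-2 storage the pair and the astral character would be the
same two 16-bit values; with UCS-4 style storage, as above, they are
different strings.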
> PEP time?
Quite possibly...
--
Big Gaute http://www.srcf.ucam.org/~gs234/
.. does your DRESSING ROOM have enough ASPARAGUS?