[I18n-sig] Re: Unicode surrogates: just say no!

Gaute B Strokkenes gs234@cam.ac.uk
27 Jun 2001 00:15:26 +0100


On Tue, 26 Jun 2001, guido@digicool.com wrote:
> Let me use this as an excuse to start a discussion on how far we
> should go in ruling out illegal code points.
> 
> I think that *codecs* would be wise to be picky about illegal code
> points (except for the special UTF-16-pass-through option).
> 
> But I think that the *datatype implementation* should allow storage
> units to take every possible value, whether or not it's illegal
> according to Unicode, either in isolation or in context.  It's much
> easier to implement that way, and I believe that the checks ought to
> be in other tools.

I think it is a good idea to let users put any value that fits the
internal representation into a Python Unicode string, and that
unichr(n) for n > 0xFFFF should return a surrogate pair, i.e. a
Unicode string with len(unichr(n)) == 2, when UCS-2 is being used.
There are a few issues that need to be considered, however:

1) Sort order.  Unicode strings should sort in code point order.
   With UCS-4 this is easy; just compare the Py_UNICODE values one by
   one, as C's strcmp() does with bytes.  With UTF-16 it is more
   complicated when surrogates get involved.  Basically, you go
   through the strings being compared until you find the first
   difference.  If both code units at this point are ordinary BMP
   characters, or both are surrogates, just compare them as usual.
   However, if one is a BMP character and the other is a surrogate,
   you need to make sure that the string with the surrogate in it
   sorts after the one with the BMP character.  Straight comparison
   won't work, since the BMP characters in the range 0xE000-0xFFFF
   have numerical values greater than those of the surrogates
   (0xD800-0xDFFF).

   I believe that this is the right thing to do when Py_UNICODE is
   UCS-2 since the added complexity is only O(1) per string comparison
   and is very easy to implement.  This will ensure that
   cmp(unichr(0xFFFD), unichr(0x10ABCD)) will work consistently and
   correctly for both UCS-2 and UCS-4.

2) There is an incompatibility between the two approaches since
   unichr(high surrogate) + unichr(low surrogate) will magically be
   the same as unichr(the appropriate astral codepoint) when UCS-2 is
   used.  With UCS-4 they will not; it will result in a string with
   two values that have no well-defined meaning.

   I don't think this is a show-stopper, but people will need to be
   made aware.
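
To make both points concrete, here is a rough Python sketch (not actual
CPython code).  It models a UCS-2 string as a list of 16-bit code
units.  The names to_utf16_units, sort_key and
compare_code_point_order are made up for illustration, and the
remapping in sort_key is just one way to get the ordering described in
1); it is an assumption, not taken from any existing implementation.

```python
# Sketch only: strings are modelled as lists of 16-bit code units.
# All function names here are hypothetical.

def to_utf16_units(code_point):
    """Return the UTF-16 code units (one or two ints) for a code point."""
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)    # low (trail) surrogate
    return [high, low]


def sort_key(unit):
    """Remap a code unit so that surrogates sort after the rest of the BMP."""
    if unit >= 0xE000:
        return unit - 0x800     # 0xE000..0xFFFF -> 0xD800..0xF7FF
    if unit >= 0xD800:
        return unit + 0x2000    # 0xD800..0xDFFF -> 0xF800..0xFFFF
    return unit


def compare_code_point_order(a, b):
    """Compare two code-unit lists; return -1, 0 or 1, like cmp()."""
    for x, y in zip(a, b):
        if x != y:
            # Only the first differing unit needs the remapping,
            # so the extra cost is O(1) per comparison.
            return -1 if sort_key(x) < sort_key(y) else 1
    # One list is a prefix of the other (or they are equal).
    return (len(a) > len(b)) - (len(a) < len(b))
```

With this, compare_code_point_order(to_utf16_units(0xFFFD),
to_utf16_units(0x10ABCD)) gives -1, whereas a straight comparison of
the code units would sort 0xFFFD last.  It also illustrates point 2):
the two units returned for 0x10ABCD are exactly a high surrogate
followed by a low surrogate, so on a UCS-2 build the concatenation of
the two is indistinguishable from the astral character.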

> PEP time?

Quite possibly...

-- 
Big Gaute                               http://www.srcf.ucam.org/~gs234/
..  does your DRESSING ROOM have enough ASPARAGUS?