[I18n-sig] Re: How does Python Unicode treat surrogates?

Gaute B Strokkenes gs234@cam.ac.uk
25 Jun 2001 13:03:31 +0100

[I'm cc:-ing the unicode list to make sure that I've gotten my
terminology right, and to solicit comments.]

On Mon, 25 Jun 2001, mal@lemburg.com wrote:
> Tim Peters wrote:
>> [M.-A. Lemburg]
>> > ...
>> > 2. What to do when slicing of Unicode strings would break
>> >    a surrogate pair ?
>> To me a string is a sequence of characters, and s[0] returns the
>> first, s[1] the second, and so on.  The internal details of how the
>> implementation chooses to torture itself <0.7 wink> should be
>> invisible.  That is, breaking a surrogate via slicing should be
>> impossible: s[i:j] returns j-i characters, and that's that.
> It's not that simple: lone surrogates are true Unicode char points
> in their own right; it's just that they are pretty useless without
> their resp. partners in the data stream. And with this "feature"
> they are in good company: the Unicode combining characters (e.g. the
> combining acute) have the same property.

This is completely and totally wrong.  The Unicode standard version
3.1 states (conformance requirement C12(c)): A conformant process shall
not interpret illegal UTF code unit sequences as characters.

The precise definition of "illegal" in this context is given
elsewhere.  See <http://www.unicode.org/unicode/reports/tr17/>:

  0xD800 is incomplete in Unicode.  Unless followed by another 16-bit
  value of the right form, it is illegal.

(Unicode here should read UTF-16, of course.  The reason it does not
is that the language of the technical report has not been updated to
that of 3.1.)
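
To make the rule concrete, here is a rough sketch of a check for
unpaired surrogates in a sequence of 16-bit code units.  The function
name find_lone_surrogates and the choice to take a plain list of
integers are my own assumptions for illustration; this is not how
Python's Unicode implementation works, and it only covers the
surrogate rule, not the other well-formedness requirements.

```python
def find_lone_surrogates(units):
    """Return the indices of unpaired surrogates in a list of 16-bit values.

    Hypothetical helper for illustration: `units` is assumed to be a
    list of integers in the range 0x0000..0xFFFF (UTF-16 code units).
    """
    lone = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: legal only if a low surrogate follows it.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # skip the complete pair
                continue
            lone.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate that was not consumed as part of a pair.
            lone.append(i)
        i += 1
    return lone
```

An empty result means that every surrogate in the sequence is properly
paired; a sequence consisting of just 0xD800 yields a one-element list,
matching the "incomplete" case quoted above.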

Big Gaute                               http://www.srcf.ucam.org/~gs234/
Hello?  Enema Bondage?  I'm calling because I want to be happy, I guess..