[I18n-sig] Re: How does Python Unicode treat surrogates?
Gaute B Strokkenes
gs234@cam.ac.uk
25 Jun 2001 13:03:31 +0100
[I'm cc:-ing the unicode list to make sure that I've gotten my
terminology right, and to solicit comments.]
On Mon, 25 Jun 2001, mal@lemburg.com wrote:
> Tim Peters wrote:
>>
>> [M.-A. Lemburg]
>> > ...
>> > 2. What to do when slicing of Unicode strings would break
>> > a surrogate pair ?
>>
>> To me a string is a sequence of characters, and s[0] returns the
>> first, s[1] the second, and so on. The internal details of how the
>> implementation chooses to torture itself <0.7 wink> should be
>> invisible. That is, breaking a surrogate via slicing should be
>> impossible: s[i:j] returns j-i characters, and that's that.
>
> It's not that simple: lone surrogates are true Unicode char points
> in their own right; it's just that they are pretty useless without
> their resp. partners in the data stream. And with this "feature"
> they are in good company: the Unicode combining characters (e.g. the
> combining acute) have the same property.
This is completely and totally wrong. The Unicode standard version
3.1 states (conformance requirement C12(c)):

    A conformant process shall not interpret illegal UTF code unit
    sequences as characters.
The precise definition of "illegal" in this context is given
elsewhere. See <http://www.unicode.org/unicode/reports/tr17/>:

    0xD800 is incomplete in Unicode. Unless followed by another 16-bit
    value of the right form, it is illegal.
(Unicode here should read UTF-16, of course. The reason it does not
is that the language of the technical report has not been updated to
that of 3.1.)
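
To make the rule concrete, here is a minimal sketch of a check for
illegal UTF-16 code unit sequences. It assumes the input is already a
sequence of 16-bit integers; the function name is made up for
illustration and is not part of Python or of any Unicode library.

```python
def is_well_formed_utf16(units):
    """Return True if no surrogate code unit in `units` is unpaired.

    `units` is assumed to be a sequence of integers in range 0..0xFFFF.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: legal only if a low surrogate follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # consume the whole pair as one character
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            return False
        i += 1
    return True
```

Under this check, [0xD800, 0xDC00] is accepted as one character, while
[0xD800] on its own, or [0xD800, 0x0041], is rejected rather than
interpreted as characters.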
--
Big Gaute http://www.srcf.ucam.org/~gs234/
Hello? Enema Bondage? I'm calling because I want to be happy, I guess..