[I18n-sig] Re: How does Python Unicode treat surrogates?

Machin, John JMachin@Colonial.com.au
Mon, 25 Jun 2001 22:33:50 +1000


MAL and Gaute,

Can I please take the middle ground (and risk having both of you throw
things at me)?

=> Lone surrogates are not 'true Unicode char points
 in their own right' [MAL] -- they don't represent characters. 

On the other hand, UTF code sequences that would decode into lone surrogates
are not "illegal". Please read clause D29 in section 3.8 of the Unicode 3.0
standard. This is further clarified by Unicode 3.1, which expressly lists
legal UTF-8 sequences; these encompass lone surrogates.
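
To make the point concrete, here is a minimal sketch of what a lone
surrogate looks like as a UTF-8 byte sequence. It assumes a present-day
Python 3 interpreter, which is not what this thread was discussing: there
the strict UTF-8 codec refuses surrogates, and the "surrogatepass" error
handler has to be requested to get the permissive three-byte form. The
variable names are only illustrative.

```python
lone = "\ud800"  # a single high surrogate code point with no partner

# Assumption: Python 3 strict UTF-8 codec, which rejects surrogates.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The 'surrogatepass' handler emits the ordinary three-byte pattern
# for a code point in the range U+0800..U+FFFF.
raw = lone.encode("utf-8", "surrogatepass")

# The same handler is needed to decode those bytes back again.
round_trip = raw.decode("utf-8", "surrogatepass")
```

So the byte sequence is well-formed in shape; whether a codec accepts it is
a policy decision about surrogates, which is exactly what is being argued
over here.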


-----Original Message-----
From: Gaute B Strokkenes [mailto:gs234@cam.ac.uk]
Sent: Monday, 25 June 2001 22:04
To: M.-A. Lemburg
Cc: Tim Peters; i18n-sig@python.org; unicode@unicode.org
Subject: [I18n-sig] Re: How does Python Unicode treat surrogates?



[I'm cc:-ing the unicode list to make sure that I've gotten my
terminology right, and to solicit comments.]

On Mon, 25 Jun 2001, mal@lemburg.com wrote:
> Tim Peters wrote:
>> 
>> [M.-A. Lemburg]
>> > ...
>> > 2. What to do when slicing of Unicode strings would break
>> >    a surrogate pair ?
>> 
>> To me a string is a sequence of characters, and s[0] returns the
>> first, s[1] the second, and so on.  The internal details of how the
>> implementation chooses to torture itself <0.7 wink> should be
>> invisible.  That is, breaking a surrogate via slicing should be
>> impossible: s[i:j] returns j-i characters, and that's that.
> 
> It's not that simple: lone surrogates are true Unicode char points
> in their own right; it's just that they are pretty useless without
> their resp. partners in the data stream. And with this "feature"
> they are in good company: the Unicode combining characters (e.g. the
> combining acute) have the same property.

This is completely and totally wrong.  The Unicode standard version
3.1 states (conformance requirement C12(c)): A conformant process shall
not interpret illegal UTF code unit sequences as characters.

The precise definition of "illegal" in this context is given
elsewhere.  See <http://www.unicode.org/unicode/reports/tr17/>:

  0xD800 is incomplete in Unicode.  Unless followed by another 16-bit
  value of the right form, it is illegal.

(Unicode here should read UTF-16, of course.  The reason it does not
is that the language of the technical report has not been updated to
that of 3.1)
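
The rule quoted above can be sketched as a small decoder over 16-bit code
units. This is only an illustration under stated assumptions: the function
name decode_utf16_units is hypothetical, the input is taken to be a list of
integers, and raising an error is just one possible way for a process to
decline to interpret an unpaired surrogate as a character.

```python
def decode_utf16_units(units):
    """Convert 16-bit code units to code points, rejecting lone surrogates."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: only valid if a low surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                lo = units[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00))
                i += 2
                continue
            raise ValueError("unpaired high surrogate at index %d" % i)
        if 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            raise ValueError("unpaired low surrogate at index %d" % i)
        out.append(u)
        i += 1
    return out
```

For example, decode_utf16_units([0xD800, 0xDC00]) gives [0x10000], while
decode_utf16_units([0xD800]) raises ValueError.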

-- 
Big Gaute                               http://www.srcf.ucam.org/~gs234/
Hello?  Enema Bondage?  I'm calling because I want to be happy, I guess..


