[I18n-sig] Re: How does Python Unicode treat surrogates?

M.-A. Lemburg mal@lemburg.com
Mon, 25 Jun 2001 15:21:36 +0200


Gaute B Strokkenes wrote:
> 
> [I'm cc:-ing the unicode list to make sure that I've gotten my
> terminology right, and to solicit comments
> 
> On Mon, 25 Jun 2001, mal@lemburg.com wrote:
> > Tim Peters wrote:
> >>
> >> [M.-A. Lemburg]
> >> > ...
> >> > 2. What to do when slicing of Unicode strings would break
> >> >    a surrogate pair ?
> >>
> >> To me a string is a sequence of characters, and s[0] returns the
> >> first, s[1] the second, and so on.  The internal details of how the
> >> implementation chooses to torture itself <0.7 wink> should be
> >> invisible.  That is, breaking a surrogate via slicing should be
> >> impossible: s[i:j] returns j-i characters, and that's that.
> >
> > It's not that simple: lone surrogates are true Unicode char points
> > in their own right; it's just that they are pretty useless without
> > their resp. partners in the data stream. And with this "feature"
> > they are in good company: the Unicode combining characters (e.g. the
> > combining acute) have th same property.
> 
> This is completely and totally wrong.  The Unicode standard version
> 3.1 states (conformance requirement C12(c): A conformant process shall
> not interpret illegal UTF code unit sequences as characters.

This would solve the UTF codec issue, but I was talking about Unicode
itself. In Python, you can write u"abc\uD800\uDC00"[0:4] giving
u"abc\uD800" without getting an exception and I am not sure whether
this is correct or not.

The internal machinery is a totally different issue: we currently
use UTF-16 for this but have deliberatly left out the surrogate
support for the first implementation phase.
 
> The precise definition of "illegal" in this context is given
> elsewhere.  See <http://www.unicode.org/unicode/reports/tr17/>:
> 
>   0xD800 is incomplete in Unicode.  Unless followed by another 16-bit
>   value of the right form, it is illegal.
> 
> (Unicode here should read UTF-16, off course.  The reason it does not
> is that the language of the technical report has not been updated to
> that of 3.1)

If you would have left it at "Unicode" I would have felt
better ;-)

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/