[I18n-sig] Re: How does Python Unicode treat surrogates?
M.-A. Lemburg
mal@lemburg.com
Mon, 25 Jun 2001 15:21:36 +0200
Gaute B Strokkenes wrote:
>
> [I'm cc:-ing the unicode list to make sure that I've gotten my
> terminology right, and to solicit comments.]
>
> On Mon, 25 Jun 2001, mal@lemburg.com wrote:
> > Tim Peters wrote:
> >>
> >> [M.-A. Lemburg]
> >> > ...
> >> > 2. What to do when slicing of Unicode strings would break
> >> > a surrogate pair ?
> >>
> >> To me a string is a sequence of characters, and s[0] returns the
> >> first, s[1] the second, and so on. The internal details of how the
> >> implementation chooses to torture itself <0.7 wink> should be
> >> invisible. That is, breaking a surrogate via slicing should be
> >> impossible: s[i:j] returns j-i characters, and that's that.
> >
> > It's not that simple: lone surrogates are true Unicode char points
> > in their own right; it's just that they are pretty useless without
> > their resp. partners in the data stream. And with this "feature"
> > they are in good company: the Unicode combining characters (e.g. the
> > combining acute) have the same property.
>
> This is completely and totally wrong. The Unicode standard version
> 3.1 states (conformance requirement C12(c)): A conformant process shall
> not interpret illegal UTF code unit sequences as characters.
This would solve the UTF codec issue, but I was talking about Unicode
itself. In Python, you can write u"abc\uD800\uDC00"[0:4], which gives
u"abc\uD800" without raising an exception, and I am not sure whether
this is correct.
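To make the slicing case concrete, here is a minimal sketch. It assumes
a Python 3 interpreter, where str literals take the place of the u"..."
literals above; the helper name can_encode_utf8 is made up for this
illustration. It shows that the slice succeeds silently and that the
problem only surfaces once the result is handed to a UTF codec.

```python
# Python 3 sketch; the message above used Python 2 u"..." literals.
# The two escapes stay separate surrogate code points; they are not
# merged into U+10000.
s = "abc\ud800\udc00"

# Slicing works on code points and raises no exception, even though
# it leaves a lone high surrogate at the end.
head = s[0:4]


def can_encode_utf8(text):
    """Return True if text encodes to UTF-8 (hypothetical helper)."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        # The strict UTF-8 codec rejects surrogate code points.
        return False
    return True


if __name__ == "__main__":
    print(len(s), repr(head), can_encode_utf8(head))
```

The slice itself is accepted; only the codec objects, which matches the
split between the "codec issue" and "Unicode itself" made above.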
The internal machinery is a totally different issue: we currently
use UTF-16 for this but have deliberately left out the surrogate
support for the first implementation phase.
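For reference, the surrogate arithmetic that such support would rest on
can be sketched as follows. This is only an illustration of the UTF-16
mapping, not the interpreter's internal code; the function names
to_surrogate_pair and from_surrogate_pair are hypothetical.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x10000)
    print(hex(high), hex(low))
```

With this mapping, U+10000 corresponds to the pair D800 DC00, the same
two code units used in the slicing example.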
> The precise definition of "illegal" in this context is given
> elsewhere. See <http://www.unicode.org/unicode/reports/tr17/>:
>
> 0xD800 is incomplete in Unicode. Unless followed by another 16-bit
> value of the right form, it is illegal.
>
> (Unicode here should read UTF-16, of course. The reason it does not
> is that the language of the technical report has not been updated to
> that of 3.1)
If you had left it at "Unicode" I would have felt
better ;-)
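The rule quoted from TR17 (0xD800 is illegal unless followed by another
16-bit value of the right form) can be sketched as a check over a
sequence of 16-bit code units. This is an illustration under my reading
of that rule; the name is_well_formed_utf16 is hypothetical and is not
an existing Python API.

```python
def is_well_formed_utf16(units):
    """Return True if every surrogate in units is part of a valid pair."""
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= n or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return False
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return False
        else:
            i += 1
    return True


if __name__ == "__main__":
    print(is_well_formed_utf16([0x61, 0x62, 0x63, 0xD800, 0xDC00]))
    print(is_well_formed_utf16([0x61, 0x62, 0x63, 0xD800]))
```

The second call corresponds to the sliced string u"abc\uD800" seen as
UTF-16 code units.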
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/