[I18n-sig] Re: How does Python Unicode treat surrogates?

M.-A. Lemburg mal@lemburg.com
Mon, 25 Jun 2001 22:43:55 +0200

Rick McGowan wrote:
> Gaute B Strokkenes wrote...
> > [I'm cc:-ing the unicode list to make sure that I've gotten my
> > terminology right, and to solicit comments
> Interesting... I just started looking at Python the other day, once I
> discovered it has such nice built-in Unicode support.
> If Python is explicitly storing the stuff as UTF-16 in u"" strings, then
> slicing operations certainly should be acting on units of the backing
> store, just as for ASCII "character" strings.  In that case, in order for
> every unit to be addressable, it should allow breaking up of surrogate
> pairs.  (Apple's Cocoa environment strings work the same way with
> "ranges".)  There should be another operation, or several, that slice up
> strings based on other kinds of text element boundaries.  For example, a
> "slice on character boundaries" that would always shift the range to
> accommodate surrogate pairs -- as a separate operation.
> The low-level routines in Python, like slicing with absolute locations,
> shouldn't presume to know about the encoding, only about the UNITS that are
> in the "array".

Exactly my opinion. 
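
To make the distinction concrete, here is a rough sketch of the two
kinds of operation Rick describes. It is only an illustration under
some assumptions: the UTF-16 backing store is modelled as a plain list
of 16-bit integers (current Python versions index str by code point,
so the units have to be spelled out explicitly), and the names
utf16_units, slice_units and slice_on_char_boundaries are hypothetical,
not existing Python APIs. The second function widens the range rather
than narrowing it, which is just one possible reading of "shift the
range".

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def is_high(unit):
    # High (leading) surrogates occupy U+D800..U+DBFF.
    return 0xD800 <= unit <= 0xDBFF

def is_low(unit):
    # Low (trailing) surrogates occupy U+DC00..U+DFFF.
    return 0xDC00 <= unit <= 0xDFFF

def slice_units(units, start, end):
    """Low-level slice: knows only about units, so it may split a pair."""
    return units[start:end]

def slice_on_char_boundaries(units, start, end):
    """Hypothetical separate operation: widen the range so that no
    surrogate pair is cut in half (an assumed policy, see above)."""
    # start points at the low half of a pair: step back to the high half.
    if 0 < start < len(units) and is_low(units[start]) and is_high(units[start - 1]):
        start -= 1
    # end falls between the two halves of a pair: step past the low half.
    if 0 < end < len(units) and is_low(units[end]) and is_high(units[end - 1]):
        end += 1
    return units[start:end]

# "a", U+10400 (encoded as the pair D801 DC00), "b"
units = utf16_units("a\U00010400b")

split = slice_units(units, 0, 2)               # ends on a lone high surrogate
whole = slice_on_char_boundaries(units, 0, 2)  # pair kept intact
```

Here slice_units(units, 0, 2) returns the unit for "a" followed by a
lone high surrogate, while slice_on_char_boundaries(units, 0, 2)
extends the range by one unit so the pair stays together.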

Do you have references we could look at to determine which of these
boundary kinds would actually be useful in day-to-day programming?

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/