[I18n-sig] Re: How does Python Unicode treat surrogates?
M.-A. Lemburg
mal@lemburg.com
Mon, 25 Jun 2001 22:43:55 +0200
Rick McGowan wrote:
>
> Gaute B Strokkenes wrote...
>
> > [I'm cc:-ing the unicode list to make sure that I've gotten my
> > terminology right, and to solicit comments
>
> Interesting... I just started looking at Python the other day, once I
> discovered it has such nice built-in Unicode support.
>
> If Python is explicitly storing the stuff as UTF-16 in u"" strings, then
> slicing operations certainly should be acting on units of the backing
> store, just as for ASCII "character" strings. In that case, in order for
> every unit to be addressable, it should allow breaking up of surrogate
> pairs. (Apple's Cocoa environment strings work the same way with
> "ranges".) There should be another operation, or several, that slice up
> strings based on other kinds of text element boundaries. For example, a
> "slice on character boundaries" that would always shift the range to
> accommodate surrogate pairs -- as a separate operation.
>
> The low-level routines in Python, like slicing with absolute locations,
> shouldn't presume to know about the encoding, only about the UNITS that are
> in the "array".
Exactly my opinion.
Do you have references we could look at to determine which of these
boundary kinds would actually be useful in daily programming?
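
To make the distinction concrete, here is a minimal sketch of the two
kinds of operation discussed above: a low-level slice that only knows
about code units, and a separate slice that widens the range so that no
surrogate pair is split. The function names (to_utf16_units, slice_units,
slice_on_char_boundaries) are hypothetical and not part of Python; the
UTF-16 backing store is modelled as a plain list of integers so the
sketch does not depend on how any particular Python build stores
Unicode strings.

```python
import struct

def to_utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

def splits_pair(units, index):
    """True if index falls between a high surrogate and its low surrogate."""
    return (0 < index < len(units)
            and is_high_surrogate(units[index - 1])
            and is_low_surrogate(units[index]))

def slice_units(units, start, stop):
    """Low-level slice: addresses raw code units and may split a pair."""
    return units[start:stop]

def slice_on_char_boundaries(units, start, stop):
    """Separate operation: widen a non-empty range so no pair is split."""
    if stop <= start:
        return []
    if splits_pair(units, start):
        start -= 1  # pull the start back onto the high surrogate
    if splits_pair(units, stop):
        stop += 1   # push the end past the low surrogate
    return units[start:stop]

# U+10000 is encoded in UTF-16 as the surrogate pair D800 DC00.
units = to_utf16_units("a\U00010000b")   # [0x61, 0xD800, 0xDC00, 0x62]
raw = slice_units(units, 0, 2)           # [0x61, 0xD800], pair is split
safe = slice_on_char_boundaries(units, 0, 2)  # [0x61, 0xD800, 0xDC00]
```

Whether the range should be widened, narrowed, or rejected when it
lands inside a pair is a design choice; widening is used here only
because it matches the "shift the range to accommodate surrogate pairs"
wording in the quoted text.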
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/