[I18n-sig] Re: How does Python Unicode treat surrogates?

M.-A. Lemburg mal@lemburg.com
Mon, 25 Jun 2001 22:28:45 +0200

Mark Davis wrote:
> That is an interesting approach; one that basically amounts to some
> convenience functions. For example, instead of writing:
> myString.substring(myString.cpToIndex(3), myString.cpToIndex(5));
> you could write:
> myString.substring(3, 5, myString.CODEPOINT);
> This hides some of the work, when someone is working in code points. The
> performance cost is still there, of course; using code point indexes
> requires each operation to examine every code unit up to that point, which
> is much more expensive.

Good idea!

> For a general programming language or string library, I'm not sure about
> implementing this pattern throughout. I know in the ICU library, for
> example, we have a significant number of functions that take offsets into
> strings. Having such a parameter on all of them would be clumsy, when most
> of the time people are simply working in code units.

In Python this would certainly be an elegant way to add the
code point indexing functionality (Python supports optional arguments
with default values).
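
A rough sketch of how that could look, not a proposal for the actual
API. The names substring, cp_to_index, CODEUNIT and CODEPOINT are
hypothetical, borrowed from Mark's example. To keep it runnable on a
current Python 3, where str is already indexed by code point, the
string is modelled as a list of UTF-16 code units:

```python
# Hypothetical names throughout; this only illustrates the optional-argument idea.
CODEUNIT = "codeunit"
CODEPOINT = "codepoint"


def to_units(text):
    """Return the UTF-16 code units of a str as a list of ints."""
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def from_units(units):
    """Inverse of to_units: build a str from UTF-16 code units."""
    data = b"".join(u.to_bytes(2, "little") for u in units)
    return data.decode("utf-16-le", "surrogatepass")


def cp_to_index(units, cp_index):
    """Map a code point index to a code unit offset.

    This is the linear scan Mark mentions: every code unit before the
    requested position has to be examined.
    """
    i = 0
    n = len(units)
    for _ in range(cp_index):
        if i >= n:
            raise IndexError("code point index out of range")
        # A high surrogate followed by a low surrogate is one code point.
        if (0xD800 <= units[i] <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            i += 2
        else:
            i += 1
    return i


def substring(units, start, end, unit=CODEUNIT):
    """Slice by code units (the default) or, on request, by code points."""
    if unit == CODEPOINT:
        start = cp_to_index(units, start)
        end = cp_to_index(units, end)
    elif unit != CODEUNIT:
        raise ValueError("unit must be CODEUNIT or CODEPOINT")
    return units[start:end]


s = to_units("a\U00010400b\U00010401c")        # 5 code points, 7 code units
by_units = from_units(substring(s, 1, 3))              # the surrogate pair only
by_points = from_units(substring(s, 1, 3, CODEPOINT))  # U+10400 followed by "b"
```

Callers that work in code units never see the extra argument; only
code that explicitly asks for CODEPOINT pays for the scan.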

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/