[I18n-sig] Re: How does Python Unicode treat surrogates?

Mark Davis mark@macchiato.com
Mon, 25 Jun 2001 12:27:07 -0700

That is an interesting approach; one that basically amounts to some
convenience functions. For example, instead of writing:

myString.substring(myString.cpToIndex(3), myString.cpToIndex(5));

you could write:

myString.substring(3, 5, myString.CODEPOINT);

This hides some of the work when someone is working in code points. The
performance cost is still there, of course: resolving a code point index
requires examining every code unit up to that point, which is much more
expensive than indexing directly by code unit.
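
To make that cost concrete, here is a minimal Python sketch, assuming the
string is stored as UTF-16 code units. The names utf16_units, cp_to_index,
substring and to_text are hypothetical helpers written for this
illustration (cp_to_index mirrors the cpToIndex call above); they are not
taken from ICU or any other library.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def to_text(units):
    """Turn a list of UTF-16 code units back into a Python string."""
    data = b"".join(u.to_bytes(2, "little") for u in units)
    return data.decode("utf-16-le", errors="surrogatepass")


def cp_to_index(units, cp_index):
    """Map a code point index to a code unit index.

    The scan always starts at the beginning, so the cost grows with
    cp_index; a plain code unit index needs no scan at all.
    """
    i = 0
    for _ in range(cp_index):
        if i >= len(units):
            raise IndexError("code point index out of range")
        # A high surrogate followed by a low surrogate is a single
        # code point that occupies two code units.
        if (0xD800 <= units[i] <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            i += 2
        else:
            i += 1
    return i


def substring(units, start, end, unit="code_unit"):
    """Slice by code units (default) or, on request, by code points."""
    if unit == "code_point":
        start = cp_to_index(units, start)
        end = cp_to_index(units, end)
    return units[start:end]


# U+10400 and U+1D11E each need a surrogate pair in UTF-16.
s = utf16_units("a\U00010400bcd\U0001D11Ee")
by_units = to_text(substring(s, 3, 5))
by_code_points = to_text(substring(s, 3, 5, unit="code_point"))
```

With this input the two calls select different text, because the surrogate
pair near the start shifts every later code point index by one code unit.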

For a general programming language or string library, I'm not sure about
implementing this pattern throughout. I know in the ICU library, for
example, we have a significant number of functions that take offsets into
strings. Having such a parameter on all of them would be clumsy, when most
of the time people are simply working in code units.


----- Original Message -----
From: "J M Sykes" <mike.sykes@acm.org>
To: "Mark Davis" <mark@macchiato.com>; "M.-A. Lemburg" <mal@lemburg.com>;
"Gaute B Strokkenes" <gs234@cam.ac.uk>
Cc: "Tim Peters" <tim.one@home.com>; <i18n-sig@python.org>; "Unicode List"
Sent: Monday, June 25, 2001 10:38
Subject: Re: How does Python Unicode treat surrogates?

> Mark Davis said:
> >
> > In most people's experience, it is best to leave the low level
> > with indices in terms of code units, then supply some utility routines
> > that tell you information about code points. ...
> Anyone on the list interested in the treatment of UCS aka Unicode in
> programming languages might like to know that a meeting of ISO/IEC JTC 1/SC
> 32/WG 3 recently approved a paper that specifies how SQL implementations
> should do it.
> The proposal can be found at:
> The current CD of the next SQL standard (ISO/IEC 9075), as amended by this
> proposal (and many others) can be found at:
> 01-06.pdf
> Briefly, the SQL functions CHARACTER_LENGTH, POSITION (the SQL string
> indexing function), and SUBSTRING will all accept a parameter specifying
> units to be used, the alternatives being OCTETS, CODE_UNITS and CHARACTERS
> (which to SQL means code points); the default being characters.
> This proposal was agreed by major SQL implementors.
> Which doesn't mean that it's right, nor that it can't be changed. But
> that's how it is at the moment.
> Mike.
> ***********************************************************
> J M Sykes              Email: Mike.Sykes@acm.org
> 97 Oakdale Drive
> Heald Green
> Cheshire   SK8 3SN
> UK                        Tel: (44) 161 437 5413
> ***********************************************************
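
As a rough illustration of the three units named in the quoted message
(OCTETS, CODE_UNITS and CHARACTERS), the following Python sketch reports
the length of one string in each. It assumes a UTF-16 encoding for the
octet and code unit counts; character_length is a hypothetical helper
that only borrows the SQL function's name and is not an implementation
of the SQL proposal.

```python
def character_length(text, units="CHARACTERS"):
    """Length of text in the requested unit (assumes UTF-16 storage)."""
    if units == "CHARACTERS":
        # Python 3 strings are indexed by code point.
        return len(text)
    encoded = text.encode("utf-16-le")
    if units == "CODE_UNITS":
        # Each UTF-16 code unit is two octets.
        return len(encoded) // 2
    if units == "OCTETS":
        return len(encoded)
    raise ValueError("unknown unit: " + units)


# One supplementary character (U+10400) between two ASCII letters.
sample = "a\U00010400b"
lengths = {u: character_length(sample, u)
           for u in ("CHARACTERS", "CODE_UNITS", "OCTETS")}
```

The counts only diverge when the string contains characters outside the
Basic Multilingual Plane or, for octets, when a code unit is wider than
one byte.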