[Python-3000] How will unicode get used?

Wed Sep 20 19:09:14 CEST 2006

On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:

> "Adam Olsen" <rhamph at gmail.com> wrote:
> > Before we can decide on the internal representation of our unicode
> > objects, we need to decide on their external interface.  My thoughts
> > so far:

> I believe the only option up for actual decision is what the internal
> representation of a unicode object will be.

If I request string[4:7], what internal format will the resulting
string have?

The same format as the original string?
A canonical format?
The narrowest possible for the new string?
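
To make the third option concrete, here is a minimal sketch of how the
"narrowest possible" fixed width for a slice could be chosen.  The
helper name narrowest_width is hypothetical, and the 1/2/4-byte widths
are an assumption for illustration, not a description of any existing
implementation.

```python
def narrowest_width(text):
    """Return the bytes per code point (1, 2, or 4) that a fixed-width
    store would need, judged by the largest code point in text."""
    largest = max(map(ord, text), default=0)
    if largest < 0x100:
        return 1          # fits in one byte per code point
    if largest < 0x10000:
        return 2          # fits in two bytes per code point
    return 4              # needs the full four bytes


original = "na\u00efve \u4e2d\u6587 \U0001d11e clef"

print(narrowest_width(original))        # whole string, includes U+1D11E
print(narrowest_width(original[0:5]))   # slice with only small code points
print(narrowest_width(original[6:8]))   # slice with two CJK characters
```

Under this scheme a slice can come back narrower than its parent, at
the cost of scanning the slice once to find its largest code point.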

When a recoding occurs, is that in addition to the original format, or
instead of?  (I think "in addition" would be useful, as we're likely
to need that original format back for output -- but it does waste
space when we don't need the original again.)

> Further, any rstrip/split/etc. methods need to scan/parse the entire
> string in order to discover code point starts/ends when using a utf-*
> variant as an internal encoding (except for utf-32, which has a constant
> width per character).

No.  That is true of some encodings, but not of the UTF variants.  In
those, a single byte (or 16-bit unit, for UTF-16) is unambiguous about
its role.

Within a specific encoding, each possible (byte or double-byte) value
represents at most one of

    a complete value
    the start of a multi-position value
    the continuation of a multi-position value
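
The three roles above can be read straight off the bit pattern of a
single unit.  The following is a sketch only; the function names are
made up for illustration, and last_code_point_start assumes valid,
non-empty UTF-8 input.

```python
def utf8_role(byte):
    """Classify one UTF-8 byte by its value alone."""
    if byte < 0x80:
        return "complete"        # 0xxxxxxx: a one-byte code point
    if byte < 0xC0:
        return "continuation"    # 10xxxxxx: second or later byte
    if byte in (0xC0, 0xC1) or byte >= 0xF5:
        return "invalid"         # never appears in well-formed UTF-8
    return "start"               # first byte of a multi-byte sequence


def utf16_role(unit):
    """Classify one 16-bit UTF-16 code unit by its value alone."""
    if 0xD800 <= unit <= 0xDBFF:
        return "start"           # high surrogate
    if 0xDC00 <= unit <= 0xDFFF:
        return "continuation"    # low surrogate
    return "complete"


def last_code_point_start(data):
    """Offset of the final code point in valid, non-empty UTF-8 data,
    found by stepping backward from the end."""
    i = len(data) - 1
    while utf8_role(data[i]) == "continuation":
        i -= 1
    return i


data = "abc \u20ac".encode("utf-8")
start = last_code_point_start(data)
print(start, data[start:].decode("utf-8"))
```

Because the backward step stops at the first non-continuation byte,
something like rstrip can work from the end of the buffer without
parsing from the front.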

That said, string[47:-34] may need to parse the whole string, just to
count double-position characters.  (To be honest, I'm not sure even
then; for UTF-16 it might make sense to treat surrogates as
double-width characters.  Even for UTF-8, there might be a workaround
that speeds up the majority of strings.)
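
For illustration, this is roughly what index-based slicing over a UTF-8
buffer involves: the code point index has to be turned into a byte
offset by counting.  The names utf8_offset and utf8_slice are
hypothetical, the input is assumed to be valid UTF-8, and only
non-negative indexes are handled.

```python
def utf8_offset(data, index):
    """Byte offset of code point number `index` in valid UTF-8 data.

    This is a linear scan: every byte that is not a continuation byte
    marks the start of one code point.
    """
    if index < 0:
        raise ValueError("negative indexes are not handled in this sketch")
    seen = 0
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # not 10xxxxxx
            if seen == index:
                return offset
            seen += 1
    if seen == index:
        return len(data)                 # one past the end, valid as a bound
    raise IndexError("code point index out of range")


def utf8_slice(data, start, stop):
    """Slice by code point index, returning UTF-8 bytes."""
    return data[utf8_offset(data, start):utf8_offset(data, stop)]


data = "a\u00e9\u20ac\U0001d11ez".encode("utf-8")   # 1, 2, 3, 4 and 1 bytes
print(utf8_offset(data, 3))
print(utf8_slice(data, 1, 4).decode("utf-8"))
```

A negative bound could be located the same way by counting backward
from the end of the buffer, so the cost depends on how far each bound
is from the nearer end.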

> Giving each string a fixed-width per character allows
> methods on those unicode strings to be far simpler in implementation.

Which is why that was done in Py2K.  The question for Py3K is

    Should we *commit* to this particular representation and allow
    direct access to the internals?

    Or should we treat the internals as opaque, and allow more
    efficient representations if someone wants to write one?

Today, I can go ahead and write my own string representation, but if I
change the internal storage, I can't actually use it with most
compiled extensions.

> > * Grapheme clusters, words, lines, other groupings, do we need/want
> > ways to slice based on them too?

> No.

I assume that you don't really mean strings will stop supporting split().

> > * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> > want to support them?  Now would be the time.

> This would imply a tree-based string,

Cheap slicing wouldn't.
Cheap concatenation in *all* cases would.
Cheap concatenation in a few lucky cases wouldn't.
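
A sketch of why neither cheap slicing nor the lucky-case concatenation
needs a tree: a slice can be a (buffer, start, stop) triple that shares
its parent's storage.  StrView and its methods are hypothetical names
for illustration, not an existing or proposed API.

```python
class StrView:
    """A slice that shares its parent's buffer instead of copying it."""

    __slots__ = ("_base", "_start", "_stop")

    def __init__(self, base, start=0, stop=None):
        self._base = base
        self._start = start
        self._stop = len(base) if stop is None else stop

    def __len__(self):
        return self._stop - self._start

    def slice(self, start, stop):
        """O(1): only the two offsets change; bounds are clamped."""
        length = len(self)
        start = min(max(start, 0), length)
        stop = min(max(stop, start), length)
        return StrView(self._base, self._start + start, self._start + stop)

    def concat(self, other):
        """O(1) in the lucky case of adjacent views on one buffer,
        otherwise an ordinary copying concatenation."""
        if other._base is self._base and other._start == self._stop:
            return StrView(self._base, self._start, other._stop)
        return StrView(str(self) + str(other))

    def __str__(self):
        return self._base[self._start:self._stop]   # copies only on demand


s = StrView("hello, world")
a = s.slice(0, 5)
b = s.slice(5, 12)
print(str(a), "|", str(a.concat(b)), "|", str(b.concat(a)))
```

The usual trade-off with this design is that a small view keeps the
whole parent buffer alive.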

> it would exclude the possibility for
> offering the single-segment buffer interface, without reprocessing.

I'm not sure exactly what you mean here.  If you just mean "C code
can't get at the internals without warning", then that is true.

It is also true that any function requesting the internals would need
to either get the encoding along with them, or work with raw bytes.

If the C code wants that buffer in a specific encoding, it will have
to request that, which might well require reprocessing.  But if so,
then this recoding already happens today -- it is just that today, we
do it for every string, instead of only the ones that need it.  (But
today, the recoding happens earlier, which can be better for
debugging.)
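
As a rough sketch of recoding on request, the class below keeps the
bytes it was created from and produces another encoding only when a
caller asks for it, caching the result "in addition to" the original.
LazyText and as_buffer are hypothetical names; the only real library
calls are codecs.lookup, bytes.decode and str.encode.

```python
import codecs


class LazyText:
    """Holds its original bytes and recodes only when asked to."""

    def __init__(self, raw, encoding):
        self._source = codecs.lookup(encoding).name   # normalized name
        self._cache = {self._source: bytes(raw)}      # no validation here

    def as_buffer(self, encoding):
        """Return the text as bytes in `encoding`, recoding at most once."""
        key = codecs.lookup(encoding).name
        if key not in self._cache:
            original = self._cache[self._source]
            self._cache[key] = original.decode(self._source).encode(key)
        return self._cache[key]


text = LazyText("h\u00e9llo".encode("utf-8"), "UTF-8")
same = text.as_buffer("utf-8")          # original bytes, no recoding
wide = text.as_buffer("utf-16-le")      # recoded now, then cached

bad = LazyText(b"\xff", "utf-8")        # accepted without complaint
try:
    bad.as_buffer("latin-1")
except UnicodeDecodeError as exc:
    print("error surfaces only at first recoding:", exc.reason)
```

The last few lines show the debugging cost mentioned above: malformed
input is not noticed at construction, only when some caller first
requests a different encoding.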

-jJ