[Python-3000] How will unicode get used?

Wed Sep 20 23:20:22 CEST 2006

"Jim Jewett" <jimjjewett at gmail.com> wrote:
> On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> 
> > "Adam Olsen" <rhamph at gmail.com> wrote:
> > > Before we can decide on the internal representation of our unicode
> > > objects, we need to decide on their external interface.  My thoughts
> > > so far:
> 
> > I believe the only option up for actual decision is what the internal
> > representation of a unicode object will be.
> 
> If I request string[4:7], what (format of string) will come back?
> 
> The same format as the original string?
> A canonical format?
> The narrowest possible for the new string?

Which of the three depends on the choice of internal representation.  If
the internal representation is always canonical, narrowest, or same as
the original string, then it would be one of those.

> When a recoding occurs, is that in addition to the original format, or
> instead of?  (I think "in addition" would be useful, as we're likely
> to need that original format back for output -- but it does waste
> space when we don't need the original again.)

The current implementation, I believe, uses "in addition", unless I'm
misreading the unicode string struct.

> > Further, any rstrip/split/etc. methods need to scan/parse the entire
> > string in order to discover code point starts/ends when using a utf-*
> > variant as an internal encoding (except for utf-32, which has a constant
> > width per character).
> 
> No.  That is true of some encodings, but not the UTF variants.  A byte
> (or double-byte, for UTF-16) is unambiguous.

I was under the impression that utf-8 was a particular kind of prefix
encoding.  Looking at the actual output of utf-8, I notice that bytes
with value >= 0xc0 mark the beginning of each multi-byte sequence (the
continuation bytes all fall in 0x80-0xbf), so scanning 'from the front'
or 'from the back' is equally reasonable.

> That said, string[47:-34] may need to parse the whole string, just to
> count double-position characters.  (To be honest, I'm not sure even
> then; for UTF-16 it might make sense to treat surrogates as
> double-width characters.  Even for UTF-8, there might be a workaround
> that speeds up the majority of strings.)

It would involve keeping some sort of cache of indices/offset values. 
This may not be worthwhile.
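
To make the cost concrete, here is one way such a cache might look.  This
is only a sketch: the class name, the `stride` parameter, and the choice
to record a byte offset every `stride` code points are all assumptions
made for illustration, and only non-negative indices are handled.

```python
class Utf8Index:
    """Hypothetical helper mapping code point indices to byte offsets."""

    def __init__(self, data, stride=4):
        self.data = data
        self.stride = stride
        # checkpoints[k] is the byte offset of code point number k * stride.
        self.checkpoints = []
        count = 0
        for offset, byte in enumerate(data):
            if (byte & 0xC0) != 0x80:  # not a continuation byte
                if count % stride == 0:
                    self.checkpoints.append(offset)
                count += 1
        self.length = count

    def byte_offset(self, index):
        """Byte offset of code point `index`, for 0 <= index <= length."""
        if not 0 <= index <= self.length:
            raise IndexError(index)
        if index == self.length:
            return len(self.data)
        # Jump to the nearest earlier checkpoint, then walk forward.
        offset = self.checkpoints[index // self.stride]
        remaining = index % self.stride
        while remaining:
            offset += 1
            if (self.data[offset] & 0xC0) != 0x80:
                remaining -= 1
        return offset

    def slice(self, start, stop):
        lo, hi = self.byte_offset(start), self.byte_offset(stop)
        return self.data[lo:hi].decode("utf-8")
```

Building the table is one O(n) pass plus extra memory per string, and
each lookup still walks up to `stride - 1` code points, which is the
overhead being weighed here.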

> > Giving each string a fixed-width per character allows
> > methods on those unicode strings to be far simpler in implementation.
> 
> Which is why that was done in Py 2K.  The question for Py3K is
> 
>     Should we *commit* to this particular representation and allow
> direct access to the internals?

Why not?

>     Or should we treat the internals as opaque, and allow more
> efficient representations if someone wants to write one.

I'm not sure that the efficiencies are necessarily desirable.

> Today, I can go ahead and write my own string representation, but if I
> change the internal storage, I can't actually use it with most
> compiled extensions.

Right, but extensions that are used *right now* would need to be
rewritten to handle these "more efficient" representations.

> > > * Grapheme clusters, words, lines, other groupings, do we need/want
> > > ways to slice based on them too?
> 
> > No.
> 
> I assume that you don't really mean strings will stop supporting split()

That would be silly.  What I meant was that text.word[7], text.line[3],
etc., shouldn't mean anything on the base implementation.

> > > * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> > > want to support them?  Now would be the time.
> 
> > This would imply a tree-based string,
> 
> Cheap slicing wouldn't.

O(log(n)) would imply a tree-based string.  O(1) would imply slicing on
text returning views (which I'm not even advocating, and I'm a view
proponent).
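
For readers unfamiliar with the distinction, the difference between a
view and a copy can be shown with bytes in present-day Python, where
`memoryview` gives O(1) slices over a buffer.  This is only an analogy:
it is not a proposal for how unicode slicing should behave.

```python
data = bytearray(b"hello world")

view = memoryview(data)[6:11]  # O(1): refers to the same memory as data
copy = bytes(data[6:11])       # O(k): an independent copy of 5 bytes

data[6] = ord("W")             # mutate the underlying buffer in place

view_now = bytes(view)         # reflects the change made to data
```

The view keeps the whole underlying buffer alive for as long as it
exists, which is one of the usual arguments against returning views from
ordinary slicing.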

> Cheap concatenation in *all* cases would.
> Cheap concatenation in a few lucky cases wouldn't.

Presumably one would need to copy data from one to the other, so that
would be O(n) with a non-tree version.

> > it would exclude the possibility for
> > offering the single-segment buffer interface, without reprocessing.
> 
> I'm not sure exactly what you mean here.  If you just mean "C code
> can't get at the internals without warning", then that is true.

The single-segment buffer interface is, not uncommonly, how C extensions
get at the content of strings, unicode, array, mmap, etc.  Technically
speaking, the current implementations of str and unicode use an internal
variant to gain access to their own internals for processing.

> It is also true that any function requesting the internals would need
> to either get the encoding along with it, or work with bytes.

Or code points...  The point of specifying the character width as 1, 2, or
4 bytes would be that one can iterate over chars, shorts, or ints.
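
A small Python model of that idea, assuming the width is picked per
string from its largest code point.  The function names are hypothetical
and the C side is only imitated here with a flat bytes buffer:

```python
import sys


def narrowest_width(text):
    """Bytes per code point for a fixed-width array: 1, 2, or 4."""
    top = max(map(ord, text), default=0)
    if top < 0x100:
        return 1
    if top < 0x10000:
        return 2
    return 4


def to_fixed_width(text):
    """Pack `text` into a flat buffer, one fixed-size unit per code point."""
    width = narrowest_width(text)
    units = (ord(ch).to_bytes(width, sys.byteorder) for ch in text)
    return width, b"".join(units)


def code_points(width, buffer):
    """Walk the buffer unit by unit, as C would over chars/shorts/ints."""
    for i in range(0, len(buffer), width):
        yield int.from_bytes(buffer[i:i + width], sys.byteorder)
```

With a fixed width, the code point at index `i` always lives at byte
offset `i * width`, so indexing and slicing need no scanning.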

> If the C code wants that buffer in a specific encoding, it will have
> to request that, which might well require reprocessing.  But if so,
> then this recoding already happens today -- it is just that today, we
> do it for every string, instead of only the ones that need it.  (But
> today, the recoding happens earlier, which can be better for
> debugging.)

Indeed.  But it's not just for C extensions, it's for Python's own
string/unicode internals.  Simple is better than complex.  Having a flat
array-based implementation is simple, and allows us to re-use the vast
majority of code we already have.

 - Josiah