[Python-3000] How will unicode get used?

Sat Sep 23 20:03:43 CEST 2006

"Martin v. Löwis" <martin at v.loewis.de> wrote:
> David Hopwood wrote:
[snip]
> > Should we nevertheless try to avoid making the use of Unicode strings
> > unnecessarily difficult for people who have minimal knowledge of Unicode?
> > Absolutely, but not at the expense of making basic operations on strings
> > asymptotically less efficient. O(1) indexing and slicing is a basic
> > requirement, even if it has to be done using code units.
> 
> It's not possible to implement slicing in constant time, unless string
> views are introduced. Currently, slicing takes time linear with the
> length of the result string.

I believe he was referring to discovering the memory address where
slicing should begin.  In the case of Latin-1, UCS-2, or UCS-4, given a
starting address and some position i, it is trivial to compute the
memory position of character i: it is the starting address plus i times
the code unit size.  In the case of UTF-8, given a starting address and
some position i, one needs to scan the UTF-8 representation from the
start to discover the memory position of character i.
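
To make the difference concrete, here is a rough sketch in Python
(Python 3 syntax, operating on bytes objects).  The function names are
made up for illustration and this is not how the interpreter does it;
it also assumes the buffer holds valid UTF-8.

```python
def fixed_width_offset(i, unit_size):
    """Byte offset of character i when every character takes unit_size bytes.

    A single multiplication, so O(1) regardless of i.
    """
    return i * unit_size


def utf8_offset(buf, i):
    """Byte offset of character i in the UTF-8 bytes object buf.

    Walks the buffer one character at a time, using each lead byte to
    learn how many bytes that character occupies, so O(i).
    Assumes buf is valid UTF-8 with at least i characters.
    """
    pos = 0
    for _ in range(i):
        lead = buf[pos]
        if lead < 0x80:      # 0xxxxxxx: 1-byte sequence
            pos += 1
        elif lead < 0xE0:    # 110xxxxx: 2-byte sequence
            pos += 2
        elif lead < 0xF0:    # 1110xxxx: 3-byte sequence
            pos += 3
        else:                # 11110xxx: 4-byte sequence
            pos += 4
    return pos


text = "na\u00efve \u2603 text"        # character 6 is U+2603
utf8 = text.encode("utf-8")
ucs4 = text.encode("utf-32-le")       # 4 bytes per character, no BOM

# Locate character 6 in each buffer and decode just that character.
utf8_char = utf8[utf8_offset(utf8, 6):utf8_offset(utf8, 7)].decode("utf-8")
ucs4_char = ucs4[fixed_width_offset(6, 4):fixed_width_offset(7, 4)].decode("utf-32-le")
```

Both lookups find the same character; the only difference is that the
UTF-8 one has to visit every character before position i to get there.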

Having recently reminded myself what is stored in a unicode string (and
verified it by checking the source), the question in my mind is whether
we want to stick with the same two-representation implementation
(default encoding plus UTF-16 or UCS-4, depending on build), or go with
more or fewer representations.

We can reduce memory consumption by using a single representation,
whether it is fixed for all strings or chosen per string based on
content, though in some cases (UTF-16, UCS-4) we would lose the
'native' single-segment char (C char) buffer interface.
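
As a rough illustration of the memory side, the sketch below compares
the payload size of a string under a few fixed encodings, and shows one
possible "variable based on content" rule: pick the narrowest
fixed-width unit that can hold every character.  The helper names are
hypothetical, and the numbers ignore any per-object header overhead.

```python
def encoded_sizes(text):
    """Payload size in bytes of text under several representations.

    The "-le" codecs are used so that no byte order mark is counted.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "ucs-4": len(text.encode("utf-32-le")),
    }


def narrowest_unit_size(text):
    """Smallest fixed width (1, 2 or 4 bytes) that fits every character.

    1 byte covers Latin-1, 2 bytes covers the rest of the BMP, and
    4 bytes covers everything else.
    """
    top = max(map(ord, text), default=0)
    if top < 0x100:
        return 1
    if top < 0x10000:
        return 2
    return 4


ascii_sizes = encoded_sizes("hello")
per_char = [narrowest_unit_size(s) for s in ("hello", "\u2603", "\U0001F600")]
```

For pure ASCII text the UCS-4 payload is four times the UTF-8 one, which
is where a content-dependent width would save the most.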

Using multiple representations, and choosing those representations
carefully based on platform (always keep UTF-8 as one of the
representations on Linux, always keep UTF-16 as one of the
representations on Windows), we may be able to increase platform API
calling speed, if such is desirable.

After re-reading the source and thinking a bit more, my only real
concern is the memory use of Python 3.x.  The current implementation
works, so I'm +1 on keeping it "as is".  I'm also +0 on an
implementation that would reduce memory use (with limited slowdown, if
any) on as many platforms as possible; no higher than +0 because
changing the underlying implementation would be a PITA.

 - Josiah