[Python-3000] How will unicode get used?

Wed Sep 20 19:47:39 CEST 2006

On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Adam Olsen" <rhamph at gmail.com> wrote:
> > Before we can decide on the internal representation of our unicode
> > objects, we need to decide on their external interface.  My thoughts
> > so far:
>
> I believe the only question actually up for decision is what the internal
> representation of a unicode object will be.  Utf-8 that is never changed?
> Utf-8 that is converted to ucs-2/4 on certain kinds of accesses?
> Latin-1/ucs-2/ucs-4 depending on code point content?  Always ucs-2/4,
> depending on compiler switch?

Just a minor nit.  I doubt we could accept UCS-2; we'd want UTF-16
instead, with all the variable-width goodness that brings.

Or maybe not so minor.  Old versions of Windows used UCS-2; new
versions use UTF-16.  The former should get an error if a character
above U+FFFF is used, and the latter will need conversion if we're not
using UTF-16 internally.
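
To make the difference concrete, here is a small sketch.  It assumes a
Python whose strings hold full code points, and the helper name
fits_ucs2 is mine, purely for illustration.  A code point above U+FFFF
cannot be stored in UCS-2 at all, while UTF-16 spends two 16-bit units
(a surrogate pair) on it.

```python
def fits_ucs2(text):
    """Return True if every code point fits in a single 16-bit unit."""
    return all(ord(ch) <= 0xFFFF for ch in text)

bmp = "caf\u00e9"        # every code point is inside the BMP
astral = "\U0001D11E"    # MUSICAL SYMBOL G CLEF, above U+FFFF

# UTF-16 needs a surrogate pair (two 16-bit units) for the astral character.
utf16_units = len(astral.encode("utf-16-le")) // 2
```

A UCS-2 target would have to reject `astral`; a UTF-16 target accepts
it but no longer has one unit per character.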

> > * Most transformation and testing methods (.lower(), .islower(), etc)
> > can be copied directly from 2.x.  They require no special
> > implementation to perform reasonably.
>
> A decoding variant of these would be required if the underlying
> representation of a particular string is not latin-1, ucs-2, or ucs-4.

That makes no sense.  They can operate on any encoding we design them
for.  The cost is always O(n) in the length of the string.
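
As a rough illustration (not a proposed implementation), here is a
single-pass lowercasing routine that walks UTF-8 bytes directly.  The
names iter_code_points and utf8_lower are hypothetical, and the sketch
assumes the input is already well-formed UTF-8.

```python
def iter_code_points(data):
    """Yield code points from well-formed UTF-8 bytes in one pass."""
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        # The lead byte tells us how wide this code point is.
        if b < 0x80:
            width, cp = 1, b
        elif b < 0xE0:
            width, cp = 2, b & 0x1F
        elif b < 0xF0:
            width, cp = 3, b & 0x0F
        else:
            width, cp = 4, b & 0x07
        # Each continuation byte contributes its low six bits.
        for k in range(1, width):
            cp = (cp << 6) | (data[i + k] & 0x3F)
        yield cp
        i += width

def utf8_lower(data):
    """Lowercase UTF-8 bytes, visiting each byte once."""
    lowered = "".join(chr(cp).lower() for cp in iter_code_points(data))
    return lowered.encode("utf-8")
```

Every byte is visited once, so the work stays linear in the length of
the string whatever the internal encoding is.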

> Further, any rstrip/split/etc. methods need to scan/parse the entire
> string in order to discover code point starts/ends when using a utf-*
> variant as an internal encoding (except for utf-32, which has a constant
> width per character).

See below.

> Whether or not we choose to go with a varying internal representation
> (the latin-1/ucs-2/ucs-4 variant I have been suggesting),
>
>
> > * Indexing and slicing is the big issue.  Do we need constant-time
> > integer slicing?  .find() could be changed to return a token that
> > could be used as a constant-time offset.  Incrementing the token would
> > have linear costs, but that's no big deal if the offsets are always
> > small.
>
> If by "constant-time integer slicing" you mean "find the start and end
> memory offsets of a slice in constant time", I would say yes.
>
> Generally, I think tokens (in unicode strings) are a waste of time and
> implementation.  Giving each string a fixed-width per character allows
> methods on those unicode strings to be far simpler in implementation.

s = 'foobar'
p = s[s.find('bar'):] == 'bar'

Even if .find() is made to return a token, rather than an integer, the
behavior and performance of this example are unchanged.

However, I can imagine there might be use cases, such as the .find()
output on one string being used to slice a different string, which
tokens wouldn't support.  I haven't been able to dream up any sane
examples, which is why I asked about it here.  I want to see specific
examples showing that tokens won't work.
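
Here is a minimal sketch of what I mean by a token, assuming a
hypothetical Utf8String type stored as UTF-8.  In this sketch the token
is simply a byte offset, and slice_from is an invented method name;
none of this is an existing API.

```python
class Utf8String:
    """Hypothetical string type stored internally as UTF-8 bytes."""

    def __init__(self, text):
        self._data = text.encode("utf-8")

    def find(self, needle):
        """Return an opaque token (here a byte offset), or None if absent."""
        pos = self._data.find(needle.encode("utf-8"))
        return None if pos < 0 else pos

    def slice_from(self, token):
        """Slice from a token; locating the start needs no code point count."""
        return self._data[token:].decode("utf-8")

s = Utf8String("h\u00e9llo foobar")
tok = s.find("bar")        # byte offset, not a code point index
tail = s.slice_from(tok)
```

The token is only meaningful for the string that produced it, which is
exactly the limitation I am asking about.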

Using only utf-8 would be simpler than three distinct representations.
And if memory usage is an issue (which it seems to be, albeit in a
vague way), we could make a custom encoding that's even simpler and
more space efficient than utf-8.
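
For a rough sense of the memory side, this sketch compares encoded
sizes of the same text.  The helper encoded_sizes is illustrative only,
and real objects would add per-string overhead that is not counted
here.

```python
def encoded_sizes(text):
    """Return the byte length of text in three fixed encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

ascii_sizes = encoded_sizes("foobar")
mixed_sizes = encoded_sizes("foobar\U0001D11E")
```

As I understand the latin-1/ucs-2/ucs-4 scheme, one character above
U+FFFF pushes the whole string to four bytes per character, which is
the "utf-32" figure for the second string.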

> > * Grapheme clusters, words, lines, other groupings, do we need/want
> > ways to slice based on them too?
>
> No.

Can you explain your reasoning?

> > * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> > want to support them?  Now would be the time.
>
> This would imply a tree-based string, which Guido has specifically
> stated would not happen.  Never mind that it would be a beast to
> implement and maintain or that it would exclude the possibility for
> offering the single-segment buffer interface, without reprocessing.

The only reference I found was this:
http://mail.python.org/pipermail/python-3000/2006-August/003334.html

I read that as him being very sceptical, not as an outright refusal.

Allowing external code to operate on a Python string in-place seems
tenuous at best.  Even with three types (Latin-1, UCS-2, UCS-4) you
would need to automatically copy and convert whenever the caller
expects a different type than the one the string happens to use.

-- 
Adam Olsen, aka Rhamphoryncus