[Python-3000] How will unicode get used?
jcarlson at uci.edu
Wed Sep 20 17:50:25 CEST 2006
"Adam Olsen" <rhamph at gmail.com> wrote:
> Before we can decide on the internal representation of our unicode
> objects, we need to decide on their external interface. My thoughts
> so far:
I believe the only question actually up for decision is what the internal
representation of a unicode object will be. Utf-8 that is never changed?
Utf-8 that is converted to ucs-2/4 on certain kinds of accesses?
Latin-1/ucs-2/ucs-4 depending on code point content? Always ucs-2/4,
depending on compiler switch?
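
To make the "depending on code point content" option concrete, here is a
minimal sketch of how the storage width could be picked per string. The
function name pick_storage is hypothetical and is not taken from any CPython
source; it only illustrates the selection rule.

```python
def pick_storage(text):
    """Return (name, bytes_per_code_point) for the narrowest fixed-width
    form that can hold every code point in `text`.

    Hypothetical helper for illustration only.
    """
    # The largest code point decides the width for the whole string.
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return "latin-1", 1
    if highest < 0x10000:
        return "ucs-2", 2
    return "ucs-4", 4
```

Under this rule a pure ASCII or Latin-1 string costs one byte per character,
and only strings that actually contain non-BMP code points pay for four.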
> * Most transformation and testing methods (.lower(), .islower(), etc)
> can be copied directly from 2.x. They require no special
> implementation to perform reasonably.
A decoding variant of these would be required if the underlying
representation of a particular string is not latin-1, ucs-2, or ucs-4.
Further, any rstrip/split/etc. methods need to scan/parse the entire
string in order to discover code point starts/ends when using a utf-*
variant as an internal encoding (except for utf-32, which has a constant
width per character).
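
The scanning cost can be sketched as follows. Both function names are
hypothetical, and the utf-8 version assumes its input is already valid utf-8;
it is meant to show the shape of the work, not a real implementation.

```python
def utf8_offset(data, index):
    """Byte offset of code point number `index` in valid utf-8 `data`.

    Has to walk the bytes from the start, so the cost is linear in the
    length of the string. Hypothetical helper for illustration only.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; anything else starts a
        # new code point.
        if (byte & 0xC0) != 0x80:
            if seen == index:
                return offset
            seen += 1
    if seen == index:
        return len(data)  # one past the end, usable as a slice end
    raise IndexError(index)


def utf32_offset(index):
    """Byte offset of code point number `index` in utf-32: plain arithmetic."""
    return 4 * index
```

With the fixed-width form the offset is a multiplication; with utf-8 every
lookup that starts from an integer index pays for a scan.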
Whether or not we choose to go with a varying internal representation
(the latin-1/ucs-2/ucs-4 variant I have been suggesting),
> * Indexing and slicing is the big issue. Do we need constant-time
> integer slicing? .find() could be changed to return a token that
> could be used as a constant-time offset. Incrementing the token would
> have linear costs, but that's no big deal if the offsets are always
If by "constant-time integer slicing" you mean "find the start and end
memory offsets of a slice in constant time", I would say yes.
Generally, I think tokens (in unicode strings) are a waste of time and
implementation. Giving each string a fixed width per character allows
methods on those unicode strings to be far simpler in implementation.
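
A rough sketch of what "find the start and end memory offsets of a slice in
constant time" means with a fixed width per character. The helper name
slice_byte_range is hypothetical, and the codecs are only stand-ins for the
internal buffers (utf-16-le matches ucs-2 as long as the text stays inside
the BMP).

```python
def slice_byte_range(start, stop, width):
    """Byte range covering code points [start, stop) in a buffer that
    stores every code point in exactly `width` bytes.

    Hypothetical helper for illustration only.
    """
    return start * width, stop * width


def fixed_width_slice(text, start, stop, codec, width):
    """Slice by byte offsets on an encoded buffer, then decode the result."""
    buf = text.encode(codec)
    lo, hi = slice_byte_range(start, stop, width)
    return buf[lo:hi].decode(codec)
```

For example, fixed_width_slice(s, 2, 7, "utf-32-le", 4) should agree with
s[2:7] for any string s, with no scanning needed to locate either end.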
> * Grapheme clusters, words, lines, other groupings, do we need/want
> ways to slice based on them too?
> * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> want to support them? Now would be the time.
This would imply a tree-based string, which Guido has specifically
stated would not happen. Never mind that it would be a beast to
implement and maintain, or that it would rule out offering the
single-segment buffer interface without reprocessing.
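
To illustrate the reprocessing point, here is a toy concatenation tree. The
Rope class is entirely hypothetical and far simpler than anything usable; it
only shows that cheap concatenation leaves the data scattered across leaves,
so a contiguous buffer has to be rebuilt on demand.

```python
class Rope:
    """Toy tree-based string whose leaves are ordinary str objects.

    Hypothetical sketch for illustration only.
    """

    def __init__(self, left, right=None):
        self.left = left
        self.right = right
        self.length = len(left) + (len(right) if right is not None else 0)

    def __len__(self):
        return self.length

    def __add__(self, other):
        # Concatenation allocates one node and copies no character data.
        return Rope(self, other)

    def flatten(self):
        """Rebuild one contiguous string; linear in the total length."""
        parts = []
        stack = [self]
        while stack:
            node = stack.pop()
            if isinstance(node, Rope):
                # Push right first so the left side is visited first.
                if node.right is not None:
                    stack.append(node.right)
                stack.append(node.left)
            else:
                parts.append(node)
        return "".join(parts)
```

Anything that wants a single block of memory has to call something like
flatten() first, which is the reprocessing step referred to above.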