[Python-3000] How will unicode get used?
rhamph at gmail.com
Wed Sep 20 19:47:39 CEST 2006
On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> "Adam Olsen" <rhamph at gmail.com> wrote:
> > Before we can decide on the internal representation of our unicode
> > objects, we need to decide on their external interface. My thoughts
> > so far:
> I believe the only options up for actual decision is what the internal
> representation of a unicode object will be. Utf-8 that is never changed?
> Utf-8 that is converted to ucs-2/4 on certain kinds of accesses?
> Latin-1/ucs-2/ucs-4 depending on code point content? Always ucs-2/4,
> depending on compiler switch?
Just a minor nit. I doubt we could accept UCS-2, we'd want UTF-16
instead, with all the variable-width goodness that brings in.
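A small illustration of the difference, written against today's Python codecs rather than anything proposed in this thread. The helper name utf16_units is made up for the example. A code point outside the BMP needs two 16-bit code units in UTF-16, which is exactly the case UCS-2 cannot represent:

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane.
clef = "\U0001D11E"
# U+20AC EURO SIGN lies inside the BMP.
euro = "\u20AC"


def utf16_units(text):
    """Number of 16-bit code units needed to store text as UTF-16.

    Hypothetical helper for illustration; "utf-16-le" is used so that
    no byte order mark is counted.
    """
    return len(text.encode("utf-16-le")) // 2


units_euro = utf16_units(euro)  # fits in a single unit, so also valid UCS-2
units_clef = utf16_units(clef)  # needs a surrogate pair, so not valid UCS-2
```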
Or maybe not so minor. Old versions of Windows used UCS-2; new
versions use UTF-16. The former should get errors if a character
above U+FFFF is used; the latter will need conversion if we're not using
> > * Most transformation and testing methods (.lower(), .islower(), etc)
> > can be copied directly from 2.x. They require no special
> > implementation to perform reasonably.
> A decoding variant of these would be required if the underlying
> representation of a particular string is not latin-1, ucs-2, or ucs-4.
That makes no sense. They can operate on any encoding we design them
for. The cost is always O(n) in the length of the string.
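As a rough sketch of that claim, here is an islower() that walks UTF-8 bytes directly in a single forward pass. The names iter_code_points and utf8_islower are invented for this example, the input is assumed to be well-formed UTF-8, and the per-character case test borrows Python's own str methods purely for brevity:

```python
def iter_code_points(data):
    """Yield code points from well-formed UTF-8 bytes in one forward pass."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        # The lead byte tells us how wide this sequence is.
        if lead < 0x80:
            width, cp = 1, lead
        elif lead < 0xE0:
            width, cp = 2, lead & 0x1F
        elif lead < 0xF0:
            width, cp = 3, lead & 0x0F
        else:
            width, cp = 4, lead & 0x07
        # Each continuation byte contributes its low six bits.
        for b in data[i + 1:i + width]:
            cp = (cp << 6) | (b & 0x3F)
        yield cp
        i += width


def utf8_islower(data):
    """Same rule as str.islower(), computed over UTF-8 bytes in O(n)."""
    cased = False
    for cp in iter_code_points(data):
        ch = chr(cp)  # assumption: reuse str's case tables per code point
        if ch.isupper() or ch.istitle():
            return False
        if ch.islower():
            cased = True
    return cased
```

Each byte is visited a constant number of times, so the cost stays linear whatever the width of the individual characters.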
> Further, any rstrip/split/etc. methods need to scan/parse the entire
> string in order to discover code point starts/ends when using a utf-*
> variant as an internal encoding (except for utf-32, which has a constant
> width per character).
> Whether or not we choose to go with a varying internal representation
> (the latin-1/ucs-2/ucs-4 variant I have been suggesting),
> > * Indexing and slicing is the big issue. Do we need constant-time
> > integer slicing? .find() could be changed to return a token that
> > could be used as a constant-time offset. Incrementing the token would
> > have linear costs, but that's no big deal if the offsets are always
> > small.
> If by "constant-time integer slicing" you mean "find the start and end
> memory offsets of a slice in constant time", I would say yes.
> Generally, I think tokens (in unicode strings) are a waste of time and
> implementation. Giving each string a fixed-width per character allows
> methods on those unicode strings to be far simpler in implementation.
s = 'foobar'
p = s[s.find('bar'):] == 'bar'
Even if .find() is made to return a token, rather than an integer, the
behavior and performance of this example are unchanged.
However, I can imagine there might be use cases, such as the .find()
output on one string being used to slice a different string, which
tokens wouldn't support. I haven't been able to dream up any sane
examples, which is why I asked about it here. I want to see specific
examples showing that tokens won't work.
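To make the token idea concrete, here is a minimal sketch assuming a UTF-8 backed string. Token, Utf8String, slice_from and advance are all hypothetical names for this example, not a proposed API. Note that the cross-string case mentioned above is rejected outright in this sketch:

```python
class Token:
    """Opaque position inside one particular Utf8String (hypothetical)."""

    def __init__(self, owner, byte_offset):
        self.owner = owner
        self.byte_offset = byte_offset


class Utf8String:
    """Toy string stored as UTF-8 bytes (hypothetical)."""

    def __init__(self, text):
        self.data = text.encode("utf-8")

    def find(self, needle):
        """Return a Token for the first match, or None if absent.

        A byte-level search is safe because well-formed UTF-8 can only
        match at code point boundaries.
        """
        offset = self.data.find(needle.encode("utf-8"))
        return None if offset < 0 else Token(self, offset)

    def slice_from(self, token):
        """Slice from token to the end; locating the start is O(1)."""
        if token.owner is not self:
            raise ValueError("token belongs to a different string")
        return self.data[token.byte_offset:].decode("utf-8")

    def advance(self, token, count=1):
        """Move token forward by count code points; linear in count."""
        if token.owner is not self:
            raise ValueError("token belongs to a different string")
        i = token.byte_offset
        for _ in range(count):
            if i >= len(self.data):
                raise IndexError("token advanced past end of string")
            i += 1
            # Skip continuation bytes (bit pattern 10xxxxxx).
            while i < len(self.data) and (self.data[i] & 0xC0) == 0x80:
                i += 1
        return Token(self, i)


s = Utf8String("f\u00f6\u00f6bar")
p = s.slice_from(s.find("bar")) == "bar"
```

The slice itself still copies bytes here; only finding the start offset is constant time.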
Using only utf-8 would be simpler than three distinct representations.
And if memory usage is an issue (which it seems to be, albeit in a
vague way), we could make a custom encoding that's even simpler and
more space efficient than utf-8.
> > * Grapheme clusters, words, lines, other groupings, do we need/want
> > ways to slice based on them too?
Can you explain your reasoning?
> > * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> > want to support them? Now would be the time.
> This would imply a tree-based string, which Guido has specifically
> stated would not happen. Never mind that it would be a beast to
> implement and maintain or that it would exclude the possibility for
> offering the single-segment buffer interface, without reprocessing.
The only reference I found was this:
I interpret that as him being very sceptical, not an outright refusal.
Allowing external code to operate on a python string in-place seems
tenuous at best. Even with three types (Latin-1, UCS-2, UCS-4) you
would need to automatically copy and convert if the wrong type is
Adam Olsen, aka Rhamphoryncus