[Python-3000] How will unicode get used?
Josiah Carlson
jcarlson at uci.edu
Wed Sep 20 23:59:22 CEST 2006
"Adam Olsen" <rhamph at gmail.com> wrote:
>
> On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> >
> > "Adam Olsen" <rhamph at gmail.com> wrote:
> > > Before we can decide on the internal representation of our unicode
> > > objects, we need to decide on their external interface. My thoughts
> > > so far:
> >
> > I believe the only option up for actual decision is what the internal
> > representation of a unicode object will be. UTF-8 that is never changed?
> > UTF-8 that is converted to UCS-2/4 on certain kinds of accesses?
> > Latin-1/UCS-2/UCS-4 depending on code point content? Always UCS-2/4,
> > depending on compiler switch?
>
> Just a minor nit. I doubt we could accept UCS-2, we'd want UTF-16
> instead, with all the variable-width goodness that brings in.
If we are opting for a *single* internal representation, then UTF-16 or
UTF-32 are really the only options.
> > > * Most transformation and testing methods (.lower(), .islower(), etc)
> > > can be copied directly from 2.x. They require no special
> > > implementation to perform reasonably.
> >
> > A decoding variant of these would be required if the underlying
> > representation of a particular string is not latin-1, ucs-2, or ucs-4.
>
> That makes no sense. They can operate on any encoding we design them
> to. The cost is always O(n) with the length of the string.
I was thinking .startswith() and .endswith(), but assuming *some*
canonical representation (UTF-16, UTF-32, etc.) this is trivial to
implement. I take back my concerns on this particular point.
> > Whether or not we choose to go with a varying internal representation
> > (the latin-1/ucs-2/ucs-4 variant I have been suggesting),
> >
> >
> > > * Indexing and slicing is the big issue. Do we need constant-time
> > > integer slicing? .find() could be changed to return a token that
> > > could be used as a constant-time offset. Incrementing the token would
> > > have linear costs, but that's no big deal if the offsets are always
> > > small.
> >
> > If by "constant-time integer slicing" you mean "find the start and end
> > memory offsets of a slice in constant time", I would say yes.
> >
> > Generally, I think tokens (in unicode strings) are a waste of time and
> > implementation. Giving each string a fixed-width per character allows
> > methods on those unicode strings to be far simpler in implementation.
>
> However, I can imagine there might be use cases, such as the .find()
> output on one string being used to slice a different string, which
> tokens wouldn't support. I haven't been able to dream up any sane
> examples, which is why I asked about it here. I want to see specific
> examples showing that tokens won't work.
p = s[6:-6]
Or even in actual code I use today:
p = s.lstrip()
lil = len(s) - len(p)
si = s[:lil]
lil += si.count('\t')*(self.GetTabWidth()-1)
#s is the original line
#p is the line without leading indentation
#si is the line indentation characters
#lil is the indentation of the line in columns
If I can't slice based on character index, then we end up with a similar
situation that the wxPython StyledTextCtrl runs into right now: the
content is encoded via utf-8 internally, so users have to use the fairly
annoying PositionBefore(pos) and PositionAfter(pos) methods to discover
where characters start/end. While it is possible to handle everything
this way, it is *damn annoying*, and some users have gone so far as to
say that it *doesn't work* for Europeans.
While I won't make the claim that it *doesn't work*, it is a pain in the
ass.
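To make the annoyance concrete, character-wise iteration against that API
looks roughly like the sketch below (PositionAfter() and GetTextRange() are
actual StyledTextCtrl methods; the helper function itself is only
illustrative):
def chars_between(stc, start, end):
    # Yield (byte_position, character) pairs between two byte positions.
    # pos + 1 is NOT the next character when the internal utf-8 data is
    # multi-byte, so every step has to go through the control.
    pos = start
    while pos < end:
        nxt = stc.PositionAfter(pos)
        yield pos, stc.GetTextRange(pos, nxt)
        pos = nxt
With a fixed-width representation and plain character indexing, all of that
collapses back to s[i] and s[i:j].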
> Using only utf-8 would be simpler than three distinct representations.
> And if memory usage is an issue (which it seems to be, albeit in a
> vague way), we could make a custom encoding that's even simpler and
> more space efficient than utf-8.
One of the reasons I've been pushing for the three representations is
that each string gets an (arguably) optimal representation for its
content.
> > > * Grapheme clusters, words, lines, other groupings, do we need/want
> > > ways to slice based on them too?
> >
> > No.
>
> Can you explain your reasoning?
We can already split based on words, lines, etc., using split() and
re.split(). Building additional functionality for text.word[4] seems to
be a waste of time.
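For instance, nothing beyond the existing methods is needed (re.split()
shown purely for illustration):
import re
line = "the quick brown fox jumps"
words = line.split()                     # ['the', 'quick', 'brown', 'fox', 'jumps']
fifth = words[4]                         # what a hypothetical text.word[4] would return
parts = re.split(r'[\r\n]+', "one\ntwo\r\nthree")   # line splitting, any convention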
> > > * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> > > want to support them? Now would be the time.
> >
> > This would imply a tree-based string, which Guido has specifically
> > stated would not happen. Never mind that it would be a beast to
> > implement and maintain or that it would exclude the possibility for
> > offering the single-segment buffer interface, without reprocessing.
>
> The only reference I found was this:
> http://mail.python.org/pipermail/python-3000/2006-August/003334.html
>
> I interpret that as him being very sceptical, not an outright refusal.
>
> Allowing external code to operate on a python string in-place seems
> tenuous at best. Even with three types (Latin-1, UCS-2, UCS-4) you
> would need to automatically copy and convert if the wrong type is
> given.
The only benefits that utf-8 gains over any other internal
representation are that it is an arguably minimal-sized representation,
and that it is commonly used among other C libraries.
The benefit of using the three internal representations is primarily
simplicity: when manipulating any one of the three representations, you
know that the value at offset X is the code point of character X in the
string.
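As a rough sketch of the idea (illustrative Python only, not interpreter
internals; the function name is mine):
def narrowest_representation(s):
    # Pick the narrowest fixed-width storage that can hold every code point.
    widest = max(map(ord, s)) if s else 0
    if widest < 0x100:
        return 'latin-1', 1      # 1 byte per code point
    if widest < 0x10000:
        return 'utf-16-le', 2    # 2 bytes per code point (BMP-only strings)
    return 'utf-32-le', 4        # 4 bytes per code point

codec, width = narrowest_representation("naïve")     # ('latin-1', 1)
data = "naïve".encode(codec)
# Character X always starts at byte offset X * width:
assert data[2 * width:3 * width].decode(codec) == "ï"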
Further, with a slight change in how the single-segment buffer interface
is defined (so that it also returns the character width), C extensions
that want to deal with unicode strings in *native* format (due to
concerns about speed) could do so without having to worry about
reencoding, variable-width characters, etc.
You can get this same behavior by always using UTF-32 (aka UCS-4), but
at least 1/4 of the underlying data is always going to be nulls (code
points are limited to 0x0010ffff), and for many people (in Europe, the
US, and anywhere else with code points < 65536), 1/2 to 3/4 of the
underlying data is going to be nulls.
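A quick sanity check of those ratios (using the codecs available in
today's Python):
text = "Hello, world"                     # all code points below 256
u16 = text.encode('utf-16-le')
u32 = text.encode('utf-32-le')
print(len(text), len(u16), len(u32))      # 12 24 48
print(u16.count(0) / len(u16))            # 0.5  -> half the bytes are nulls
print(u32.count(0) / len(u32))            # 0.75 -> three quarters are nulls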
While I would imagine that people could deal with UTF-16 as an
underlying representation (from a data waste perspective), the potential
for varying-width characters in such an encoding is a pain in the ass
(like it is for UTF-8).
Regardless of our choice, *some platform* is going to be angry. Why?
GTK takes utf-8 encoded strings. (I don't know what Qt or Linux system
calls take.) Windows takes utf-16. Whatever the underlying
representation, *someone* is going to have to recode when dealing with
GUI or OS-level operations.
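That is, whatever we store internally, the boundary code ends up doing
something like this (a trivial illustration, nothing more):
s = "some text headed for a GUI"
data_for_gtk = s.encode('utf-8')         # GTK wants utf-8 bytes
data_for_win32 = s.encode('utf-16-le')   # Win32 wide-character APIs want utf-16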
- Josiah