[Python-3000] How will unicode get used?

Adam Olsen rhamph at gmail.com
Thu Sep 21 00:52:38 CEST 2006


On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Adam Olsen" <rhamph at gmail.com> wrote:
> >
> > On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> > >
> > > "Adam Olsen" <rhamph at gmail.com> wrote:

[snip token stuff]

Withdrawn.  Blake Winston also pointed out some problems to me in private.


> If I can't slice based on character index, then we end up with a similar
> situation that the wxPython StyledTextCtrl runs into right now: the
> content is encoded via utf-8 internally, so users have to use the fairly
> annoying PositionBefore(pos) and PositionAfter(pos) methods to discover
> where characters start/end.  While it is possible to handle everything
> this way, it is *damn annoying*, and some users have gone so far as to
> say that it *doesn't work* for Europeans.
>
> While I won't make the claim that it *doesn't work*, it is a pain in the
> ass.

I'm going to agree with you.  That's also why I'm going to assume
Guido meant to use Code Points, not Code Units (which would be bytes
in the case of UTF-8).
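
To make the distinction concrete (a sketch in Py3k-style Python, where
string literals are unicode):

    s = "caf\u00e9"            # 4 code points; the last is U+00E9
    len(s)                     # -> 4 (code points)
    len(s.encode("utf-8"))     # -> 5 (code units, i.e. bytes)
    s[3]                       # slice by code point index
    s.encode("utf-8")[3:5]     # the same character by byte position --
                               # but only if you already know where it
                               # starts and ends

Indexing by code units is exactly the PositionBefore/PositionAfter
dance described above.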


> > Using only utf-8 would be simpler than three distinct representations.
> >  And if memory usage is an issue (which it seems to be, albeit in a
> > vague way), we could make a custom encoding that's even simpler and
> > more space efficient than utf-8.
>
> One of the reasons I've been pushing for the 3 representations is
> because it is (arguably) optimal for any particular string.

It bothers me that adding a single character could cause a string to
double or quadruple in size.  It may be the best compromise, though.
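
To illustrate with the proposed three-way scheme (hypothetical payload
sizes, ignoring object overhead):

    s = "x" * 1000       # fits in Latin-1:                    1000 bytes
    s += "\u0394"        # one GREEK CAPITAL DELTA forces
                         # UCS-2: 1001 * 2 =                   2002 bytes
    s += "\U0001D11D"    # one astral code point forces
                         # UCS-4: 1002 * 4 =                   4008 bytes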


> > > > * Grapheme clusters, words, lines, other groupings, do we need/want
> > > > ways to slice based on them too?
> > >
> > > No.
> >
> > Can you explain your reasoning?
>
> We can already split based on words, lines, etc., using split() and
> re.split().  Building additional functionality for text.word[4] seems to
> be a waste of time.

I'm not entirely convinced, but I'll leave it for now.  Maybe it'll be
a 3.1 feature.
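
For reference, what we can already do today (sketch):

    import re
    text = "one two\nthree four"
    text.split()              # ['one', 'two', 'three', 'four'] -- words
    text.splitlines()         # ['one two', 'three four']       -- lines
    re.split(r"\W+", text)    # word-ish splitting via regex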


> The benefits gained by using the three internal representations are
> primarily from a simplicity standpoint.  That is to say, when
> manipulating any one of the three representations, you know that the
> value at offset X represents the code point of character X in the string.
>
> Further, with a slight change in how the single-segment buffer interface
> is defined (returns the width of the character), C extensions that want
> to deal with unicode strings in *native* format (due to concerns about
> speed), could do so without having to worry about reencoding,
> variable-width characters, etc.

Is it really worthwhile if there are three different formats they'd
have to handle?
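
To be fair, since all three formats are fixed-width, the per-format
handling might collapse into a single path parameterized on the width.
A rough sketch (Python standing in for the C; the names and the
little-endian struct formats are my invention):

    import struct

    _FMT = {1: "<B", 2: "<H", 4: "<I"}   # width in bytes -> struct format

    def code_point_at(buf, width, index):
        # Fixed-width data: code point i lives at byte offset i * width.
        return struct.unpack_from(_FMT[width], buf, index * width)[0]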


> You can get this same behavior by always using UTF-32 (aka UCS-4), but
> at least 1/4 of the underlying data is always going to be nulls (code
> points are limited to 0x0010ffff), and for many people (in Europe, the
> US, and anywhere else with code points < 65536), 1/2 to 3/4 of the
> underlying data is going to be nulls.
>
> While I would imagine that people could deal with UTF-16 as an
> underlying representation (from a data waste perspective), the potential
> for varying-width characters in such an encoding is a pain in the ass
> (like it is for UTF-8).
>
> Regardless of our choice, *some platform* is going to be angry.  Why?
> GTK takes utf-8 encoded strings.  (I don't know what Qt or linux system
> calls take) Windows takes utf-16. Whatever underlying representation,
> *someone* is going to have to recode when dealing with GUI or OS-level
> operations.

Indeed, it seems like all our options are lose-lose.

Just to summarize, our requirements are:
* Full unicode range (0 through 0x10FFFF)
* Constant-time slicing using integer offsets
* Basic unit is a Code Point
* Contiguous in memory

The best idea I've had so far for making UTF-8 support constant-time
slicing is to use a two-level table, with the second level having one
byte per code point.  However, that raises the minimum size to (more
than) 2 bytes per code point, ruining any space advantage that utf-8
had.
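
A rough sketch of that two-level table (eager and pure-Python just to
show the idea; CHUNK, build_index and byte_offset are invented names,
and the input is Py3k-style bytes):

    CHUNK = 64   # code points per chunk; 63 * 4 = 252 still fits in a byte

    def build_index(utf8):
        # level1[i] = byte offset where chunk i starts;
        # level2[j] = byte offset of code point j within its own chunk.
        level1, level2, count = [], bytearray(), 0
        for pos, byte in enumerate(utf8):
            if byte & 0xC0 != 0x80:          # lead byte: a new code point
                if count % CHUNK == 0:       # first code point of a chunk
                    level1.append(pos)
                level2.append(pos - level1[-1])
                count += 1
        return level1, level2

    def byte_offset(level1, level2, i):
        # Constant time: two table lookups per character index.
        return level1[i // CHUNK] + level2[i]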

UTF-16 is in the same boat, but it's (more than) 3 bytes per code point.

I think the only viable options (without changing the requirements)
are straight UCS-4 or three-way (Latin-1/UCS-2/UCS-4).  The size
variability of three-way doesn't seem so important when its only
competitor is straight UCS-4.
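
At least the selection rule for three-way is trivial (sketch; the
widest code point decides the representation for the whole string):

    def needed_width(s):
        highest = max(map(ord, s)) if s else 0
        if highest < 0x100:
            return 1    # Latin-1
        elif highest < 0x10000:
            return 2    # UCS-2
        else:
            return 4    # UCS-4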

The deciding factor is what we want to expose to third-party interfaces.

Sane interface (not bytes/code units), good efficiency, C-accessible: pick two.

-- 
Adam Olsen, aka Rhamphoryncus

