[Python-3000] How will unicode get used?
rhamph at gmail.com
Thu Sep 21 00:52:38 CEST 2006
On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> "Adam Olsen" <rhamph at gmail.com> wrote:
> > On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> > >
> > > "Adam Olsen" <rhamph at gmail.com> wrote:
[snip token stuff]
Withdrawn. Blake Winston pointed me to some problems in private as well.
> If I can't slice based on character index, then we end up with a similar
> situation that the wxPython StyledTextCtrl runs into right now: the
> content is encoded via utf-8 internally, so users have to use the fairly
> annoying PositionBefore(pos) and PositionAfter(pos) methods to discover
> where characters start/end. While it is possible to handle everything
> this way, it is *damn annoying*, and some users have gone so far as to
> say that it *doesn't work* for Europeans.
> While I won't make the claim that it *doesn't work*, it is a pain in the
> ass to deal with.
I'm going to agree with you. That's also why I'm going to assume
Guido meant to use Code Points, not Code Units (which would be bytes
in the case of UTF-8).
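To make the distinction concrete, here's a small Python sketch of why code unit (byte) offsets and code point offsets diverge under UTF-8:

```python
# Code points vs. UTF-8 code units: the same string, two indexings.
s = "naïve"            # 5 code points; 'ï' is U+00EF
b = s.encode("utf-8")  # 6 bytes, because 'ï' takes 2 UTF-8 code units

assert len(s) == 5     # code points
assert len(b) == 6     # code units (bytes)

# Indexing by code point gives what users expect:
assert s[2] == "ï"
# Indexing the bytes at the same offset lands mid-character:
assert b[2:3] != "ï".encode("utf-8")
```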
> > Using only utf-8 would be simpler than three distinct representations.
> > And if memory usage is an issue (which it seems to be, albeit in a
> > vague way), we could make a custom encoding that's even simpler and
> > more space efficient than utf-8.
> One of the reasons I've been pushing for the 3 representations is
> because it is (arguably) optimal for any particular string.
It bothers me that adding a single character would cause it to double
or quadruple in size. May be the best compromise though.
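A quick sketch of that doubling/quadrupling (the helpers here are hypothetical, just to illustrate the three-way scheme's size behavior):

```python
# Hypothetical sketch: pick the narrowest of the three representations
# (Latin-1 / UCS-2 / UCS-4) that holds every code point in the string.
def char_width(s):
    m = max(map(ord, s), default=0)
    if m < 0x100:
        return 1   # Latin-1
    if m < 0x10000:
        return 2   # UCS-2 range
    return 4       # UCS-4

def storage_bytes(s):
    return len(s) * char_width(s)

ascii_text = "x" * 100
assert storage_bytes(ascii_text) == 100
# Appending a single astral-plane character quadruples the buffer:
assert storage_bytes(ascii_text + "\U0001D11E") == 404
```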
> > > > * Grapheme clusters, words, lines, other groupings, do we need/want
> > > > ways to slice based on them too?
> > >
> > > No.
> > Can you explain your reasoning?
> We can already split based on words, lines, etc., using split() and
> re.split(). Building additional functionality for text.word seems to
> be a waste of time.
I'm not entirely convinced, but I'll leave it for now. Maybe it'll be
a 3.1 feature.
> The benefits gained by using the three internal representations are
> primarily from a simplicity standpoint. That is to say, when
> manipulating any one of the three representations, you know that the
> value at offset X represents the code point of character X in the string.
> Further, with a slight change in how the single-segment buffer interface
> is defined (returns the width of the character), C extensions that want
> to deal with unicode strings in *native* format (due to concerns about
> speed), could do so without having to worry about reencoding,
> variable-width characters, etc.
Is it really worthwhile if there are three different formats they'd have
to deal with?
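For what it's worth, the fixed-width property being described can be sketched in a few lines (using the array module purely as a stand-in for a UCS-4 buffer):

```python
# Sketch of the fixed-width idea: one 32-bit unit per code point, so
# "the value at offset X is the code point of character X".
from array import array

s = "héllo\U0001D11E"
ucs4 = array("I", (ord(c) for c in s))  # one slot per code point

assert ucs4.itemsize >= 4     # 'I' is at least 32 bits on common platforms
assert ucs4[1] == 0xE9        # 'é' directly, no decoding walk
assert ucs4[5] == 0x1D11E     # even astral code points are one slot
```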
> You can get this same behavior by always using UTF-32 (aka UCS-4), but
> at least 1/4 of the underlying data is always going to be nulls (code
> points are limited to 0x0010ffff), and for many people (in Europe, the
> US, and anywhere else with code points < 65536), 1/2 to 3/4 of the
> underlying data is going to be nulls.
> While I would imagine that people could deal with UTF-16 as an
> underlying representation (from a data waste perspective), the potential
> for varying-width characters in such an encoding is a pain in the ass
> (like it is for UTF-8).
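The UTF-16 variable-width case is easy to demonstrate: any code point above U+FFFF becomes a surrogate pair, i.e. two 16-bit code units.

```python
# UTF-16's variable width in action: U+1D11E (musical G clef) needs a
# surrogate pair, so one code point becomes two UTF-16 code units.
clef = "\U0001D11E"
utf16 = clef.encode("utf-16-le")

assert len(clef) == 1         # one code point
assert len(utf16) // 2 == 2   # two 16-bit code units

# The two units are a high/low surrogate pair:
hi = int.from_bytes(utf16[0:2], "little")
lo = int.from_bytes(utf16[2:4], "little")
assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
```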
> Regardless of our choice, *some platform* is going to be angry. Why?
> GTK takes utf-8 encoded strings. (I don't know what Qt or linux system
> calls take) Windows takes utf-16. Whatever underlying representation,
> *someone* is going to have to recode when dealing with GUI or OS-level
> operations.
Indeed, it seems like all our options are lose-lose.
Just to summarize, our requirements are:
* Full unicode range (0 through 0x10FFFF)
* Constant-time slicing using integer offsets
* Basic unit is a Code Point
* Continuous in memory
The best idea I've had so far for making UTF-8 have constant-time
slicing is to use a two-level table, with the second level having one
byte per code point. However, that brings the minimum size up to
(more than) 2 bytes per code point, ruining any space advantage UTF-8
had. UTF-16 is in the same boat, but at (more than) 3 bytes per code point.
I think the only viable options (without changing the requirements)
are straight UCS-4 or three-way (Latin-1/UCS-2/UCS-4). The size
variability of three-way doesn't seem so important when its only
competitor is straight UCS-4.
The deciding factor is what we want to expose to third-party interfaces.
Sane interface (not bytes/code units), good efficiency, C-accessible: pick two.
Adam Olsen, aka Rhamphoryncus