[Python-3000] How will unicode get used?

Thu Sep 21 02:29:32 CEST 2006

"Adam Olsen" <rhamph at gmail.com> wrote:
> On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> >
> > "Adam Olsen" <rhamph at gmail.com> wrote:
> > >
> > > On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> > > >
> > > > "Adam Olsen" <rhamph at gmail.com> wrote:
> 
> [snip token stuff]
> 
> Withdrawn.  Blake Winston pointed me to some problems in private as well.
> 
> 
> > If I can't slice based on character index, then we end up with a similar
> > situation that the wxPython StyledTextCtrl runs into right now: the
> > content is encoded via utf-8 internally, so users have to use the fairly
> > annoying PositionBefore(pos) and PositionAfter(pos) methods to discover
> > where characters start/end.  While it is possible to handle everything
> > this way, it is *damn annoying*, and some users have gone so far as to
> > say that it *doesn't work* for Europeans.
> >
> > While I won't make the claim that it *doesn't work*, it is a pain in the
> > ass.
> 
> I'm going to agree with you.  That's also why I'm going to assume
> Guido meant to use Code Points, not Code Units (which would be bytes
> in the case of UTF-8).

> > > Using only utf-8 would be simpler than three distinct representations.
> > >  And if memory usage is an issue (which it seems to be, albeit in a
> > > vague way), we could make a custom encoding that's even simpler and
> > > more space efficient than utf-8.
> >
> > One of the reasons I've been pushing for the 3 representations is
> > because it is (arguably) optimal for any particular string.
> 
> It bothers me that adding a single character would cause it to double
> or quadruple in size.  May be the best compromise though.

Ahh, but the crucial observation is that the string would have been
two or four times as large initially.

> I'm not entierly convinced, but I'll leave it for now.  Maybe it'll be
> a 3.1 feature.

I'll just say, "you ain't gonna need it".  Why?  In my experience, I
rarely, if ever, say "give me the ith word" or "give me the ith line". 
I really do, "give me the first ..., and the remaing ...".  With
partition (with or without views), you can do these things quite easily.

> > The benefits gained by using the three internal representations are
> > primarily from a simplicity standpoint.  That is to say, when
> > manipulating any one of the three representations, you know that the
> > value at offset X represents the code point of character X in the string.
> >
> > Further, with a slight change in how the single-segment buffer interface
> > is defined (returns the width of the character), C extensions that want
> > to deal with unicode strings in *native* format (due to concerns about
> > speed), could do so without having to worry about reencoding,
> > variable-width characters, etc.
> 
> Is it really worthwhile if there's three different formats they'd have
> to handle?

It would depend, but any application that currently handles both utf-16
and UCS-4 builds of Python and unicode strings would require slight
modification to handle Latin-1, and could be simplified to handle UCS-2
instead of UTF-16.

> Indeed, it seems like all our options are lose-lose.
> 
> Just to summarize, our requirements are:
> * Full unicode range (0 through 0x10FFFF)
> * Constant-time slicing using integer offsets
> * Basic unit is a Code Point
> * Continuous in memory
> 
> The best idea I've had so far for making UTF-8 have constant-time
> sliving is to use a two level table, with the second level having one
> byte per code point.  However, that brings up the minimum size to
> (more than) 2 bytes per code point, ruining any space advantage that
> utf-8 had.

(I'm not advocating the following, just expressing that it could be done)

Another way of doing it would be to have the underlying string in
utf-8 (or even 16), but layer a specially crafted tree structure over
the top of it, sized in a particular manner so that (bytes used)/(bytes
required) is somewhat small.  It could offer log-time discovery of
offsets (for slicing) and memory-continuous representation. This tree
could also be generated after some k slices, avoiding the overhead of
tree creation unless we have determined it to be reasonable to ammortize.

If one chooses one node per klogn characters, then we get O(n) tree
construction time with the same big-O index discovery time, using ~24*n/
(klogn) additional bytes/string of length n (assuming a 64-bit Python).
Choose k=24 (or k=12 on a 32 bit Python), and we get the used/required
ratio of 1 + logn/n.  Not bad.

> I think the only viable options (without changing the requirements)
> are straight UCS-4 or three-way (Latin-1/UCS-2/UCS-4).  The size
> variability of three-way doesn't seem so important when it's only
> competitor is straight UCS-4.
> 
> The deciding factor is what we want to expose to third-party interfaces.
> 
> Sane interface (not bytes/code units), good efficiency, C-accessible: pick two.

I would say that both options are C-accessable, though perhaps not
optimally in either case.  Note that we can always recode for the
third-party interfaces; that's what is already done for PyGTK, wxPython
on linux, 32-bit characters on Windows, etc.

 - Josiah