[Python-3000] How will unicode get used?

Mon Sep 25 16:33:26 CEST 2006

On 9/25/06, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> gabor <gabor at nekomancer.net> wrote:
> > Martin v. Löwis wrote:
> > > Gábor Farkas schrieb:

> > >> should he write his own slicing/whatever functions to get consistent
> > >> behaviour on linux/windows?

> > now, for this to behave correctly on non-bmp characters, i will need to
> > write a custom function, correct?

As David Hopwood pointed out, to be fully correct you already need a
custom function even for BMP characters, because of decomposed
characters.  (Example:  representing a c-cedilla as a c followed by a
combining cedilla, rather than as a single code point.)  Separating
those two would be wrong.  Counting them as two characters for slicing
purposes would usually be wrong.
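
A rough sketch of the c-cedilla case, assuming Python 3 and the standard
unicodedata module.  The graphemes() helper is hypothetical and only
attaches combining marks to the preceding code point; it is not a full
grapheme-cluster implementation.

```python
import unicodedata

composed = "\u00e7"      # LATIN SMALL LETTER C WITH CEDILLA, one code point
decomposed = "c\u0327"   # 'c' followed by COMBINING CEDILLA, two code points

def graphemes(s):
    """Group each code point with the combining marks that follow it.

    Hypothetical helper: it only checks the canonical combining class,
    so it is a simplification of real grapheme segmentation.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to its base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

word = "fa" + decomposed + "ade"
naive = word[:3]                          # cuts between 'c' and the cedilla
careful = "".join(graphemes(word)[:3])    # keeps 'c' and the cedilla together
```

Here naive ends in a bare "c" and leaves the cedilla behind, while careful
keeps the pair intact, which is the kind of custom function meant above.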

Even 32-bit representations are permitted to use surrogate pairs; it
just doesn't often make sense.

These are problems inherent to Unicode (or at least to non-normalized
Unicode).  Different Python implementations may expose the problem in
different places, but the problem is always there.
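
To illustrate how the representation decides where the problem shows up,
here is a small sketch, assuming Python 3.3 or later (where len() counts
code points).  The helper names utf16_units() and to_surrogates() are made
up for this example.

```python
def utf16_units(s):
    """Return the UTF-16 code units that encode s, as a list of ints."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

clef = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, a non-BMP character

code_points = len(clef)            # 1 when indexing by code point
code_units = utf16_units(clef)     # two units: [0xD834, 0xDD1E]
```

An implementation that indexes by UTF-16 code unit sees two items here,
one that indexes by code point sees one, and neither view helps with
decomposed characters.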

We *could* specify that slicing and indexing act as though the
underlying representation were normalized (and this would typically
require normalization as part of construction), but I'm not sure that
is the right answer.  Even if it were trivial, there are reasons not
to normalize.
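
As a sketch of what normalizing at construction would do, assuming Python 3
and unicodedata.normalize() with the NFC form (the choice of form is an
assumption here, not something settled in this thread):

```python
import unicodedata

decomposed = "c\u0327"                             # 'c' + COMBINING CEDILLA
composed = unicodedata.normalize("NFC", decomposed)
# composed is now the single code point U+00E7, so slicing cannot split it.

# One possible reason not to normalize silently: it is not lossless.
angstrom = "\u212b"                                # ANGSTROM SIGN
normalized = unicodedata.normalize("NFC", angstrom)
# normalized is U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE), so the
# original code point cannot be recovered from the normalized string.
```

The second half shows just one reason a caller might object: the string
that comes out is not the string that went in.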

> It is important, arguably one of the most important pieces.  But there
> are three parts; 1) code points not currently defined within the unicode
> spec, but who have specific encodings (based on the code point value), 2)
> in the case of UTF-16 representations, Python's handling of characters >
> 65535, 3) surrogates.

> I believe #1 is handled "correctly" today, Martin sounds like he wants
> #2 fixed for Py3k (I don't believe anyone *doesn't* want it fixed), and
> #3 could be fixed while fixing #2 with a little more work (if desired).

You also left out (4), decomposed characters, which is a more complex
version of surrogates.

Guido just stated that #2 is intentional, though he didn't pronounce
that it should stay that way.  There are sound arguments both ways.
In particular, fixing it without fixing decomposed characters might
incur the cost without the benefit.

-jJ