[Python-3000] How will unicode get used?

Sat Sep 23 21:17:04 CEST 2006

Josiah Carlson wrote:
> For me, having recently remembered what was in a unicode string, and
> verifying it by checking the source, the question in my mind is whether
> we want to stick with the same 2-representation implementation (default
> encoding and UTF-16 or UCS-4 depending on build), or go with more or
> fewer representations.

I would personally like to see a Python API that operates on code
points, with support for 17 planes. I also think that efficient indexing
is important.
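
As a rough illustration of the numbers involved: 17 planes of 65,536
code points each give the range U+0000 to U+10FFFF. The helper name
plane_of below is hypothetical, used only for this sketch, and is not
an existing or proposed API.

```python
PLANE_SIZE = 0x10000                           # 65,536 code points per plane
NUM_PLANES = 17                                # plane 0 (BMP) through plane 16
MAX_CODE_POINT = NUM_PLANES * PLANE_SIZE - 1   # 0x10FFFF


def plane_of(code_point):
    """Return the plane number (0..16) that a code point belongs to."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point: %r" % (code_point,))
    return code_point >> 16
```

Anything in plane 1 or above does not fit in a single 16-bit unit,
which is where the representation question comes from.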

> We can reduce memory consumption by using a single representation,
> whether it be constant or variable based on content, though in some
> cases (utf-16, ucs-4) we would lose the 'native' single-segment char (C
> char) buffer interface.

I don't think reducing memory consumption is that important on current
hardware. Java and .NET have demonstrated that you can build "real"
applications with that approach.

There are trade-offs, of course. I personally think the best trade-off
would be to have a two-byte representation, along with a flag telling
whether there are any surrogate pairs in the string. Indexing and
length would be constant-time if there are no surrogates, and linear
time if there are.
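
A minimal sketch of that trade-off, in Python for readability (a real
implementation would live in C). The class name TwoByteString and its
attributes are assumptions made for illustration only; negative
indexes and slicing are left out.

```python
class TwoByteString:
    """Hypothetical: 16-bit code units plus a 'has surrogates' flag."""

    def __init__(self, text):
        data = text.encode("utf-16-le")
        # One integer per 16-bit code unit.
        self.units = [int.from_bytes(data[i:i + 2], "little")
                      for i in range(0, len(data), 2)]
        # Computed once, when the string is created.
        self.has_surrogates = any(0xD800 <= u <= 0xDFFF for u in self.units)

    def __len__(self):
        if not self.has_surrogates:
            return len(self.units)          # constant time
        # Linear time: count every unit that is not a low surrogate.
        return sum(1 for u in self.units if not 0xDC00 <= u <= 0xDFFF)

    def __getitem__(self, index):
        if index < 0:
            raise IndexError("negative indexes are not handled in this sketch")
        if not self.has_surrogates:
            return chr(self.units[index])   # constant time
        # Linear time: walk from the start, skipping two units per pair.
        pos = 0
        for _ in range(index):
            if pos >= len(self.units):
                raise IndexError("index out of range")
            pos += 2 if 0xD800 <= self.units[pos] <= 0xDBFF else 1
        if pos >= len(self.units):
            raise IndexError("index out of range")
        unit = self.units[pos]
        if 0xD800 <= unit <= 0xDBFF:
            low = self.units[pos + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
        return chr(unit)
```

Strings that stay within the BMP never pay for the scan; only strings
that actually contain surrogate pairs take the linear path.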

> After re-reading the source, and thinking a bit more, about my only
> real concern is memory use of Python 3.x .  The current implementation
> works, so I'm +1 on keeping it "as is", but I'm also +0 on some
> implementation that would reduce memory use (with limited, if any
> slowdown) for as many platforms as possible, not any higher because
> changing the underlying implementation would be a PITA.

I think supporting multiple representations at run-time would really
be terrible. Any API of the "give me the data" kind would either have
to expose the choice of representations, or perform a copy. Either
alternative would produce many programming errors in extension modules.
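
To make the two alternatives concrete, here is a small sketch. The
class MultiRepString and its methods raw() and as_four_byte() are
hypothetical names invented for this example; they do not correspond
to any existing Python or C API.

```python
class MultiRepString:
    """Hypothetical string that picks a 1-, 2- or 4-byte layout."""

    def __init__(self, text):
        widest = max(map(ord, text), default=0)
        if widest < 0x100:
            self.kind, codec = 1, "latin-1"
        elif widest < 0x10000:
            self.kind, codec = 2, "utf-16-le"
        else:
            self.kind, codec = 4, "utf-32-le"
        self.data = text.encode(codec)

    def raw(self):
        # Alternative 1: no copy, but the caller must look at the
        # unit width and handle every possible layout correctly.
        return self.kind, memoryview(self.data)

    def as_four_byte(self):
        # Alternative 2: one fixed layout for the caller, at the cost
        # of building a widened copy on each call.
        k = self.kind
        return b"".join(
            int.from_bytes(self.data[i:i + k], "little").to_bytes(4, "little")
            for i in range(0, len(self.data), k))
```

An extension that calls raw() and assumes one particular width works
until it meets a string stored with another one, which is the kind of
error I mean.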

Regards,
Martin