python-unicode doesn't support >65535 symbols?

Rainer Deyke rainerd at eldwood.com
Thu Nov 27 13:36:00 EST 2003


Andrew Clover wrote:
> gabor <gabor at z10n.net> wrote:
>
>> so text[3] (which should be \U00010330),
>> was split to 2 16bit values (text[3] and text[4]).
>
> The default encoding for native Unicode strings in Python is UTF-16,
> which cannot hold the extended planes beyond 0xFFFF in a single
> character.

That's not quite right.  UTF-16 encodes each Unicode character as either a
single 16-bit value or a pair of 16-bit values.  However, one character is
still one character.
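
To make the pairing concrete, here is a minimal sketch of how UTF-16
represents the character from the original question.  It assumes a current
Python 3 interpreter, and the helper name utf16_code_units is made up for
this illustration:

```python
import struct

def utf16_code_units(ch):
    """Return the 16-bit code units that UTF-16 uses for the string ch."""
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

# U+10330 lies outside the 16-bit range, so it takes two code units.
units = utf16_code_units("\U00010330")
print([hex(u) for u in units])
```

The two values are what showed up as text[3] and text[4]; together they
still stand for the single character U+10330.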

Python makes the mistake of exposing the internal representation (16-bit
code units) instead of the logical value (characters) of unicode objects,
which is why text[3] above holds only half of the character.  This means
that, aside from space optimization, unicode objects have no advantage over
UTF-8 encoded plain strings for storing Unicode text.
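
As a rough sketch of the difference between the two views, the function
below turns a sequence of 16-bit code units back into whole characters.
The name code_points and the sample text ("abc" followed by U+10330) are
assumptions for illustration only.  It works on a plain list of integers
because Python 3.3 and later already index str objects by character:

```python
def code_points(units):
    """Combine UTF-16 code units into whole code points."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        # A high surrogate followed by a low surrogate encodes one character.
        if (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            low = units[i + 1]
            result.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            result.append(unit)
            i += 1
    return result

# Hypothetical sample: "abc" followed by U+10330, stored as 16-bit units.
stored = [0x61, 0x62, 0x63, 0xD800, 0xDF30]
logical = code_points(stored)
print(len(stored), len(logical))
print(hex(logical[3]))
```

Indexing the stored list at position 3 gives half of a pair; indexing the
logical list at position 3 gives the character itself.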


-- 
Rainer Deyke - rainerd at eldwood.com - http://eldwood.com





