python-unicode doesn't support >65535 symbols?
and-google at doxdesk.com
Thu Nov 27 18:46:51 CET 2003
gabor <gabor at z10n.net> wrote:
> so text (which should be \U00010330),
> was split to 2 16bit values (text and text).
The default encoding for native Unicode strings in Python is UTF-16, which
cannot hold code points from the extended planes beyond 0xFFFF in a single
character. Instead, it uses two 'surrogate' characters. Bit of a nasty hack,
but that's what Unicode does and there's nothing that can be done about it now.
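
For the curious, here is a rough sketch of the arithmetic behind the split, following the standard UTF-16 rules. The function name to_surrogates is made up for illustration and is not part of Python.

```python
def to_surrogates(code_point):
    """Split a code point above 0xFFFF into a UTF-16 (high, low) surrogate pair.

    Hypothetical helper for illustration only; not a Python built-in.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogates(0x10330)
print(hex(high), hex(low))  # prints: 0xd800 0xdf30
```

So \U00010330 ends up as the two 16-bit values 0xD800 and 0xDF30 on a UTF-16 build.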
Python can be compiled to use UCS-4 for native Unicode strings if you prefer.
Then every conceptual 'character' from the Unicode repertoire will be one
item long. It'll eat more memory too of course.
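
If you are not sure which kind of build you have, something like the following should tell you. This is only a sketch: build_kind and utf16_units are names invented here, and it assumes sys.maxunicode reports 0xFFFF on a UTF-16 build and 0x10FFFF otherwise. Later Python versions (3.3 and up) always report the larger value.

```python
import sys


def build_kind():
    """Return 'narrow' for a UTF-16 build, 'wide' otherwise (hypothetical helper)."""
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"


def utf16_units(s):
    """Count the 16-bit code units in the UTF-16 encoding of s (hypothetical helper)."""
    # utf-16-be is used so that no byte-order mark is added to the output.
    return len(s.encode("utf-16-be")) // 2


ch = u"\U00010330"
# len(ch) is 2 on a narrow build and 1 on a wide build;
# the UTF-16 encoding needs two code units either way.
print(build_kind(), len(ch), utf16_units(ch))
```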
> if the representation of 'text' is correct, why is the length wrong?
The representation of 'text' you are seeing is just the nicely-readable
version output by Python 2.2+. Despite the \U sequence, it is actually still
stored internally as two UTF-16 surrogates. You'll see this if you enter
'\U00012345' into Python 2.0 or 2.1, which don't use the \U form to output
mailto:and at doxdesk.com