[Python-Dev] 2.2 Unicode questions
M.-A. Lemburg
mal@lemburg.com
Fri, 20 Jul 2001 18:39:30 +0200
From Andrew's new pass:
"""
Python's Unicode support has been enhanced a bit in 2.2. Unicode
strings are usually stored as UTF-16, as 16-bit unsigned integers.
"""
Please replace UTF-16 with UCS-2. Python's Unicode implementation
does not handle UTF-16 in a surrogate-aware way; only some
of the codecs do.
As a result, the internal storage format of Python is more
precisely described as UCS-2.
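To illustrate the distinction, here is a minimal sketch written for a
current Python 3 interpreter rather than 2.2. The helper name
utf16_code_units is made up for this example. It shows that a character
above U+FFFF takes two 16-bit code units in UTF-16; storage that treats
each unit as a separate item, with no surrogate awareness, therefore
sees two items where UTF-16 defines one character.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units of text's UTF-16 (big-endian) encoding.

    Hypothetical helper for illustration only; not part of Python's API.
    """
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


# A character inside the 16-bit range is a single code unit.
bmp_units = utf16_code_units("A")

# U+10000 lies outside the 16-bit range and needs a surrogate pair.
astral_units = utf16_code_units("\U00010000")

# Counting 16-bit units without pairing up surrogates gives 2 here,
# even though the input is a single character.
unit_count = len(astral_units)
```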
"""
Python 2.2 can also be compiled to use UCS-4, 32-bit unsigned
integers, as its internal encoding by supplying
\longprogramopt{enable-unicode=ucs4} to the configure script. When
built to use UCS-4 (a ``wide Python''), the interpreter can natively
handle Unicode characters from U+000000 to U+110000. The range of
legal values for the \function{unichr()} function has been expanded;
it used to only accept values up to 65535, but in 2.2 will accept
values from 0 to 0x110000. Using a ``narrow Python'', an interpreter
compiled to use UTF-16, values greater than 65535 will result in
\function{unichr()} returning a string of length 2:
\begin{verbatim}
>>> s = unichr(65536)
>>> s
u'\ud800\udc00'
>>> len(s)
2
\end{verbatim}
"""
Same here: UTF-16 -> UCS-2. Note that I very much favour
removing the surrogate generation in unichr() for UCS-2 builds.
If I don't hear strong opposition, I'll disable this feature,
which was added as part of the UCS-4 patches. unichr()
will then raise an exception for values above 65535, as it did
in version 2.1.
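The two behaviours under discussion can be sketched as follows. This is
an illustration written for a current Python 3 interpreter, not the
actual 2.2 implementation. The names surrogate_pair and strict_unichr
are hypothetical, and the choice of ValueError is an assumption.

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) surrogate values.

    This is the standard UTF-16 arithmetic and corresponds to what the
    surrogate-generating unichr() does on a narrow build.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside U+10000..U+10FFFF")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def strict_unichr(code_point):
    """Sketch of the proposed narrow-build behaviour.

    Values that do not fit in 16 bits are rejected, so the result
    always has length 1. ValueError is an assumption in this sketch.
    """
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("unichr() arg not in range(0x10000)")
    return chr(code_point)


# 65536 (U+10000) maps to the pair shown in the quoted example.
high, low = surrogate_pair(65536)
```

With the strict variant, strict_unichr(65536) raises instead of
returning a two-element string, which keeps the length-1 invariant.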
"""
This possibly-confusing behaviour, breaking the intuitive invariant
that \function{chr()} and\function{unichr()} always return strings of
length 1, may be changed later in 2.2, depending on public reaction.
"""
Right.
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/