[Python-Dev] 2.2 Unicode questions
Andrew Kuchling
akuchlin@mems-exchange.org
Wed, 18 Jul 2001 21:55:46 -0400
I've written some text on Unicode for the 2.2 article, but I doubt
I actually understand what's going on. Can people who do understand
the recent Unicode work please take a look at the
following?
First, a short one: Mark Hammond's patch for supporting MBCS on
Windows. I trust everyone can handle a little bit of TeX markup?
% XXX is this explanation correct?
\item When presented with a Unicode filename on Windows, Python will
now correctly convert it to a string using the MBCS encoding.
Filenames on Windows are a case where Python's choice of ASCII as
the default encoding turns out to be an annoyance.
This patch also adds \samp{et} as a format sequence to
\cfunction{PyArg_ParseTuple}; \samp{et} takes both a parameter and
an encoding name. If the parameter turns out to be a Unicode
string, it is converted to the given encoding; if it's an 8-bit
string, it is left alone, on the assumption that it is already in
the desired encoding. (This differs from the \samp{es} format
character, which assumes that 8-bit strings are in Python's default
ASCII encoding and converts them to the specified new encoding.)
(Contributed by Mark Hammond with assistance from Marc-Andr\'e
Lemburg.)
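For anyone checking the wording above, here is a rough Python sketch of the difference between the two format characters as the text describes it. The helper names convert_et and convert_es are made up for illustration; the real conversion happens in C inside PyArg_ParseTuple, and this only mimics the claimed behaviour. It assumes a modern Python where str is the Unicode type and bytes is the 8-bit string type.

```python
def convert_et(value, encoding):
    """Mimic the described 'et' behaviour (hypothetical helper).

    Unicode strings are encoded; 8-bit strings pass through untouched,
    on the assumption that they are already in the target encoding.
    """
    if isinstance(value, str):
        return value.encode(encoding)
    return bytes(value)


def convert_es(value, encoding):
    """Mimic the described 'es' behaviour (hypothetical helper).

    Unicode strings are encoded; 8-bit strings are first decoded as
    ASCII and then re-encoded into the target encoding.
    """
    if isinstance(value, str):
        return value.encode(encoding)
    return bytes(value).decode("ascii").encode(encoding)
```

If the description is right, an 8-bit filename containing non-ASCII bytes would pass through convert_et unchanged but make convert_es fail with a decoding error, which is the annoyance the patch is meant to avoid.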
Second, the --enable-unicode changes:
%======================================================================
\section{Unicode Changes}
Python's Unicode support has been enhanced a bit in 2.2. Unicode
strings are usually stored as UCS-2, that is, as 16-bit unsigned integers.
Python 2.2 can also be compiled to use UCS-4, with 32-bit unsigned
integers, as its internal encoding by supplying
\longprogramopt{enable-unicode=ucs4} to the configure script. When
built to use UCS-4, in theory Python could handle Unicode characters
from U-00000000 to U-7FFFFFFF. Using UCS-4 internally is
a necessary step toward that, but it's not the only one, and in Python
2.2alpha1 the work isn't complete yet. For example, the
\function{unichr()} function still only accepts values from 0 to
65535, and there's no \code{\e U} notation for embedding characters
greater than 65535 in a Unicode string literal. All this is the
province of the still-unimplemented PEP 261, ``Support for `wide'
Unicode characters''; consult it for further details, and please offer
comments and suggestions on the proposal it describes.
% ... section on decode() deleted; on firmer ground there...
\method{encode()} and \method{decode()} were implemented by
Marc-Andr\'e Lemburg. The changes to support using UCS-4 internally
were implemented by Fredrik Lundh and Martin von L\"owis.
\begin{seealso}
\seepep{261}{Support for `wide' Unicode characters}{PEP written by
Paul Prescod. Not yet accepted or fully implemented.}
\end{seealso}
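To make the 16-bit versus 32-bit distinction in the Unicode Changes section concrete, here is a small Python sketch of why values above 65535 need the wider storage. The names storage_needed and pack_code_point are invented for this illustration and do not correspond to anything in the interpreter; the sketch only shows the arithmetic of fitting a character value into one fixed-width unsigned integer.

```python
import struct

UCS2_MAX = 0xFFFF        # largest value a 16-bit unsigned integer can hold
UCS4_MAX = 0x7FFFFFFF    # upper bound quoted in the text for UCS-4


def storage_needed(code_point):
    """Return which storage width can hold code_point in a single unit."""
    if not 0 <= code_point <= UCS4_MAX:
        raise ValueError("outside the UCS-4 range")
    return "ucs2" if code_point <= UCS2_MAX else "ucs4"


def pack_code_point(code_point, build):
    """Pack one character value as a little-endian unsigned integer.

    build is "ucs2" (16-bit unit) or "ucs4" (32-bit unit); struct.pack
    raises struct.error when the value does not fit the chosen width.
    """
    fmt = "<H" if build == "ucs2" else "<I"
    return struct.pack(fmt, code_point)
```

A value such as 0x10000 packs fine with the "ucs4" width but is rejected with the "ucs2" width, which is the same limit that keeps unichr() at 0 to 65535 in the text above.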
Corrections? Thanks in advance...
--amk