[Python-Dev] 2.2 Unicode questions
Guido van Rossum
guido@digicool.com
Thu, 19 Jul 2001 08:10:08 -0400
> First, a short one, Mark Hammond's patch for supporting MBCS on
> Windows. I trust everyone can handle a little bit of TeX markup?
>
> % XXX is this explanation correct?
> \item When presented with a Unicode filename on Windows, Python will
> now correctly convert it to a string using the MBCS encoding.
> Filenames on Windows are a case where Python's choice of ASCII as
> the default encoding turns out to be an annoyance.
>
> This patch also adds \samp{et} as a format sequence to
> \cfunction{PyArg_ParseTuple}; \samp{et} takes both a parameter and
> an encoding name, and converts it to the given encoding if the
> parameter turns out to be a Unicode string, or leaves it alone if
> it's an 8-bit string, assuming it to already be in the desired
> encoding. (This differs from the \samp{es} format character, which
> assumes that 8-bit strings are in Python's default ASCII encoding
> and converts them to the specified new encoding.)
>
> (Contributed by Mark Hammond with assistance from Marc-Andr\'e
> Lemburg.)
I learned something here, so I hope this is correct. :-)
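(For readers who'd like the et/es distinction spelled out: here's a
rough sketch of the two behaviors, written in modern Python purely for
illustration -- the real thing lives in C inside PyArg_ParseTuple, and
the function names here are made up. "et" trusts byte strings; "es"
re-encodes them from ASCII.)

```python
def convert_et(value, encoding):
    """Sketch of the 'et' behavior: convert only unicode strings."""
    if isinstance(value, bytes):
        return value  # assumed to already be in the desired encoding
    return value.encode(encoding)

def convert_es(value, encoding):
    """Sketch of the 'es' behavior: byte strings are assumed ASCII."""
    if isinstance(value, bytes):
        value = value.decode('ascii')  # default-encoding assumption
    return value.encode(encoding)
```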
> Second, the --enable-unicode changes:
>
> %======================================================================
> \section{Unicode Changes}
>
> Python's Unicode support has been enhanced a bit in 2.2. Unicode
> strings are usually stored as UCS-2, as 16-bit unsigned integers.
> Python 2.2 can also be compiled to use UCS-4, 32-bit unsigned
> integers, as its internal encoding by supplying
> \longprogramopt{enable-unicode=ucs4} to the configure script. When
> built to use UCS-4, in theory Python could handle Unicode characters
> from U-00000000 to U-7FFFFFFF.
I think the Unicode folks use U+, not U-, and the largest Unicode
character is "only" U+10FFFF. (Never mind that the data type can
handle larger values.)
> Being able to use UCS-4 internally is
> a necessary step to do that, but it's not the only step, and in Python
> 2.2alpha1 the work isn't complete yet. For example, the
> \function{unichr()} function still only accepts values from 0 to
> 65535,
Untrue: it supports range(0x110000) (in UCS-2 mode this returns a
surrogate pair). Now, maybe that's not what it *should* do...
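(For the curious, the surrogate pair a narrow build hands back for a
large code point is just the UTF-16 split; a small sketch of the
arithmetic -- the helper name is mine, not anything in the stdlib:)

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    assert 0x10000 <= code_point <= 0x10FFFF
    offset = code_point - 0x10000      # 20 bits of payload
    high = 0xD800 + (offset >> 10)     # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)    # low (trail) surrogate
    return high, low
```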
> and there's no \code{\e U} notation for embedding characters
> greater than 65535 in a Unicode string literal.
Not true either -- correct \U has been part of Python since 2.0. It
does the same thing as unichr() described above.
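(A quick check of the \U escape -- run here under a wide build, where
the literal yields a single character; on a narrow build you'd get the
two-character surrogate pair instead:)

```python
s = '\U00010000'  # code point above U+FFFF via the \U escape
# On a wide build this is one character:
assert len(s) == 1 and ord(s) == 0x10000
# Its UTF-16 encoding shows the surrogate pair a narrow build stores:
assert s.encode('utf-16-be') == b'\xd8\x00\xdc\x00'
```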
> All this is the
> province of the still-unimplemented PEP 261, ``Support for `wide'
> Unicode characters''; consult it for further details, and please offer
> comments and suggestions on the proposal it describes.
>
> % ... section on decode() deleted; on firmer ground there...
>
> \method{encode()} and \method{decode()} were implemented by
> Marc-Andr\'e Lemburg. The changes to support using UCS-4 internally
> were implemented by Fredrik Lundh and Martin von L\"owis.
>
> \begin{seealso}
>
> \seepep{261}{Support for `wide' Unicode characters}{PEP written by
> Paul Prescod. Not yet accepted or fully implemented.}
>
> \end{seealso}
>
> Corrections? Thanks in advance...
If I were you, I would make sure that Marc-Andre and Martin agree
with me before adopting my comments above...
And thank *you* for doing this very useful write-up again! (I'm doing
my part by writing up the types/class unification thing -- now mostly
complete at http://www.python.org/2.2/descrintro.html.)
--Guido van Rossum (home page: http://www.python.org/~guido/)