[Python-Dev] 2.2 Unicode questions

Guido van Rossum guido@digicool.com
Thu, 19 Jul 2001 08:10:08 -0400

> First, a short one, Mark Hammond's patch for supporting MBCS on
> Windows.  I trust everyone can handle a little bit of TeX markup?
>   % XXX is this explanation correct?  
>   \item When presented with a Unicode filename on Windows, Python will
>   now correctly convert it to a string using the MBCS encoding.
>   Filenames on Windows are a case where Python's choice of ASCII as
>   the default encoding turns out to be an annoyance.  
>   This patch also adds \samp{et} as a format sequence to
>   \cfunction{PyArg_ParseTuple}; \samp{et} takes both a parameter and
>   an encoding name, and converts it to the given encoding if the
>   parameter turns out to be a Unicode string, or leaves it alone if
>   it's an 8-bit string, assuming it to already be in the desired
>   encoding.  (This differs from the \samp{es} format character, which
>   assumes that 8-bit strings are in Python's default ASCII encoding
>   and converts them to the specified new encoding.)
>   (Contributed by Mark Hammond with assistance from Marc-Andr\'e
>   Lemburg.)

I learned something here, so I hope this is correct. :-)
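The \samp{es}/\samp{et} distinction can be sketched in Python terms (a hedged model only, not the actual C implementation; convert_es and convert_et are hypothetical names, and bytes objects stand in for 2.x 8-bit strings):

```python
# Hypothetical Python model of PyArg_ParseTuple's "es" vs. "et" formats.
# Not real CPython API -- just an illustration of the conversion rules.

def convert_es(value, encoding):
    """'es': 8-bit strings are assumed to be in the default ASCII
    encoding, decoded, then re-encoded to the target encoding."""
    if isinstance(value, bytes):
        value = value.decode('ascii')   # the default-encoding assumption
    return value.encode(encoding)

def convert_et(value, encoding):
    """'et': Unicode strings are encoded, but 8-bit strings are passed
    through untouched, assumed to already be in the target encoding."""
    if isinstance(value, bytes):
        return value                    # left alone
    return value.encode(encoding)

# A Unicode filename is encoded either way:
assert convert_et(u'caf\xe9', 'latin-1') == b'caf\xe9'
# An 8-bit string passes through 'et' unchanged; 'es' would instead
# try (and here fail) to decode it as ASCII first.
assert convert_et(b'caf\xe9', 'latin-1') == b'caf\xe9'
```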

> Second, the --enable-unicode changes:
> %======================================================================
> \section{Unicode Changes}
> Python's Unicode support has been enhanced a bit in 2.2.  Unicode
> strings are usually stored as UCS-2, as 16-bit unsigned integers.
> Python 2.2 can also be compiled to use UCS-4, 32-bit unsigned
> integers, as its internal encoding by supplying
> \longprogramopt{enable-unicode=ucs4} to the configure script.  When
> built to use UCS-4, in theory Python could handle Unicode characters
> from U-00000000 to U-7FFFFFFF.

I think the Unicode folks use U+, not U-, and the largest Unicode
character is "only" U+10FFFF.  (Never mind that the data type can
handle larger values.)
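In modern Python 3 terms (where the 2.x unichr() has become chr()), the ceiling is enforced regardless of what the underlying integer type could hold:

```python
# Sketch in Python 3, where unichr() corresponds to chr(): code points
# are capped at U+10FFFF even though a 32-bit integer could go higher.
assert chr(0x10FFFF) == '\U0010FFFF'   # largest legal code point
try:
    chr(0x110000)                      # past the Unicode ceiling
except ValueError:
    pass                               # rejected, as expected
else:
    raise AssertionError("chr() accepted a value beyond U+10FFFF")
```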

> Being able to use UCS-4 internally is
> a necessary step to do that, but it's not the only step, and in Python
> 2.2alpha1 the work isn't complete yet.  For example, the
> \function{unichr()} function still only accepts values from 0 to
> 65535,

Untrue: it supports range(0x110000) (in UCS-2 mode this returns a
surrogate pair).  Now, maybe that's not what it *should* do...
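What a narrow (UCS-2) build does for a code point above U+FFFF can be sketched as the standard UTF-16 surrogate arithmetic (surrogate_pair is a hypothetical helper name, written here in Python 3 syntax):

```python
# Hedged sketch of UCS-2 (narrow build) behavior: a code point above
# U+FFFF is represented as a UTF-16 surrogate pair of two 16-bit units.
def surrogate_pair(cp):
    """Return the (high, low) surrogate code units for cp."""
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

# U+10400 (DESERET CAPITAL LETTER LONG I)
assert surrogate_pair(0x10400) == (0xD801, 0xDC00)
# Matches what the UTF-16 codec produces for the same character:
assert chr(0x10400).encode('utf-16-be') == b'\xd8\x01\xdc\x00'
```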

> and there's no \code{\e U} notation for embedding characters
> greater than 65535 in a Unicode string literal.

Not true either -- \U notation has been part of Python since 2.0.  It
does the same thing as unichr() described above.
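That equivalence is easy to check (shown here in Python 3 syntax, where unichr() corresponds to chr()):

```python
# \U takes exactly eight hex digits and yields the same character that
# chr() (2.x unichr()) produces for the same code point.
assert '\U00010400' == chr(0x10400)
assert len('\U00010400') == 1   # a single code point (on wide builds / Py3)
```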

> All this is the
> province of the still-unimplemented PEP 261, ``Support for `wide'
> Unicode characters''; consult it for further details, and please offer
> comments and suggestions on the proposal it describes.
> % ... section on decode() deleted; on firmer ground there...
> \method{encode()} and \method{decode()} were implemented by
> Marc-Andr\'e Lemburg.  The changes to support using UCS-4 internally
> were implemented by Fredrik Lundh and Martin von L\"owis.
> \begin{seealso}
> \seepep{261}{Support for `wide' Unicode characters}{PEP written by
> Paul Prescod.  Not yet accepted or fully implemented.}
> \end{seealso}
> Corrections?  Thanks in advance...

If I were you, I would make sure that Marc-Andre and Martin agree
with me before adopting my comments above...

And thank *you* for doing this very useful write-up again!  (I'm doing
my part by writing up the types/class unification thing -- now mostly
complete at http://www.python.org/2.2/descrintro.html.)

--Guido van Rossum (home page: http://www.python.org/~guido/)