[Python-Dev] 2.2 Unicode questions

M.-A. Lemburg mal@lemburg.com
Thu, 19 Jul 2001 15:05:55 +0200


Guido van Rossum wrote:
> 
> > First, a short one, Mark Hammond's patch for supporting MBCS on
> > Windows.  I trust everyone can handle a little bit of TeX markup?
> >
> >   % XXX is this explanation correct?
> >   \item When presented with a Unicode filename on Windows, Python will
> >   now correctly convert it to a string using the MBCS encoding.
> >   Filenames on Windows are a case where Python's choice of ASCII as
> >   the default encoding turns out to be an annoyance.
> >
> >   This patch also adds \samp{et} as a format sequence to
> >   \cfunction{PyArg_ParseTuple}; \samp{et} takes both a parameter and
> >   an encoding name, and converts it to the given encoding if the
> >   parameter turns out to be a Unicode string, or leaves it alone if
> >   it's an 8-bit string, assuming it to already be in the desired
> >   encoding.  (This differs from the \samp{es} format character, which
> >   assumes that 8-bit strings are in Python's default ASCII encoding
> >   and converts them to the specified new encoding.)
> >
> >   (Contributed by Mark Hammond with assistance from Marc-Andr\'e
> >   Lemburg.)
> 
> I learned something here, so I hope this is correct. :-)

The last part is... the rest is for Mark to comment on.
 
> > Second, the --enable-unicode changes:
> >
> > %======================================================================
> > \section{Unicode Changes}
> >
> > Python's Unicode support has been enhanced a bit in 2.2.  Unicode
> > strings are usually stored as UCS-2, as 16-bit unsigned integers.
> > Python 2.2 can also be compiled to use UCS-4, 32-bit unsigned
> > integers, as its internal encoding by supplying
> > \longprogramopt{enable-unicode=ucs4} to the configure script.  When
> > built to use UCS-4, in theory Python could handle Unicode characters
> > from U-00000000 to U-7FFFFFFF.
> 
> I think the Unicode folks use U+, not U-, 

True.

> and the largest Unicode
> character is "only" U+10FFFF.  (Never mind that the data type can
> handle larger values.)

I wouldn't count on that...  (note that Andrew wrote "could" ;-)
 
> > Being able to use UCS-4 internally is
> > a necessary step to do that, but it's not the only step, and in Python
> > 2.2alpha1 the work isn't complete yet.  For example, the
> > \function{unichr()} function still only accepts values from 0 to
> > 65535,
> 
> Untrue: it supports range(0x110000) (in UCS-2 mode this returns a
> surrogate pair).  Now, maybe that's not what it *should* do...

It definitely should not, unless you want to break code which assumes
that chr() and unichr() always return a single byte/code unit!

This was part of the UCS-4 checkins, which I haven't had time to
review yet. Should I remove the surrogate part for narrow builds?
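
For reference, the pair that a narrow build hands back for a code point
above U+FFFF follows the standard UTF-16 arithmetic. A minimal sketch,
written for a current Python 3 interpreter rather than 2.2; the function
name to_surrogate_pair is made up for illustration and is not part of
any Python API:

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside U+10000..U+10FFFF")
    # Subtract the BMP size, leaving a 20-bit offset.
    offset = code_point - 0x10000
    # The top 10 bits select the high surrogate, the bottom 10 the low one.
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return high, low


first_pair = to_surrogate_pair(0x10000)
last_pair = to_surrogate_pair(0x10FFFF)
```

Two code units come back for one character, which is exactly what
breaks code assuming len(unichr(x)) == 1.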
 
> > and there's no \code{\e U} notation for embedding characters
> > greater than 65535 in a Unicode string literal.
> 
> Not true either -- correct \U has been part of Python since 2.0.  It
> does the same thing as unichr() described above.

Right.

Note that in this case, the handling of surrogates is needed
to make the unicode-escape encoding round-trip safe.
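
The round-trip property can be checked along these lines. This is only
a sketch, run on a current Python 3 interpreter (where strings are not
stored as UCS-2), so it shows the property itself rather than the
narrow-build surrogate handling in 2.2:

```python
# A single character above U+FFFF, written with the \U escape.
original = "\U00010000"

# Encoding with unicode_escape yields the ASCII escape sequence again.
escaped = original.encode("unicode_escape")

# Decoding that sequence must give back the string we started with.
restored = escaped.decode("unicode_escape")

round_trip_ok = (restored == original)
```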
 
-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/