[Python-Dev] 2.2 Unicode questions
M.-A. Lemburg
mal@lemburg.com
Thu, 19 Jul 2001 15:05:55 +0200
Guido van Rossum wrote:
>
> > First, a short one, Mark Hammond's patch for supporting MBCS on
> > Windows. I trust everyone can handle a little bit of TeX markup?
> >
> > % XXX is this explanation correct?
> > \item When presented with a Unicode filename on Windows, Python will
> > now correctly convert it to a string using the MBCS encoding.
> > Filenames on Windows are a case where Python's choice of ASCII as
> > the default encoding turns out to be an annoyance.
> >
> > This patch also adds \samp{et} as a format sequence to
> > \cfunction{PyArg_ParseTuple}; \samp{et} takes both a parameter and
> > an encoding name, and converts it to the given encoding if the
> > parameter turns out to be a Unicode string, or leaves it alone if
> > it's an 8-bit string, assuming it to already be in the desired
> > encoding. (This differs from the \samp{es} format character, which
> > assumes that 8-bit strings are in Python's default ASCII encoding
> > and converts them to the specified new encoding.)
> >
> > (Contributed by Mark Hammond with assistance from Marc-Andr\'e
> > Lemburg.)
>
> I learned something here, so I hope this is correct. :-)
The last part is... the rest is for Mark to comment on.
> > Second, the --enable-unicode changes:
> >
> > %======================================================================
> > \section{Unicode Changes}
> >
> > Python's Unicode support has been enhanced a bit in 2.2. Unicode
> > strings are usually stored as UCS-2, as 16-bit unsigned integers.
> > Python 2.2 can also be compiled to use UCS-4, 32-bit unsigned
> > integers, as its internal encoding by supplying
> > \longprogramopt{enable-unicode=ucs4} to the configure script. When
> > built to use UCS-4, in theory Python could handle Unicode characters
> > from U-00000000 to U-7FFFFFFF.
>
> I think the Unicode folks use U+, not U-,
True.
> and the largest Unicode
> character is "only" U+10FFFF. (Never mind that the data type can
> handle larger values.)
I wouldn't count on that... (note that Andrew wrote "could" ;-)
> > Being able to use UCS-4 internally is
> > a necessary step to do that, but it's not the only step, and in Python
> > 2.2alpha1 the work isn't complete yet. For example, the
> > \function{unichr()} function still only accepts values from 0 to
> > 65535,
>
> Untrue: it supports range(0x110000) (in UCS-2 mode this returns a
> surrogate pair). Now, maybe that's not what it *should* do...
It definitely should not, unless you want to break code which assumes
that chr() and unichr() always return a single byte/code unit!
This was part of the UCS-4 checkins, which I hadn't yet had time to
review. Should I remove the surrogate part for narrow builds?
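
For reference, here is a rough sketch of the arithmetic behind the
surrogate pair that a narrow (UCS-2) build returns for a code point
above 65535. The helper names to_surrogate_pair() and
from_surrogate_pair() are made up for illustration and are not part of
Python; the constants are the standard UTF-16 ones.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates.

    Hypothetical helper for illustration only, not a Python API.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary range")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))
```

Code that assumes unichr() gives back exactly one code unit would see
two units here, which is the breakage in question.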
> > and there's no \code{\e U} notation for embedding characters
> > greater than 65535 in a Unicode string literal.
>
> Not true either -- correct \U has been part of Python since 2.0. It
> does the same thing as unichr() described above.
Right.
Note that in this case, the handling of surrogates is needed
to make the unicode-escape encoding round-trip safe.
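
A small sketch of the round trip in question. It is written for a
current Python 3 interpreter, which is an assumption on my part: there
every build stores the character as a single code point, whereas on a
narrow 2.2 build the decoded value would be a surrogate pair. The
variable names are just for illustration.

```python
# A character outside the 16-bit range, written with the \U notation.
text = "\U00010000"

# Encode to the escaped form, then decode it again.
escaped = text.encode("unicode_escape")
restored = escaped.decode("unicode_escape")

# The round trip should give back exactly what we started with.
print(escaped)
print(restored == text)
```

If the codec dropped or mangled the value above 65535 on either side,
the final comparison would fail.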
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/