[Python-Dev] New Py_UNICODE doc
Nicholas Bastin
nbastin at opnet.com
Fri May 6 20:49:00 CEST 2005
On May 6, 2005, at 3:17 AM, M.-A. Lemburg wrote:
> You've got that wrong: Python lets you choose UCS-4 -
> UCS-2 is the default.
>
> Note that Python's Unicode codecs UTF-8 and UTF-16
> are surrogate aware and thus support non-BMP code points
> regardless of the build type: A UCS2-build of Python will
> store a non-BMP code point as UTF-16 surrogate pair in the
> Py_UNICODE buffer while a UCS4 build will store it as a
> single value. Decoding is surrogate aware too, so a UTF-16
> surrogate pair in a UCS2 build will get treated as single
> Unicode code point.
If this is the case, then we're clearly misleading users. If the
configure script says UCS-2, then as a user I would assume that
surrogate pairs would *not* be encoded, because I chose UCS-2, and UCS-2
doesn't support them. I would assume that any UTF-16 string I
read would be transcoded into the internal type (UCS-2), and
information would be lost. If this is not the case, then what does the
configure option mean?
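
To make the distinction concrete, here is a rough sketch of what the
quoted text describes: the same non-BMP code point as a single UCS-4
value versus a UTF-16 surrogate pair. It is written for current
Python 3, which no longer has a UCS-2/UCS-4 configure option, so the
16-bit units are computed by hand. The helper name to_utf16_units is
made up for illustration and does no validation of its input.

```python
def to_utf16_units(code_point):
    """Return the 16-bit code units UTF-16 uses for one code point.

    Illustrative helper only: it does not reject invalid input such as
    lone surrogates or values above 0x10FFFF.
    """
    if code_point < 0x10000:
        # BMP code point: one 16-bit unit, same value as in UCS-2.
        return [code_point]
    # Non-BMP: split the 20-bit offset into a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


cp = 0x1D11E  # MUSICAL SYMBOL G CLEF, outside the BMP

# A UCS-4 style buffer holds the code point as one value.
ucs4_units = [cp]

# A 16-bit buffer holding UTF-16 needs two units (a surrogate pair).
utf16_units = to_utf16_units(cp)

# Cross-check against the built-in UTF-16 codec (big-endian, no BOM).
encoded = chr(cp).encode("utf-16-be")
codec_units = [int.from_bytes(encoded[i:i + 2], "big")
               for i in range(0, len(encoded), 2)]

print([hex(u) for u in ucs4_units])
print([hex(u) for u in utf16_units])
print(utf16_units == codec_units)
```

Strict UCS-2 has no way to represent 0x1D11E at all, which is why a
16-bit buffer containing the pair above is really holding UTF-16.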
--
Nick