[Python-Dev] New Py_UNICODE doc
Nicholas Bastin
nbastin at opnet.com
Sat May 7 02:01:50 CEST 2005
On May 6, 2005, at 7:43 PM, Martin v. Löwis wrote:
> Nicholas Bastin wrote:
>> If this is the case, then we're clearly misleading users. If the
>> configure script says UCS-2, then as a user I would assume that
>> surrogate pairs would *not* be encoded, because I chose UCS-2, and it
>> doesn't support that.
>
> What do you mean by that? That the interpreter crashes if you try
> to store a low surrogate into a Py_UNICODE?
What I mean is pretty clear. UCS-2 does *NOT* support surrogate pairs.
If it did, it would be called UTF-16. If Python really supported
UCS-2, then surrogate pairs from UTF-16 inputs would either get turned
into two garbage characters, or the "I couldn't transcode this" UCS-2
code point (I don't remember which one that is off the top of my head).
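
To make the distinction concrete, here is a rough sketch of what I mean. It assumes a modern Python 3 interpreter rather than the build under discussion, and `utf16_code_units` and `strict_ucs2_decode` are hypothetical helpers written for illustration, not anything in the interpreter. The choice of U+FFFD as the "couldn't transcode this" code point is likewise an assumption about how a strict UCS-2 reader might behave.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units of text when encoded as UTF-16."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def strict_ucs2_decode(units):
    """Hypothetical strict UCS-2 reader.

    UCS-2 has no notion of surrogate pairs, so every code unit in the
    surrogate range is replaced with U+FFFD (an assumed policy).
    """
    return "".join(
        "\ufffd" if 0xD800 <= unit <= 0xDFFF else chr(unit)
        for unit in units
    )


# U+10000 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
units = utf16_code_units("\U00010000")
print([hex(unit) for unit in units])

# A reader that really was UCS-2 would lose the character entirely.
print(strict_ucs2_decode(units) == "\ufffd\ufffd")
```

An implementation that keeps both code units intact and hands them back as a pair on output is behaving like UTF-16, whatever the configure option calls it.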
>> I would assume that any UTF-16 string I would
>> read would be transcoded into the internal type (UCS-2), and information
>> would be lost. If this is not the case, then what does the configure
>> option mean?
>
> It tells you whether you have the two-octet form of the Universal
> Character Set, or the four-octet form.
It would, if that were the case, but it's not. Setting UCS-2 in the
configure script really means UTF-16, and as such, the documentation
should reflect that.
--
Nick