[Python-3000] string C API
Nick Coghlan
ncoghlan at gmail.com
Sat Sep 16 18:49:36 CEST 2006
Martin v. Löwis wrote:
> Nick Coghlan wrote:
>> The choice of latin-1 is deliberate and non-arbitrary. The reason for the
>> choice is that the ordinals 0-255 in latin-1 map to the Unicode code points 0-255:
>
> That's true, but that this makes a good choice for a special case
> doesn't follow. Instead, frequency of occurrence of the special case
> makes it a good choice.
If an 8-bit encoding other than latin-1 is used for the internal buffer, then
every comparison operation would have to decode the string to Unicode in order
to compare code points.
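
A rough sketch of that point, using Python 3 str/bytes semantics and with
cp1252 picked purely as an example of "some other 8-bit encoding" (my choice
for illustration, not something proposed in this thread):

```python
# latin-1: byte value N always decodes to code point N, so comparing the
# raw buffer bytes gives the same answer as comparing code points.
all_bytes = bytes(range(256))
latin1_points = [ord(ch) for ch in all_bytes.decode("latin-1")]
identity_holds = latin1_points == list(all_bytes)

# cp1252 (illustrative alternative): byte 0x80 is the euro sign, U+20AC,
# so byte order and code point order can disagree.
euro, y_diaeresis = "\u20ac", "\u00ff"
cp1252_bytes_less = euro.encode("cp1252") < y_diaeresis.encode("cp1252")
code_points_less = euro < y_diaeresis

print(identity_holds, cp1252_bytes_less, code_points_less)
```

With a cp1252 buffer the byte comparison and the code point comparison
disagree for this pair, which is why the buffer would have to be decoded
before comparing.
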
It seems much simpler to me to ensure that what is stored internally is
*always* the Unicode code points, with the width (1, 2 or 4 bytes) determined
by the largest code point in the string. The latter two are the UCS-2 and
UCS-4 formats that are compile-time selectable for unicode strings in Python
2.x, but I'm not aware of any name other than 'latin-1' for the case where all
of the code points are less than 256.
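
As a minimal sketch of the width rule (storage_width is a made-up helper name
for illustration, not a proposed API):

```python
def storage_width(text):
    """Return the bytes per character needed to store every code point in text."""
    # An empty string has no code points; treat it as fitting the 1-byte form.
    largest = max(map(ord, text), default=0)
    if largest < 0x100:      # all code points < 256: the 'latin-1' case
        return 1
    if largest < 0x10000:    # fits in 16 bits: the UCS-2 case
        return 2
    return 4                 # anything larger needs the UCS-4 case


for sample in ("abc", "\u00e9", "\u20ac", "\U0001f600"):
    print(repr(sample), storage_width(sample))
```
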
> Hardly. Instead, the codec would have to create the string of the right
> width; a codec written in C would make two passes, rather than
> temporarily allocating memory to actually represent the UCS-4 codes.
Indeed, that does make more sense - one pass to figure out the number of
characters and the largest code point, and a second to copy the characters to
the allocated buffer.
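
Something along these lines, sketched in Python rather than C. The helper
names are hypothetical, the input is assumed to be well-formed UTF-8 (no error
handling), and the little-endian layout of the buffer is just an assumption to
keep the example checkable:

```python
def iter_code_points(data):
    """Yield the code points in well-formed UTF-8 bytes (no validation)."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            cp, extra = lead, 0           # single-byte (ASCII) sequence
        elif lead < 0xE0:
            cp, extra = lead & 0x1F, 1    # two-byte sequence
        elif lead < 0xF0:
            cp, extra = lead & 0x0F, 2    # three-byte sequence
        else:
            cp, extra = lead & 0x07, 3    # four-byte sequence
        for cont in data[i + 1:i + 1 + extra]:
            cp = (cp << 6) | (cont & 0x3F)
        yield cp
        i += 1 + extra


def decode_two_pass(data):
    """Return (width, buffer) without building an intermediate UCS-4 copy."""
    # Pass 1: count the characters and find the largest code point.
    count = largest = 0
    for cp in iter_code_points(data):
        count += 1
        largest = max(largest, cp)
    width = 1 if largest < 0x100 else 2 if largest < 0x10000 else 4

    # Pass 2: allocate the buffer once at its final size, then fill it.
    buf = bytearray(count * width)
    for index, cp in enumerate(iter_code_points(data)):
        buf[index * width:(index + 1) * width] = cp.to_bytes(width, "little")
    return width, bytes(buf)


width, buf = decode_two_pass("h\u00e9llo".encode("utf-8"))
print(width, buf)
```

The input is scanned twice, but the only allocation is the final buffer at
its final width.
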
Cheers,
Nick.
--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org