[Python-3000] string C API
Nick Coghlan
ncoghlan at gmail.com
Sat Sep 16 05:46:36 CEST 2006
Antoine Pitrou wrote:
> On Friday 15 September 2006 at 10:48 -0700, Josiah Carlson wrote:
>> This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:
>
> You could replace "latin-1" with "one-byte system encoding chosen at
> interpreter startup depending on locale".
> There are lots of 8-bit encodings other than iso-8859-1.
> (for example, my current locale uses iso-8859-15)
The choice of latin-1 is deliberate, not arbitrary: the ordinals 0-255 in
latin-1 map directly to the Unicode code points 0-255:
>>> x = range(256)
>>> xs = ''.join(map(chr, x))
>>> xu = xs.decode('latin-1')
>>> all(ord(s)==ord(u) for s, u in zip(xs, xu))
True
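The session above uses Python 2 idioms (str.decode on an 8-bit string). As a
rough sketch, the same check in Python 3 might look like the following. The
variable names are illustrative only, and the iso-8859-15 comparison is added
here to show why another 8-bit encoding would not have the same property.

```python
# Python 3 sketch: bytes 0-255 decoded as latin-1 yield code points 0-255.
raw = bytes(range(256))
text = raw.decode('latin-1')
latin1_is_identity = all(b == ord(ch) for b, ch in zip(raw, text))

# By contrast, iso-8859-15 remaps a few positions, e.g. byte 0xA4 decodes
# to the euro sign (U+20AC), which no longer fits in one byte.
euro_code_point = ord(bytes([0xA4]).decode('iso-8859-15'))
```

So data in iso-8859-15 would still need a real decoding step, while latin-1
data can be used as code points unchanged.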
In effect, when creating the string, you would be doing something like this:
if encoding == 'latin-1':
    bytes_per_char = 1
    code_points = data_8bit
else:
    code_points, max_code_point = decode_to_UCS4(data_8bit, encoding)
    if max_code_point < 256:
        bytes_per_char = 1
    elif max_code_point < 65536:
        bytes_per_char = 2
    else:
        bytes_per_char = 4
# A width argument to the bytes constructor would be very convenient
# for being able to consistently deal with endianness issues
self.internal_buffer = bytes(code_points, width=bytes_per_char)
self.bytes_per_char = bytes_per_char
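The pseudocode above relies on a decode_to_UCS4 helper and a width argument to
bytes, neither of which exists. A minimal runnable approximation in Python 3
might look like this, assuming native byte order for the packed buffer. The
names choose_width and pack are hypothetical, and int.to_bytes stands in for
the proposed width argument.

```python
import sys

def choose_width(code_points):
    """Return the smallest of 1, 2 or 4 bytes that holds every code point."""
    max_code_point = max(code_points, default=0)
    if max_code_point < 256:
        return 1
    elif max_code_point < 65536:
        return 2
    return 4

def pack(data, encoding):
    """Decode 8-bit data and pack its code points at a fixed width.

    Returns (buffer, bytes_per_char). Native byte order is an assumption.
    """
    if encoding == 'latin-1':
        # Fast path: latin-1 bytes already are the code points.
        return bytes(data), 1
    code_points = [ord(ch) for ch in data.decode(encoding)]
    width = choose_width(code_points)
    buffer = b''.join(cp.to_bytes(width, sys.byteorder) for cp in code_points)
    return buffer, width
```

For example, pack(b'\xa4', 'iso-8859-15') needs two bytes per character
because the euro sign lies above code point 255, while pack(b'abc', 'latin-1')
returns the input bytes untouched.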
Cheers,
Nick.
--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org