[Python-3000] string C API

Sat Sep 16 05:46:36 CEST 2006

Antoine Pitrou wrote:
> On Friday 15 September 2006 at 10:48 -0700, Josiah Carlson wrote:
>> This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:
> 
> You could replace "latin-1" with "one-byte system encoding chosen at
> interpreter startup depending on locale".
> There are lots of 8-bit encodings other than iso-8859-1.
> (for example, my current locale uses iso-8859-15)

The choice of latin-1 is deliberate, not arbitrary. The ordinals 0-255 in
latin-1 map directly to the Unicode code points 0-255, so latin-1 data can be
used as one-byte-per-character storage without any translation:

 >>> x = range(256)
 >>> xs = ''.join(map(chr, x))
 >>> xu = xs.decode('latin-1')
 >>> all(ord(s)==ord(u) for s, u in zip(xs, xu))
True
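
The session above is Python 2 (str.decode on a byte string). As a rough
Python 3 sketch of the same check, the snippet below also compares against
iso-8859-15 to show why another one-byte encoding would not give the same
property. The variable names are illustrative only.

```python
# Every byte value 0-255, as a Python 3 bytes object.
data = bytes(range(256))

# latin-1: each byte decodes to the code point with the same ordinal.
latin1_text = data.decode('latin-1')
latin1_is_identity = all(b == ord(ch) for b, ch in zip(data, latin1_text))

# iso-8859-15: a handful of bytes decode to code points above 255,
# for example 0xA4 becomes the euro sign U+20AC.
latin9_text = data.decode('iso-8859-15')
latin9_mismatches = [b for b, ch in zip(data, latin9_text) if b != ord(ch)]

print(latin1_is_identity)
print([hex(b) for b in latin9_mismatches])
```

Any byte in the second list would force a wider internal representation or a
translation table, which is what the latin-1 choice avoids.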

In effect, when creating the string, you would be doing something like this:

   if encoding == 'latin-1':
       # latin-1 bytes already are code points 0-255, so no decoding is needed
       bytes_per_char = 1
       code_points = data_8bit
   else:
       code_points, max_code_point = decode_to_UCS4(data_8bit, encoding)
       if max_code_point < 256:
           bytes_per_char = 1
       elif max_code_point < 65536:
           bytes_per_char = 2
       else:
           bytes_per_char = 4
   # A width argument to the bytes constructor would be very convenient
   # for being able to consistently deal with endianness issues
   self.internal_buffer = bytes(code_points, width=bytes_per_char)
   self.bytes_per_char = bytes_per_char

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org