[Python-3000] string C API

Sat Sep 16 15:49:29 CEST 2006

Nick Coghlan wrote:
> The choice of latin-1 is deliberate and non-arbitrary. The reason for the 
> choice is that the ordinals 0-255 in latin-1 map to the Unicode code points 0-255:

That's true, but it doesn't follow that this makes latin-1 a good
choice for a special case. What would make it a good choice is how
frequently the special case occurs.

> In effect, when creating the string, you would be doing something like this:
> 
>    if encoding == 'latin-1':
>        bytes_per_char = 1
>        code_points = 8_bit_data
>    else:
>        code_points, max_code_point = decode_to_UCS4(8_bit_data, encoding)
>        if max_code_point < 256:
>            bytes_per_char = 1
>        elif max_code_point < 65536:
>            bytes_per_char = 2
>        else:
>            bytes_per_char = 4

Hardly. Instead, the codec would have to create the string of the right
width itself; a codec written in C would make two passes over the input
(one to find the largest code point, one to fill in the string), rather
than temporarily allocating memory to actually represent the UCS-4 codes.
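
To illustrate the two-pass idea, here is a rough Python sketch. It is
not how any actual codec is written; the names iter_code_points and
decode_two_pass are made up for this example, the input is assumed to
be well-formed UTF-8 (no validation is done), and the little-endian
layout of the result is an arbitrary choice. The first pass only
tracks the count and the largest code point; the second pass writes
directly into a buffer of the final width, so no intermediate UCS-4
array is ever built.

```python
def iter_code_points(data):
    """Yield code points from UTF-8 bytes one at a time.

    Hypothetical helper: assumes well-formed input, does no validation.
    """
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            cp, extra = b, 0          # 1-byte sequence (ASCII)
        elif b < 0xE0:
            cp, extra = b & 0x1F, 1   # 2-byte sequence
        elif b < 0xF0:
            cp, extra = b & 0x0F, 2   # 3-byte sequence
        else:
            cp, extra = b & 0x07, 3   # 4-byte sequence
        for k in range(1, extra + 1):
            cp = (cp << 6) | (data[i + k] & 0x3F)
        i += extra + 1
        yield cp


def decode_two_pass(data):
    """Return (bytes_per_char, buffer) for UTF-8 input, in two passes."""
    # Pass 1: find the number of characters and the largest code point.
    count = 0
    max_code_point = 0
    for cp in iter_code_points(data):
        count += 1
        if cp > max_code_point:
            max_code_point = cp

    if max_code_point < 256:
        bytes_per_char = 1
    elif max_code_point < 65536:
        bytes_per_char = 2
    else:
        bytes_per_char = 4

    # Pass 2: allocate the final buffer once and fill it in place.
    buf = bytearray(count * bytes_per_char)
    for idx, cp in enumerate(iter_code_points(data)):
        start = idx * bytes_per_char
        buf[start:start + bytes_per_char] = cp.to_bytes(bytes_per_char, "little")
    return bytes_per_char, buf
```

The price is decoding the input twice; what you save is the temporary
allocation of four bytes per character.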

Regards,
Martin