[Python-Dev] PEP 393 Summer of Code Project

Wed Aug 24 18:00:42 CEST 2011

Nick Coghlan, 24.08.2011 15:06:
> On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote:
>> In utf16.py, attached to http://bugs.python.org/issue12729
>> I propose for consideration a prototype of a different solution to the 'mostly
>> BMP chars, few non-BMP chars' case. Rather than expand every character from
>> 2 bytes to 4, attach an array cpdex of character (ie code point, not code
>> unit) indexes. Then for indexing and slicing, the correction is simple,
>> simpler than I first expected:
>>   code-unit-index = char-index + bisect.bisect_left(cpdex, char_index)
>> where code-unit-index is the adjusted index into the full underlying
>> double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids
>> most of the space penalty and the consequent time penalty of moving more
>> bytes around and increasing cache misses.
>
> Interesting idea, but putting on my C programmer hat, I say -1.
>
> Non-uniform cell size = not a C array = standard C array manipulation
> idioms don't work = pain (no matter how simple the index correction
> happens to be).
>
> The nice thing about PEP 393 is that it gives us the smallest storage
> array that is both an ordinary C array and has sufficiently large
> individual elements to handle every character in the string.

+1

Stefan
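
For readers following the thread, the index correction Terry describes can be
sketched in a few lines of Python. This is only an illustration of the quoted
formula, not the code in utf16.py: the function names encode_utf16_units,
unit_index and char_at are made up here, and the sketch assumes the input
contains no lone surrogates.

```python
import bisect

def encode_utf16_units(text):
    """Return (units, cpdex): the UTF-16 code units of text and the sorted
    list of character indexes at which a non-BMP character occurs."""
    units = []
    cpdex = []
    for char_index, ch in enumerate(text):
        cp = ord(ch)
        if cp > 0xFFFF:
            # Non-BMP character: record its character index and store it
            # as a surrogate pair (two code units).
            cpdex.append(char_index)
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))
            units.append(0xDC00 + (cp & 0x3FF))
        else:
            units.append(cp)
    return units, cpdex

def unit_index(cpdex, char_index):
    """Map a character index to a code-unit index. Every non-BMP character
    strictly before char_index contributes one extra code unit."""
    return char_index + bisect.bisect_left(cpdex, char_index)

def char_at(units, cpdex, char_index):
    """Return the character at char_index, decoding a surrogate pair if
    the code unit found there is a high surrogate."""
    i = unit_index(cpdex, char_index)
    unit = units[i]
    if 0xD800 <= unit <= 0xDBFF:
        low = units[i + 1]
        return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
    return chr(unit)
```

The lookup costs one binary search over cpdex per index operation, which is
the log2(len(cpdex)) penalty mentioned above. The cells in units still vary
in meaning (one or two units per character), which is the property Nick
objects to from the C side.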