[Python-3000] Making more effective use of slice objects in Py3k

Talin talin at acm.org
Thu Aug 31 20:46:13 CEST 2006


Jack Diederich wrote:
>>> (in other words, I'm convinced that we need a polymorphic string type.  I'm not
>>> so sure we need views, but if we have the former, we can use that mechanism to
>>> support the latter)
>> +1 for polymorphic strings.
>>
>> This would give us the best of both worlds: compact representations
>> for ASCII and Latin-1, full 32-bit text when needed, and the
>> possibility to implement further optimizations when necessary. It
>> could add a bit of complexity and/or a massive speed penalty
>> (depending on how naive the implementation is) around character
>> operations though.
>>
>> For implementation ideas, Apple's CoreFoundation has a mature
>> implementation of polymorphic strings in C (which is the basis for
>> their NSString type in Objective-C), and there's a cross-platform
>> subset of it available as CF-Lite:
>> http://developer.apple.com/opensource/cflite.html
>>
> 
> Having watched Fredrik casually double the speed of many str and unicode 
> operations in a week I'm easily +1 on whatever he says.  Bob's support 
> makes that a +2, he struck me as quite sane too.

One way to handle this efficiently would be to support only the 
encodings that have a constant character size: ASCII, Latin-1, UCS-2 
and UTF-32. In other words, if the content of your text is plain ASCII, 
use an 8-bit-per-character string; if the content is limited to the 
Unicode BMP (Basic Multilingual Plane), use UCS-2; and if you are using 
Unicode supplementary characters, use UTF-32.
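
To make that concrete, here's a rough sketch of the selection logic 
(the helper name is mine and purely illustrative, written in today's 
Python rather than as a proposed API):

    def pick_storage(text):
        """Bytes per character for the narrowest constant-width
        encoding that can hold every code point in `text`."""
        widest = max(map(ord, text)) if text else 0
        if widest < 0x100:
            return 1    # ASCII or Latin-1: one byte per character
        if widest < 0x10000:
            return 2    # everything fits in the BMP: UCS-2
        return 4        # supplementary characters: UTF-32

    >>> pick_storage("hello")
    1
    >>> pick_storage("\U0001D11E")    # MUSICAL SYMBOL G CLEF
    4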

(The difference between UCS-2 and UTF-16 is that UCS-2 is always 2 bytes 
per character and doesn't support the supplementary characters above 
0xffff, whereas UTF-16 encodes a character in either 2 or 4 bytes.)
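
For illustration, today's codecs show the difference directly: a 
character above 0xffff costs two 16-bit code units (a surrogate pair) 
in UTF-16, which is exactly what UCS-2 cannot express:

    >>> s = "\U0001D11E"                   # above 0xffff
    >>> len(s.encode("utf-16-le"))         # surrogate pair: two units
    4
    >>> len("\u00e9".encode("utf-16-le"))  # BMP character: one unit
    2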

By avoiding UTF-8, UTF-16 and other variable-width encodings, you can 
ensure that character index operations always run in constant time: an 
index operation simply scales the index by the character size, rather 
than scanning through the string and counting characters.
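
A sketch of what that indexing looks like over a flat buffer of 
fixed-width code units (a hypothetical layout, not a proposed 
implementation):

    def char_at(buf, index, char_size, byteorder="little"):
        """Fetch code point `index` in O(1): scale, slice, decode."""
        start = index * char_size           # a multiply, not a scan
        unit = buf[start:start + char_size]
        return chr(int.from_bytes(unit, byteorder))

    >>> buf = "d\u00e9j\u00e0".encode("latin-1")  # 1 byte per character
    >>> char_at(buf, 1, 1)
    'é'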

The drawback of this method is that you may be forced to transform the 
entire string into a wider encoding if you add a single character that 
won't fit into the current encoding.
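
The widening itself is a single O(n) re-encode, something like the 
following (using "utf-16-le" here as a stand-in for BMP-only UCS-2 
storage):

    def widen(buf, old_size, new_size):
        """Re-encode a fixed-width buffer at a wider character size."""
        codec = {1: "latin-1", 2: "utf-16-le", 4: "utf-32-le"}
        return buf.decode(codec[old_size]).encode(codec[new_size])

    >>> widen(b"abc", 1, 4)    # 3 bytes become 12
    b'a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00'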

(Another option is to simply make all strings UTF-32 -- which is not 
that unreasonable, considering that text strings normally make up only a 
small fraction of a program's memory footprint. I am sure that there are 
applications that don't conform to this generalization, however.)
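
A quick back-of-the-envelope check of that cost: mostly-ASCII text 
quadruples in size when stored at four bytes per character:

    >>> text = "x" * 1024
    >>> len(text.encode("latin-1")), len(text.encode("utf-32-le"))
    (1024, 4096)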

-- Talin

