[Python-3000] Making more effective use of slice objects in Py3k
Talin
talin at acm.org
Thu Aug 31 20:46:13 CEST 2006
Jack Diederich wrote:
>>> (in other words, I'm convinced that we need a polymorphic string type. I'm not
>>> so sure we need views, but if we have the former, we can use that mechanism to
>>> support the latter)
>> +1 for polymorphic strings.
>>
>> This would give us the best of both worlds: compact representations
>> for ASCII and Latin-1, full 32-bit text when needed, and the
>> possibility to implement further optimizations when necessary. It
>> could add a bit of complexity and/or a massive speed penalty
>> (depending on how naive the implementation is) around character
>> operations though.
>>
>> For implementation ideas, Apple's CoreFoundation has a mature
>> implementation of polymorphic strings in C (which is the basis for
>> their NSString type in Objective-C), and there's a cross-platform
>> subset of it available as CF-Lite:
>> http://developer.apple.com/opensource/cflite.html
>>
>
> Having watched Fredrik casually double the speed of many str and unicode
> operations in a week, I'm easily +1 on whatever he says. Bob's support
> makes that a +2; he struck me as quite sane too.
One way to handle this efficiently would be to only support the
encodings which have a constant character size: ASCII, Latin-1, UCS-2
and UTF-32. In other words, if the content of your text is plain ASCII,
use an 8-bit-per-character string; if the content is limited to the
Unicode BMP (Basic Multilingual Plane), use UCS-2; and if you are using
Unicode supplementary characters, use UTF-32.
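As a rough sketch of how the selection might work (the function name
and width constants here are illustrative, not a proposed API):

    def narrowest_width(text):
        # Scan once and pick the smallest fixed-width representation
        # that can hold every character in the string.
        widest = max(map(ord, text)) if text else 0
        if widest < 0x100:
            return 1    # ASCII or Latin-1: one byte per character
        if widest < 0x10000:
            return 2    # BMP only: UCS-2, two bytes per character
        return 4        # supplementary characters: UTF-32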
(The difference between UCS-2 and UTF-16 is that UCS-2 is always 2 bytes
per character, and doesn't support the supplemental characters above
0xffff, whereas a UTF-16 character can occupy either 2 or 4 bytes --
one or two 16-bit code units.)
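You can see the difference with the codecs that ship with Python today:

    >>> len('\u0041'.encode('utf-16-be'))      # 'A' is in the BMP
    2
    >>> len('\U00010000'.encode('utf-16-be'))  # supplementary: surrogate pair
    4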
By avoiding UTF-8, UTF-16 and other variable-character-length formats,
you can always ensure that character index operations are done in
constant time. An index operation simply scales the index by the
character size, rather than having to scan through the string and count
characters.
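In other words, indexing reduces to plain pointer arithmetic. A minimal
sketch, assuming the string's payload is stored as a flat byte buffer
(the layout is hypothetical):

    def char_at(buffer, index, char_size):
        # With a fixed-width encoding, the i-th character always
        # starts at byte i * char_size -- no scanning required.
        offset = index * char_size
        return int.from_bytes(buffer[offset:offset + char_size], 'little')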
The drawback of this method is that you may be forced to transform the
entire string into a wider encoding if you add a single character that
won't fit into the current encoding.
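That transformation is an O(n) copy over the whole string. A
hypothetical sketch of the widening step:

    def widen(buffer, old_size, new_size):
        # Re-encode every existing character at the new width; this
        # is the cost paid when a single wider character arrives.
        out = bytearray()
        for i in range(0, len(buffer), old_size):
            ch = int.from_bytes(buffer[i:i + old_size], 'little')
            out += ch.to_bytes(new_size, 'little')
        return bytes(out)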
(Another option is to simply make all strings UTF-32 -- which is not
that unreasonable, considering that text strings normally make up only a
small fraction of a program's memory footprint. I am sure that there are
applications that don't conform to this generalization, however.)
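For a sense of the overhead, here is the four-fold cost measured with
the standard codecs (10 MiB of ASCII text, sizes in bytes):

    >>> text = 'x' * (10 * 2**20)
    >>> len(text.encode('latin-1'))      # 1 byte per character
    10485760
    >>> len(text.encode('utf-32-be'))    # 4 bytes per character
    41943040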
-- Talin