[Python-3000] Making more effective use of slice objects in Py3k
Guido van Rossum
guido at python.org
Thu Aug 31 20:55:15 CEST 2006
On 8/31/06, Talin <talin at acm.org> wrote:
> One way to handle this efficiently would be to only support the
> encodings which have a constant character size: ASCII, Latin-1, UCS-2
> and UTF-32. In other words, if the content of your text is plain ASCII,
> use an 8-bit-per-character string; If the content is limited to the
> Unicode BMP (Basic Multilingual Plane) use UCS-2; And if you are using
> Unicode supplementary characters, use UTF-32.
> (The difference between UCS-2 and UTF-16 is that UCS-2 is always 2 bytes
> per character, and doesn't support the supplemental characters above
> 0xffff, whereas UTF-16 characters can be either 2 or 4 bytes.)
I think we should also support UTF-16, since Java and .NET (and
Win32?) appear to be using it effectively; making surrogate handling an
application issue doesn't seem *too* big of a burden for many apps.
> By avoiding UTF-8, UTF-16 and other variable-character-length formats,
> you can always ensure that character index operations are done in
> constant time. Index operations would simply require scaling the index
> by the character size, rather than having to scan through the string and
> count characters.
> The drawback of this method is that you may be forced to transform the
> entire string into a wider encoding if you add a single character that
> won't fit into the current encoding.
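To make the quoted scheme concrete, here is a minimal sketch, assuming
a string is stored as a (width, bytes) pair. The names narrowest_width,
pack and char_at are hypothetical and are not part of any proposal in
this thread. Indexing only scales the index by the character size, and
appending one wider character changes the width of the whole string.

```python
# Hypothetical helpers for illustration only; not an actual Py3k design.
_CODECS = {1: "latin-1", 2: "utf-16-le", 4: "utf-32-le"}


def narrowest_width(text):
    """Return bytes per character of the narrowest fixed-width form."""
    top = max(map(ord, text), default=0)
    if top < 0x100:
        return 1  # Latin-1, which also covers plain ASCII
    if top < 0x10000:
        return 2  # UCS-2: BMP only (lone surrogates are not handled here)
    return 4      # UTF-32


def pack(text):
    """Encode text in the narrowest fixed-width form that can hold it."""
    width = narrowest_width(text)
    # For BMP-only text, UTF-16 produces exactly two bytes per character,
    # so it serves as a stand-in for UCS-2 in this sketch.
    return width, text.encode(_CODECS[width])


def char_at(width, buf, i):
    """Constant-time lookup: scale the index by the character size."""
    start = i * width
    return buf[start:start + width].decode(_CODECS[width])


width, buf = pack("abc")
print(width, char_at(width, buf, 1))

# Adding a single wider character forces the whole string to widen.
width, buf = pack("abc\u20ac")
print(width, len(buf))
```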
A way to handle UTF-8 strings and other variable-length encodings
would be to maintain a small cache of index positions with the string.
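
A rough sketch of the cache idea, assuming the simplest possible cache:
a single remembered (character index, byte offset) pair from the last
lookup. The class name Utf8Text and its layout are hypothetical. A
lookup at or after the cached position scans forward from there instead
of from the start of the buffer, so sequential access stays cheap.

```python
class Utf8Text:
    """Hypothetical UTF-8 backed string with a one-entry index cache."""

    def __init__(self, text):
        self._buf = text.encode("utf-8")
        # (character index, byte offset) of the most recent lookup
        self._cache = (0, 0)

    @staticmethod
    def _is_lead(byte):
        # UTF-8 continuation bytes all have the bit pattern 10xxxxxx.
        return (byte & 0xC0) != 0x80

    def _offset(self, index):
        """Return the byte offset at which character `index` starts."""
        if index < 0:
            raise IndexError("negative indexes are not handled in this sketch")
        char_i, off = self._cache
        if index < char_i:
            # The cache only helps going forward; otherwise restart.
            char_i, off = 0, 0
        buf = self._buf
        while char_i < index:
            off += 1
            while off < len(buf) and not self._is_lead(buf[off]):
                off += 1
            char_i += 1
        if off >= len(buf):
            raise IndexError("string index out of range")
        self._cache = (index, off)
        return off

    def __getitem__(self, index):
        start = self._offset(index)
        end = start + 1
        while end < len(self._buf) and not self._is_lead(self._buf[end]):
            end += 1
        return self._buf[start:end].decode("utf-8")


t = Utf8Text("a\u20acb\U0001F600c")
print(t[3])  # scans from the start of the buffer
print(t[4])  # resumes from the cached position of character 3
```

A real implementation would probably keep several cache entries and
handle negative indexes and slices; this only shows the mechanism.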
> (Another option is to simply make all strings UTF-32 -- which is not
> that unreasonable, considering that text strings normally make up only a
> small fraction of a program's memory footprint. I am sure that there are
> applications that don't conform to this generalization, however. )
Here you are effectively voting against polymorphic strings. I believe
Fredrik has good reasons to doubt this assertion.
--Guido van Rossum (home page: http://www.python.org/~guido/)