[Python-3000] Making more effective use of slice objects in Py3k

Talin talin at acm.org
Fri Sep 1 05:13:27 CEST 2006


Guido van Rossum wrote:
> On 8/31/06, Talin <talin at acm.org> wrote:
>> One way to handle this efficiently would be to only support the
>> encodings which have a constant character size: ASCII, Latin-1, UCS-2
>> and UTF-32. In other words, if the content of your text is plain ASCII,
>> use an 8-bit-per-character string; if the content is limited to the
>> Unicode BMP (Basic Multilingual Plane), use UCS-2; and if you are using
>> Unicode supplementary characters, use UTF-32.
>>
>> (The difference between UCS-2 and UTF-16 is that UCS-2 is always 2 bytes
>> per character, and doesn't support the supplemental characters above
>> 0xffff, whereas UTF-16 characters can be either 2 or 4 bytes.)
> 
> I think we should also support UTF-16, since Java and .NET (and
> Win32?) appear to be using it effectively; making surrogate handling an
> application issue doesn't seem *too* big of a burden for many apps.

I see that I misspoke - what I meant was that we would "support" all of 
the available encodings in the sense that we could translate string 
objects to and from those encodings. But the internal representations of 
the string objects themselves would only use those encodings which 
represent a character in a fixed number of bytes.

Moreover, this internal representation should be opaque to users of the 
string - if you want to write out a string as UTF-8 to a file, go for 
it; it shouldn't matter what the internal type of the string is.

(Although Jython and IronPython should probably use whatever string 
representation is defined by the underlying VM.)

>> By avoiding UTF-8, UTF-16 and other variable-character-length formats,
>> you can always ensure that character index operations are done in
>> constant time. Index operations would simply require scaling the index
>> by the character size, rather than having to scan through the string and
>> count characters.
>>
>> The drawback of this method is that you may be forced to transform the
>> entire string into a wider encoding if you add a single character that
>> won't fit into the current encoding.
> 
> A way to handle UTF-8 strings and other variable-length encodings
> would be to maintain a small cache of index positions with the string
> object.

Actually, I realized that this drawback isn't really much of an issue at 
all. For virtually all string operations in Python, it is possible to 
predict ahead of time what string width will be required - thus you can 
allocate an object of the proper width up front, and not have to "widen" 
the string in mid-operation.

So for example, any string operation which produces a subset of the 
string (such as partition, split, index, slice, etc.) will produce a 
string of the same width as the original string.

Any string operation that involves combining two strings will produce a 
string that is the same type as the wider of the two strings. Thus, if I 
say something like:

    "Hello World" + chr( 0x8000 )

This will produce a 16-bit-wide string: 'chr( 0x8000 )' can't be 
represented in ASCII, so it is a 16-bit string, and since the first 
string is plain ASCII (8 bits) and the second is 16 bits, the result of 
the concatenation is 16 bits wide.

Similarly, transformations on strings such as upper / lower yield a 
string that is the same width as the original.
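
To make that width-selection rule concrete, here is a rough sketch in 
plain Python. The char_width / str_width / concat_width helpers are 
made-up names for illustration only, not a proposed API; the point is 
just that the widest operand determines the width of the result:

    def char_width(cp):
        # Bytes needed to hold one code point in a fixed-width form.
        if cp < 0x100:
            return 1      # 8-bit (ASCII / Latin-1 range)
        elif cp < 0x10000:
            return 2      # 16-bit (UCS-2, BMP only)
        else:
            return 4      # 32-bit (UTF-32, supplementary planes)

    def str_width(s):
        # An empty string can use the narrowest representation.
        return max([char_width(ord(c)) for c in s] or [1])

    def concat_width(a, b):
        # Concatenation is as wide as the wider of the two operands.
        return max(str_width(a), str_width(b))

    # "Hello World" is plain ASCII (width 1); chr(0x8000) needs 16 bits,
    # so the concatenation would be stored 2 bytes per character.
    assert concat_width("Hello World", chr(0x8000)) == 2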

The only case I can think of where you might need to "promote" an entire 
string is when you are concatenating to a string buffer - in other 
words, when you are dealing with a mutable string type. That case is 
easily handled by making the mutable string buffer type always use 
UTF-32, and then, when str() is called, narrowing the result to the 
narrowest representation that can hold it.
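
A minimal sketch of such a buffer, using a hypothetical StrBuffer class 
(again, an illustration of the idea rather than a proposed API):

    class StrBuffer:
        # Hypothetical mutable string buffer: accumulate code points at
        # full 32-bit width, and only narrow when the final string is built.
        def __init__(self):
            self._codepoints = []          # conceptually a UTF-32 array

        def append(self, s):
            # No promotion needed while building: everything fits in 32 bits.
            self._codepoints.extend(ord(c) for c in s)

        def build(self):
            # Same width rule as in the earlier sketch: 1, 2, or 4 bytes.
            def char_width(cp):
                return 1 if cp < 0x100 else 2 if cp < 0x10000 else 4
            # Narrow to the smallest fixed-width representation that fits.
            width = max([char_width(cp) for cp in self._codepoints] or [1])
            # A real implementation would allocate an 8-, 16- or 32-bit
            # string here; this sketch just reports the chosen width.
            return width, "".join(map(chr, self._codepoints))

    buf = StrBuffer()
    buf.append("Hello ")
    buf.append("World")
    print(buf.build())    # (1, 'Hello World') -- narrowed to an 8-bit result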

So essentially what I am proposing is this:

-- That the Python 3000 "str" type can consist of 8-bit, 16-bit, or 
32-bit characters, where all characters within a string occupy the same 
number of bytes.

-- That all 3 types of strings appear identical to Python programmers, 
such that they need not know what type of string they are using.

-- Any operation that returns a string result has the responsibility to 
ensure that the resulting string is wide enough to contain all of the 
characters produced by the operation.

-- That string index operations will always be constant time, with no 
auxiliary data structures required.

-- That all 3 string types can be converted into any of the available 
encodings, including variable-character-width formats; however, the 
result is a "bytes" object, not a string.

An additional, but separate part of the proposal is that for str 
objects, the contents of the string are always defined in terms of 
Unicode code points. So if you want to convert to ISO-Latin-1, you can, 
but the result is a bytes object, not a string. The advantage of this 
is that you always know what the value of 'ord()' is for a given 
character. It also means that two strings can always be compared for 
equality without having to decode them first.
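
A quick sketch of what that rule would look like in use - the encode() 
spelling and the exact output are assumptions, but the point is that 
ord() and equality work on code points while encoding produces bytes:

    s = "caf\u00e9"                 # str contents are Unicode code points

    print(ord(s[-1]))               # 233 -- always the code point, no matter
                                    # how the string is stored internally

    data = s.encode("latin-1")      # converting to an encoding yields bytes
    print(type(data), data)         # <class 'bytes'> b'caf\xe9'

    # Two str objects compare by code point; no decoding step is needed:
    assert s == "caf\xe9"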

>> (Another option is to simply make all strings UTF-32 -- which is not
>> that unreasonable, considering that text strings normally make up only a
>> small fraction of a program's memory footprint. I am sure that there are
>> applications that don't conform to this generalization, however.)
> 
> Here you are effectively voting against polymorphic strings. I believe
> Fredrik has good reasons to doubt this assertion.

Yes, that is correct. I'm just throwing it out there as a possibility, 
as it is by far the simplest solution. It's a question of trading memory 
use for simplicity of implementation. Having a single, flat, internal 
representation for all strings would be much less complex than having 
different string types.

-- Talin

