[Python-3000] Making more effective use of slice objects in Py3k
Talin
talin at acm.org
Fri Sep 1 05:13:27 CEST 2006
Guido van Rossum wrote:
> On 8/31/06, Talin <talin at acm.org> wrote:
>> One way to handle this efficiently would be to only support the
>> encodings which have a constant character size: ASCII, Latin-1, UCS-2
>> and UTF-32. In other words, if the content of your text is plain ASCII,
>> use an 8-bit-per-character string; if the content is limited to the
>> Unicode BMP (Basic Multilingual Plane), use UCS-2; and if you are using
>> Unicode supplementary characters, use UTF-32.
>>
>> (The difference between UCS-2 and UTF-16 is that UCS-2 is always 2 bytes
>> per character, and doesn't support the supplemental characters above
>> 0xffff, whereas UTF-16 characters can be either 2 or 4 bytes.)
>
> I think we should also support UTF-16, since Java and .NET (and
> Win32?) appear to be using it effectively; making surrogate handling an
> application issue doesn't seem *too* big of a burden for many apps.
I see that I misspoke - what I meant was that we would "support" all
of the available encodings, in the sense that we could translate string
objects to and from those encodings. But the internal representations of
the string objects themselves would only use those encodings that
represent a character in a fixed number of bytes.
Moreover, this internal representation should be opaque to users of the
string - if you want to write out a string as UTF-8 to a file, go for
it; it shouldn't matter what the internal type of the string is.
(Although Jython and IronPython should probably use whatever string
representation is defined by the underlying VM.)
>> By avoiding UTF-8, UTF-16 and other variable-character-length formats,
>> you can always ensure that character index operations are done in
>> constant time. Index operations would simply require scaling the index
>> by the character size, rather than having to scan through the string and
>> count characters.
>>
>> The drawback of this method is that you may be forced to transform the
>> entire string into a wider encoding if you add a single character that
>> won't fit into the current encoding.
>
> A way to handle UTF-8 strings and other variable-length encodings
> would be to maintain a small cache of index positions with the string
> object.
Actually, I realized that this drawback isn't really much of an issue at
all. For virtually all string operations in Python, it is possible to
predict ahead of time what string width will be required - thus you can
allocate an object of the proper width up front, and not have to "widen"
the string in mid-operation.
So, for example, any string operation that produces a subset of the
string (such as partition, split, indexing, slicing, etc.) will produce a
string of the same width as the original string.
Any string operation that involves combining two strings will produce a
string that is the same type as the wider of the two strings. Thus, if I
say something like:
"Hello World" + chr( 0x8000 )
This will produce a 16-bit-wide string, because 'chr( 0x8000 )' can't
be represented in ASCII and thus produces a 16-bit-wide string. Since
the first string is plain ASCII (8 bits) and the second is 16 bits, the
result of the concatenation is a 16-bit string.
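To make the rule concrete, here is a rough sketch of how the width of a
concatenation result might be chosen (the helper names below are invented
for illustration only, not part of the proposal):

    def char_width(cp):
        """Bits needed to store a single code point."""
        if cp < 0x100:
            return 8        # fits an 8-bit (ASCII / Latin-1 range) string
        elif cp < 0x10000:
            return 16       # fits the BMP (UCS-2)
        else:
            return 32       # supplementary character, needs 32 bits

    def concat_width(a, b):
        """Width of the result of a + b: the wider of the two operands."""
        width_a = max((char_width(ord(c)) for c in a), default=8)
        width_b = max((char_width(ord(c)) for c in b), default=8)
        return max(width_a, width_b)

    # concat_width("Hello World", chr(0x8000)) -> 16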
Similarly, transformations on strings such as upper / lower yield a
string that is the same width as the original.
The only case I can think of where you might need to "promote" an entire
string is where you are concatenating to a string buffer - in other words,
where you are dealing with a mutable string type. And this case is easily
handled by simply making the mutable string buffer type always use
UTF-32, and then, when str() is called, narrowing the result to the
narrowest representation that can hold it.
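A rough sketch of what such a buffer might look like (the class and method
names here are invented for illustration, not a proposed API):

    class StrBuffer:
        """Mutable buffer that stores full-width code points (conceptually
        UTF-32) and only narrows when str() is called."""

        def __init__(self):
            self._codepoints = []    # one int per character; 32 bits always suffices

        def append(self, s):
            self._codepoints.extend(ord(c) for c in s)

        def result_width(self):
            """Narrowest fixed width (in bits) that can hold the whole buffer."""
            widest = max(self._codepoints, default=0)
            if widest < 0x100:
                return 8
            elif widest < 0x10000:
                return 16
            return 32

        def __str__(self):
            # A real implementation would allocate a str of result_width();
            # here we simply rebuild the text from the stored code points.
            return ''.join(chr(cp) for cp in self._codepoints)

    # buf = StrBuffer(); buf.append("Hello "); buf.append(chr(0x8000))
    # buf.result_width() -> 16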
So essentially what I am proposing is this:
-- That the Python 3000 "str" type can consist of 8-bit, 16-bit, or
32-bit characters, where all characters within a string are the same
number of bytes.
-- That all 3 types of strings appear identical to Python programmers,
such that they need not know what type of string they are using.
-- That any operation that returns a string result has the responsibility
to ensure that the resulting string is wide enough to contain all of the
characters produced by the operation.
-- That string index operations will always be constant time, with no
auxiliary data structures required (see the sketch after this list).
-- That all 3 string types can be converted into all of the available
encodings, including variable-character-width formats; however, the
result is a "bytes" object, not a string.
An additional, but separate, part of the proposal is that for str
objects, the contents of the string are always defined in terms of
Unicode code points. So if you want to convert to ISO-Latin-1, you can,
but the result is a bytes object, not a string. The advantage of this is
that you always know what the value of 'ord()' is for a given character.
It also means that two strings can always be compared for equality
without having to decode them first.
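For illustration, here is how those semantics might read, written in
present-day Python syntax (a sketch of the intended behavior, not a claim
about any particular implementation):

    s = "caf\u00e9"        # a str is a sequence of Unicode code points
    ord(s[-1])             # 233 -- always the code point, whatever the internal width
    s.encode("latin-1")    # b'caf\xe9' -- converting to an encoding yields bytes
    s == "caf\xe9"         # True -- equal code points compare equal, no decoding needed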
>> (Another option is to simply make all strings UTF-32 -- which is not
>> that unreasonable, considering that text strings normally make up only a
>> small fraction of a program's memory footprint. I am sure that there are
>> applications that don't conform to this generalization, however.)
>
> Here you are effectively voting against polymorphic strings. I believe
> Fredrik has good reasons to doubt this assertion.
Yes, that is correct. I'm just throwing it out there as a possibility,
as it is by far the simplest solution. It's a question of trading memory
use for simplicity of implementation. Having a single, flat, internal
representation for all strings would be much less complex than having
different string types.
-- Talin