Surrogate pairs in new flexible string representation [was Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]
steve+comp.lang.python at pearwood.info
Fri Mar 29 01:39:57 CET 2013
On Thu, 28 Mar 2013 10:11:59 -0600, Ian Kelly wrote:
> On Thu, Mar 28, 2013 at 8:38 AM, Chris Angelico <rosuav at gmail.com> wrote:
>> PEP393 strings have two optimizations, or kinda three:
>> 1a) ASCII-only strings
>> 1b) Latin1-only strings
>> 2) BMP-only strings
>> 3) Everything else
>> Options 1a and 1b are almost identical - I'm not sure what the detail
>> is, but there's something flagging those strings that fit inside seven
>> bits. (Something to do with optimizing encodings later?) Both are
>> optimized down to a single byte per character.
> The only difference for ASCII-only strings is that they are kept in a
> struct with a smaller header. The smaller header omits the utf8 pointer
> (which optionally points to an additional UTF-8 representation of the
> string) and its associated length variable. These are not needed for
> ASCII-only strings because an ASCII string can be directly interpreted
> as a UTF-8 string for the same result. The smaller header also omits
> the "wstr_length" field which, according to the PEP, "differs from
> length only if there are surrogate pairs in the representation." For an
> ASCII string, of course there would not be any surrogate pairs.
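For anyone who wants to check the storage classes described above, here is a rough sketch. It assumes CPython 3.3 or later, where sys.getsizeof reports the size of the string object; the helper names bytes_per_char and header_size are made up for this example and are not part of any API. The "header" figure is only approximate, since it also includes the trailing null terminator.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by comparing strings of n and 2*n copies of ch."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

def header_size(ch, n=1000):
    """Approximate fixed overhead: total size minus the space used by the characters."""
    return sys.getsizeof(ch * n) - n * bytes_per_char(ch, n)

# One sample character per storage class from the quoted list.
widths = {
    "ascii": bytes_per_char("a"),            # 1a) ASCII-only
    "latin1": bytes_per_char("\xe9"),        # 1b) Latin-1-only
    "bmp": bytes_per_char("\u20ac"),         # 2) BMP-only
    "astral": bytes_per_char("\U0001F600"),  # 3) everything else
}
print(widths)

# The ASCII-only struct should report a smaller fixed overhead than the Latin-1 one.
print(header_size("a"), header_size("\xe9"))
```

The exact header sizes vary between CPython versions, so only the comparison between the two is meaningful here.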
I wonder why they need to care about surrogate pairs?
ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
strings. It's only strings in the SMPs that could need surrogate pairs,
and they don't need them in Python's implementation since it's a full
32-bit implementation. So where do the surrogate pairs come into this?
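To make the distinction concrete, here is a small sketch, assuming Python 3.3 or later. It shows that a character outside the BMP is a single code point in the str itself, and only turns into a surrogate pair once the text is converted to a 16-bit form such as UTF-16. Whether that is what wstr_length is guarding against is not something this example settles.

```python
import struct

s = "\U0001F600"  # one code point outside the BMP
print(len(s))     # number of code points in the str

# Encode with an explicit byte order so that no BOM is prepended.
utf16 = s.encode("utf-16-le")
high, low = struct.unpack("<2H", utf16)  # two 16-bit code units
print(hex(high), hex(low))

# A surrogate pair is a high surrogate followed by a low surrogate.
is_pair = 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
print(is_pair)
```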
I also wonder why the implementation bothers keeping a UTF-8
representation. That sounds like premature optimization to me. Surely you
only need it when writing to a file with UTF-8 encoding? For most
strings, that will never happen.
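As a quick check of the point quoted earlier that an ASCII string can be read directly as UTF-8, the following sketch compares encodings. The sample strings are arbitrary; nothing here inspects the cached utf8 pointer itself, it only shows why a separate UTF-8 copy would be redundant for ASCII but not for Latin-1 text.

```python
ascii_text = "flexible string"
latin1_text = "na\xefve caf\xe9"  # contains two non-ASCII Latin-1 characters

# For pure ASCII text the one-byte-per-character form already is valid UTF-8.
ascii_same = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# For Latin-1 text the UTF-8 form differs: each non-ASCII character takes two bytes.
latin1_same = latin1_text.encode("latin-1") == latin1_text.encode("utf-8")
extra_bytes = len(latin1_text.encode("utf-8")) - len(latin1_text)

print(ascii_same, latin1_same, extra_bytes)
```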