Surrogate pairs in new flexible string representation [was Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]

Steven D'Aprano steve+comp.lang.python at
Fri Mar 29 01:39:57 CET 2013

On Thu, 28 Mar 2013 10:11:59 -0600, Ian Kelly wrote:

> On Thu, Mar 28, 2013 at 8:38 AM, Chris Angelico <rosuav at>
> wrote:
>> PEP393 strings have two optimizations, or kinda three:
>> 1a) ASCII-only strings
>> 1b) Latin1-only strings
>> 2) BMP-only strings
>> 3) Everything else
>> Options 1a and 1b are almost identical - I'm not sure what the detail
>> is, but there's something flagging those strings that fit inside seven
>> bits. (Something to do with optimizing encodings later?) Both are
>> optimized down to a single byte per character.
> The only difference for ASCII-only strings is that they are kept in a
> struct with a smaller header.  The smaller header omits the utf8 pointer
> (which optionally points to an additional UTF-8 representation of the
> string) and its associated length variable.  These are not needed for
> ASCII-only strings because an ASCII string can be directly interpreted
> as a UTF-8 string for the same result.  The smaller header also omits
> the "wstr_length" field which, according to the PEP, "differs from
> length only if there are surrogate pairs in the representation."  For an
> ASCII string, of course there would not be any surrogate pairs.

I wonder why they need to care about surrogate pairs?

ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
strings. It's only strings with characters in the supplementary planes
that could need surrogate pairs, and they don't need them in Python's
implementation, since it stores such strings at a full 32 bits per
character. So where do the surrogate pairs come into this?
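
A rough way to check the per-character widths from the interpreter,
assuming CPython 3.3 or later. sys.getsizeof is implementation-specific,
and the helper name bytes_per_char is made up for this sketch:

```python
import sys

def bytes_per_char(ch, n=100):
    # Hypothetical helper: grow a string by one character and measure how
    # many bytes the object gains. Assumes CPython 3.3+ (PEP 393 strings).
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

widths = {
    "ascii": bytes_per_char("a"),
    "latin1": bytes_per_char("\u00e9"),
    "bmp": bytes_per_char("\u20ac"),
    "supplementary": bytes_per_char("\U0001F600"),
}

astral = "\U0001F600"
# A supplementary-plane character counts as one code point in the str itself.
code_points = len(astral)
# Surrogates only show up once the text is encoded as UTF-16 (2 bytes per unit).
utf16_units = len(astral.encode("utf-16-le")) // 2
```

On such a build the widths should come out as 1, 1, 2 and 4 bytes, with
code_points equal to 1 and utf16_units equal to 2. That matches the
three storage classes described above: the str never holds a surrogate
pair, and a pair appears only in a 16-bit encoding of it.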

I also wonder why the implementation bothers keeping a UTF-8 
representation. That sounds like premature optimization to me. Surely you 
only need it when writing to a file with UTF-8 encoding? For most 
strings, that will never happen.
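
Ian's point that an ASCII string is already its own UTF-8 form can be
checked in a few lines. This only illustrates the byte-level equivalence;
it says nothing about when CPython actually fills in its cached UTF-8
copy:

```python
ascii_text = "hello"
latin1_text = "caf\u00e9"

# For pure ASCII, the UTF-8 bytes are identical to the ASCII bytes.
ascii_same = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# With a non-ASCII Latin-1 character, the one-byte-per-character form and
# the UTF-8 form differ, so the UTF-8 bytes have to be produced separately.
latin1_same = latin1_text.encode("latin-1") == latin1_text.encode("utf-8")
```

Here ascii_same is True and latin1_same is False, which is why only the
non-ASCII cases would ever need a separate UTF-8 buffer.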

