flaming vs accuracy [was Re: Performance of int/long in Python 3]

Ian Kelly ian.g.kelly at gmail.com
Thu Mar 28 17:11:59 CET 2013

On Thu, Mar 28, 2013 at 8:38 AM, Chris Angelico <rosuav at gmail.com> wrote:
> PEP393 strings have two optimizations, or kinda three:
> 1a) ASCII-only strings
> 1b) Latin1-only strings
> 2) BMP-only strings
> 3) Everything else
> Options 1a and 1b are almost identical - I'm not sure what the detail
> is, but there's something flagging those strings that fit inside seven
> bits. (Something to do with optimizing encodings later?) Both are
> optimized down to a single byte per character.

The only difference for ASCII-only strings is that they are kept in a
struct with a smaller header.  The smaller header omits the utf8
pointer (which optionally points to an additional UTF-8 representation
of the string) and its associated length variable.  These are not
needed for ASCII-only strings because an ASCII string can be directly
interpreted as a UTF-8 string for the same result.  The smaller header
also omits the "wstr_length" field which, according to the PEP,
"differs from length only if there are surrogate pairs in the
representation."  For an ASCII string, of course there would not be
any surrogate pairs.
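
The header difference can be observed from Python itself. The following is
only a rough sketch, assuming CPython 3.3 or later: sys.getsizeof is
implementation-specific, the exact byte counts vary between versions and
platforms, and the helper names bytes_per_char and header_overhead are made
up for this illustration. It measures how much a string grows per added
character, and how much fixed overhead remains once the character data is
subtracted.

```python
import sys

def bytes_per_char(ch, n=10):
    """Growth in sys.getsizeof when one more copy of ch is appended."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

def header_overhead(ch, n=10):
    """Fixed overhead: total size minus the per-character payload.

    This includes the struct header plus the trailing null terminator.
    """
    return sys.getsizeof(ch * n) - n * bytes_per_char(ch, n)

# One sample character for each of the storage classes listed above.
samples = {
    "ASCII": "a",
    "Latin-1": "\xe9",
    "BMP": "\u20ac",
    "non-BMP": "\U0001f600",
}

for label, ch in samples.items():
    print(label, bytes_per_char(ch), header_overhead(ch))
```

On CPython, the ASCII and Latin-1 rows should both show one byte per
character, with the ASCII row reporting the smaller fixed overhead.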
