PEP 393 vs UTF-8 Everywhere

Fri Jan 20 20:00:58 EST 2017

On Sat, Jan 21, 2017 at 11:51 AM, Pete Forman <petef4+usenet at gmail.com> wrote:
> MRAB <python at mrabarnett.plus.com> writes:
>
>> As someone who has written an extension, I can tell you that I much
>> prefer dealing with a fixed number of bytes per codepoint than a
>> variable number of bytes per codepoint, especially as I'm also
>> supporting earlier versions of Python where that was the case.
>
> At the risk of sounding harsh, if supporting variable bytes per
> codepoint is a pain you should roll with it for the greater good of
> supporting users.

It hasn't been demonstrated that variable-width storage serves users
better, though. There's plenty of evidence regarding cache usage that
shows direct indexing is highly beneficial on large strings. What are
the benefits of variable-sized
encodings? AFAIK, the only real benefit is that you can use less
memory for strings that contain predominantly ASCII but a small number
of astral characters (plus *maybe* a faster encode-to-UTF-8; you
wouldn't get a faster decode-from-UTF-8, because you still need to
check that the byte sequence is valid). Can you show a use case that
would be materially improved by a UTF-8 internal representation?
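
For anyone who wants to poke at the trade-off, here is a rough sketch, not
a benchmark. It assumes CPython 3.3 or later, since sys.getsizeof() on str
is implementation-specific. The helper names bytes_per_codepoint and
utf8_offset_of are made up for illustration and are not part of any
library. The first estimates how many bytes CPython stores per code point.
The second shows the linear scan needed to find code point N in UTF-8
bytes when no side index is kept.

```python
import sys


def bytes_per_codepoint(sample: str) -> float:
    """Estimate CPython's storage width per code point for `sample`."""
    # Doubling the string and subtracting cancels the fixed object header.
    return (sys.getsizeof(sample * 2) - sys.getsizeof(sample)) / len(sample)


def utf8_offset_of(data: bytes, index: int) -> int:
    """Byte offset of code point `index` in valid UTF-8 `data` (linear scan)."""
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts
        # a new code point.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError(index)


ascii_only = "a" * 1000
mostly_ascii = "a" * 999 + "\U0001F600"  # one astral character at the end

print(bytes_per_codepoint(ascii_only))    # storage width for pure ASCII
print(bytes_per_codepoint(mostly_ascii))  # width once one astral char appears
print(len(mostly_ascii.encode("utf-8")))  # size of the same text as UTF-8
print(utf8_offset_of(mostly_ascii.encode("utf-8"), 999))
```

The second string is the case where UTF-8 saves memory: one astral
character widens every code point in the PEP 393 representation. In
exchange, str indexing stays a direct lookup rather than a scan like
utf8_offset_of.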

ChrisA