PEP 393 vs UTF-8 Everywhere

Sat Jan 21 08:56:03 EST 2017

On Sat, 21 Jan 2017 09:35 am, Pete Forman wrote:

> Can anyone point me at a rationale for PEP 393 being incorporated in
> Python 3.3 over using UTF-8 as an internal string representation?

I've read over the PEP, and the email discussion, and there is very little
mention of UTF-8, and as far as I can see no counter-proposal for using
UTF-8. However, there are a few mentions of UTF-8 that suggest that the
participants were aware of it as an alternative, and simply didn't think it
was worth considering. I don't know why.

You can read the PEP and the mailing list discussion here:

The PEP:

https://www.python.org/dev/peps/pep-0393/

Mailing list discussion starts here:

https://mail.python.org/pipermail/python-dev/2011-January/107641.html

Stefan Behnel (a core developer of Cython) states that UTF-8 is much harder
to use:

https://mail.python.org/pipermail/python-dev/2011-January/107739.html

I see nobody challenging that claim, so perhaps there was simply enough
broad agreement that UTF-8 would have been more work and so nobody wanted
to propose it. I'm just guessing though.

Perhaps it would have been too big a change to adapt the CPython internals
to variable-width UTF-8 from the existing fixed-width UTF-16 and UTF-32
implementations?

(I know that UTF-16 is actually variable-width, but Python prior to PEP 393
treated it as if it were fixed.)
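
To make that parenthetical concrete, here is a small sketch using only the
standard "utf-16-le" codec. The helper name utf16_units is made up for this
example; it is not a CPython API.

```python
def utf16_units(ch):
    """Number of 16-bit code units UTF-16 needs for the string `ch`."""
    return len(ch.encode("utf-16-le")) // 2

bmp = "\u00e9"         # U+00E9, inside the Basic Multilingual Plane
astral = "\U0001F600"  # U+1F600, outside the BMP

units_bmp = utf16_units(bmp)        # a single code unit
units_astral = utf16_units(astral)  # a surrogate pair, so two code units
```

A narrow build indexed by code unit, so a one-character string like `astral`
had length 2 there.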

There was a much earlier discussion about the internal implementation of
Unicode strings:

https://mail.python.org/pipermail/python-3000/2006-September/003795.html

including some discussion of UTF-8:

https://mail.python.org/pipermail/python-3000/2006-September/003816.html

It too proposed using a three-way internal implementation, and made it clear
that O(1) indexing was a requirement.
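
A rough sketch of why a three-way fixed-width layout keeps indexing O(1),
using the width rule PEP 393 ended up with (1, 2 or 4 bytes per code point,
chosen from the largest code point in the string). The function names
pep393_width and byte_offset are hypothetical, and this models only the
arithmetic, not the actual C structs.

```python
def pep393_width(s):
    """Bytes per code point a PEP 393-style layout would pick for `s`."""
    highest = max(map(ord, s), default=0)
    if highest < 0x100:
        return 1  # everything fits in Latin-1
    if highest < 0x10000:
        return 2  # everything fits in the BMP
    return 4      # at least one astral code point

def byte_offset(s, i):
    """Offset of code point i in the fixed-width buffer: one multiplication."""
    return i * pep393_width(s)
```

Because every code point in a given string has the same width, finding
element i never requires looking at the elements before it.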

Here's a comment explicitly pointing out that constant-time indexing is
wanted, and that using UTF-8 with a two-level table destroys any space
advantage UTF-8 might have:

https://mail.python.org/pipermail/python-3000/2006-September/003822.html

Ironically, Martin v. Löwis, the author of PEP 393, started off opposing a
three-way internal representation, calling it "terrible":

https://mail.python.org/pipermail/python-3000/2006-September/003891.html

Another factor which I didn't see discussed anywhere is that Python strings
treat surrogates as normal code points. I believe that would be troublesome
for a UTF-8 implementation:

py> '\uDC37'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc37' in position 0: surrogates not allowed

but of course with a UCS-2 or UTF-32 implementation it is trivial: you just
treat the surrogate as another code point like any other.
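
A short sketch of that behaviour. The str operations treat the lone surrogate
as an ordinary element, while strict UTF-8 refuses it. The "surrogatepass"
error handler is a real codec option in Python 3, but its output is not valid
UTF-8, so whether a UTF-8 based string type could lean on something like it
is only a guess on my part.

```python
s = "a\udc37b"

length = len(s)   # the surrogate counts as one code point
code = ord(s[1])  # and indexes like any other

try:
    s.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False  # strict UTF-8 rejects lone surrogates

# Non-standard escape hatch: encode the surrogate as a three-byte sequence.
raw = s.encode("utf-8", "surrogatepass")
round_trip = raw.decode("utf-8", "surrogatepass")
```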

[...]
> ISTM that most operations on strings are via iterators and thus agnostic
> to variable or fixed width encodings. 

Slicing is not.

text = "greeting: hello world!"  # any sample string will do
start = text.find(":")
end = text.rfind("!")
assert end > start
chunk = text[start:end]

But even with iteration, we still would expect that indexes be consecutive:

for i, c in enumerate(text):
    assert c == text[i]

Both examples get considerably more complicated with UTF-8, because locating
the i-th code point means counting from the start of the buffer. Of course
you can make it work, and you can even hide the fact that UTF-8 encodes code
points with a variable number of bytes. But you can't have all three of:

- simplicity;
- memory efficiency;
- O(1) operations

with UTF-8.
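
To illustrate the trade-off, here is a minimal sketch of two ways to index
into a UTF-8 buffer. All three function names are invented for this example,
and the flat offset table is a simplification of my own, not the two-level
table from the 2006 thread. The scan is simple and uses no extra memory but
is O(n); the table gives O(1) lookups but costs an extra integer per code
point.

```python
from array import array

def utf8_index_scan(buf, i):
    """Return code point i of UTF-8 bytes `buf` by scanning from the start."""
    if i < 0:
        raise IndexError(i)
    seen = -1
    for pos, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a code point starts
            seen += 1
            if seen == i:
                end = pos + 1
                while end < len(buf) and buf[end] & 0xC0 == 0x80:
                    end += 1
                return buf[pos:end].decode("utf-8")
    raise IndexError(i)

def build_offsets(buf):
    """Byte offset of every code point in `buf` (one array entry each)."""
    return array("L", (pos for pos, byte in enumerate(buf)
                       if byte & 0xC0 != 0x80))

def utf8_index_table(buf, offsets, i):
    """Return code point i in O(1), given the precomputed offset table."""
    if not 0 <= i < len(offsets):
        raise IndexError(i)
    start = offsets[i]
    end = offsets[i + 1] if i + 1 < len(offsets) else len(buf)
    return buf[start:end].decode("utf-8")
```

For mostly-ASCII text the offset table is several times larger than the
string data it indexes, which is the space cost the earlier thread pointed
out.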

But of course, I'd be happy for a competing Python implementation to use
UTF-8 and prove me wrong!

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.