PEP 393 vs UTF-8 Everywhere
steve+python at pearwood.info
Sat Jan 21 08:56:03 EST 2017
On Sat, 21 Jan 2017 09:35 am, Pete Forman wrote:
> Can anyone point me at a rationale for PEP 393 being incorporated in
> Python 3.3 over using UTF-8 as an internal string representation?
I've read over the PEP, and the email discussion, and there is very little
mention of UTF-8, and as far as I can see no counter-proposal for using
UTF-8. However, there are a few mentions of UTF-8 that suggest that the
participants were aware of it as an alternative, and simply didn't think it
was worth considering. I don't know why.
You can read the PEP and the mailing list discussion here:
Mailing list discussion starts here:
Stefan Behnel (author of Cython) states that UTF-8 is much harder to use:
I see nobody challenging that claim, so perhaps there was simply enough
broad agreement that UTF-8 would have been more work and so nobody wanted
to propose it. I'm just guessing though.
Perhaps it would have been too big a change to adapt the CPython internals
to variable-width UTF-8 from the existing fixed-width UTF-16 and UTF-32
(I know that UTF-16 is actually variable-width, but Python prior to PEP 393
treated it as if it were fixed.)
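To illustrate the difference (a quick sketch of my own, not from the thread): a single code point takes one to four bytes in UTF-8, but always four bytes in UTF-32:

```python
# Sketch: UTF-8 is variable-width, UTF-32 is fixed-width (4 bytes each).
for ch in ["A", "é", "€", "𝄞"]:
    utf8 = ch.encode("utf-8")
    utf32 = ch.encode("utf-32-le")  # little-endian, no BOM
    print(ch, len(utf8), len(utf32))
# "A" takes 1 UTF-8 byte, "é" 2, "€" 3, "𝄞" 4; all take 4 UTF-32 bytes.
```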
There was a much earlier discussion about the internal implementation of
Unicode strings, including some discussion of UTF-8:
It too proposed using a three-way internal implementation, and made it clear
that O(1) indexing was a requirement.
Here's a comment explicitly pointing out that constant-time indexing is
wanted, and that using UTF-8 with a two-level table destroys any space
advantage UTF-8 might have:
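A rough sketch of what such an auxiliary table might look like (my own illustration, not code from the thread): you can get O(1) indexing into a UTF-8 buffer by precomputing the byte offset of every code point, but the table itself costs several bytes per code point, which is exactly the space UTF-8 was supposed to save:

```python
# Sketch: precompute byte offsets so indexing into UTF-8 is O(1).
# The table costs roughly 4-8 bytes per code point, eroding UTF-8's
# space advantage for anything but pure-ASCII text.
def build_offsets(data: bytes) -> list[int]:
    """Byte offset of each code point in a valid UTF-8 buffer."""
    offsets = []
    i = 0
    while i < len(data):
        offsets.append(i)
        b = data[i]
        # The lead byte tells us the sequence length.
        if b < 0x80:
            i += 1
        elif b < 0xE0:
            i += 2
        elif b < 0xF0:
            i += 3
        else:
            i += 4
    return offsets

data = "a€b".encode("utf-8")
offs = build_offsets(data)  # [0, 1, 4]
# O(1) indexing: code point 1 lives at bytes offs[1]:offs[2].
print(data[offs[1]:offs[2]].decode("utf-8"))  # €
```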
Ironically, Martin v. Löwis, the author of PEP 393, originally started off
opposing a three-way internal representation, calling it "terrible":
Another factor which I didn't see discussed anywhere is that Python strings
treat surrogates as normal code points. I believe that would be troublesome
for a UTF-8 implementation:

py> '\udc37'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc37' in
position 0: surrogates not allowed
but of course with a UCS-2 or UTF-32 implementation it is trivial: you just
treat the surrogate as another code point like any other.
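For example (my own illustration): a lone surrogate is a perfectly ordinary code point to Python's str type, even though the utf-8 codec refuses to encode it unless you opt in with the 'surrogatepass' error handler:

```python
# A lone surrogate is an ordinary code point as far as str is concerned...
s = "\udc37"
print(len(s), hex(ord(s)))  # 1 0xdc37

# ...but the utf-8 codec rejects it by default:
try:
    s.encode("utf-8")
except UnicodeEncodeError as e:
    print("refused:", e.reason)  # refused: surrogates not allowed

# The 'surrogatepass' error handler forces it through anyway:
print(s.encode("utf-8", "surrogatepass"))  # b'\xed\xb0\xb7'
```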
> ISTM that most operations on strings are via iterators and thus agnostic
> to variable or fixed width encodings.
Slicing is not.
start = text.find(":")
end = text.rfind("!")
assert end > start
chunk = text[start:end]
But even with iteration, we would still expect indexes to be consecutive:

for i, c in enumerate(text):
    assert c == text[i]
The complexity of those operations increases greatly with UTF-8. Of
course you can make it work, and you can even hide the fact that UTF-8 has
variable-width code points. But you can't have all three of:

- memory efficiency;
- O(1) operations;
- implementation simplicity.
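A sketch of the trade-off (my own illustration, assuming no auxiliary index): without a precomputed table, turning a code-point index into a byte offset in UTF-8 means scanning from the start, so text[i] becomes O(i) instead of O(1):

```python
# Sketch: without an index table, text[i] on a UTF-8 buffer is O(i):
# we must walk past i code points just to find the right byte offset.
def codepoint_at(data: bytes, index: int) -> str:
    """Return code point `index` of a valid UTF-8 buffer by linear scan."""
    def width(b: int) -> int:
        # Sequence length from the lead byte.
        return 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4

    pos = 0
    for _ in range(index):        # skip `index` code points
        pos += width(data[pos])
    return data[pos:pos + width(data[pos])].decode("utf-8")

data = "a€b𝄞".encode("utf-8")
print(codepoint_at(data, 3))  # 𝄞 -- found only after scanning the 3 before it
```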
But of course, I'd be happy for a competing Python implementation to use
UTF-8 and prove me wrong!
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.