Glyphs and graphemes [was Re: Cult-like behaviour]

Marko Rauhamaa marko at
Tue Jul 17 02:52:13 EDT 2018

INADA Naoki <songofacandy at>:

> On Tue, Jul 17, 2018 at 2:31 PM Marko Rauhamaa <marko at> wrote:
>> So I hope that by now you have understood my point and been able to
>> decide if you agree with it or not.
> I still don't understand what your original point is.
> I think UTF-8 vs UTF-32 is totally different from Python 2 vs 3.
> For example, strings in Rust and Swift (2010s languages!) are *valid*
> UTF-8. There is a strong separation between byte arrays and strings,
> even though they use UTF-8. They look similar to Python 3, not Python 2.

I won't comment on Rust and Swift because I don't know them.

> And Python can use UTF-8 for internal encoding in the future. AFAIK,
> PyPy is trying it now. After they succeed, I want to try porting it to
> CPython once we have removed the legacy Unicode APIs. (ref PEP 393)

How CPython3 implements str objects internally is not what I'm talking
about. It's the programmer's model in any compliant Python3
implementation.

Both Python2 and Python3 provide two forms of string, one containing
8-bit integers and another one containing 21-bit integers. Python3 made
the situation worse in a minor way and a major way. The minor way is the
uglification of the byte string notation. The major way is the wholesale
preference or mandating of Unicode strings in numerous standard-library
interfaces.
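
To make the two string forms concrete, here is a minimal Python3 sketch.
The sample value and the variable names are only illustrative
assumptions, not anything taken from the discussion above:

```python
import sys

data = b"caf\xc3\xa9"         # byte string: a sequence of 8-bit integers
text = data.decode("utf-8")   # Unicode string: a sequence of code points

byte_values = list(data)                # [99, 97, 102, 195, 169]
code_points = [ord(c) for c in text]    # [99, 97, 102, 233]

# The largest code point, 0x10FFFF, needs 21 bits.
max_bits = sys.maxunicode.bit_length()

# Python3 refuses to mix the two forms implicitly.
try:
    data + text
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```

The b"..." prefix on the first line is the byte string notation referred
to above.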

> So "UTF-8 is better than UTF-32" is a totally different problem from
> "Python 2 is better than 3".

Unix programming is smoothest when the programmer can operate on bytes.
Bytes are the mother tongue of Unix, and programming languages should
not try to present a different model to the programmer.
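
As a rough illustration of staying in bytes throughout, the sketch below
uses the bytes-accepting forms of a few os calls in Python3. The
temporary directory and the file name "data.bin" are assumptions made up
for the example:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    tmp_b = os.fsencode(tmp)                    # directory path as bytes
    path_b = os.path.join(tmp_b, b"data.bin")   # bytes in, bytes out
    with open(path_b, "wb") as f:
        f.write(b"\xff\xfe raw payload\n")      # not valid UTF-8, no problem
    names = os.listdir(tmp_b)                   # bytes argument, bytes names
    with open(path_b, "rb") as f:
        payload = f.read()
```

No decoding step occurs anywhere here, so the content never has to be
valid in any particular encoding.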

> Is your point "accepting invalid UTF-8 implicitly by default is better
> than explicit 'surrogateescape' error handler" like Go?
> (It's a 2010s language with UTF-8 based strings too, but it accepts
> invalid UTF-8).

I won't comment on Go, either.
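
For readers unfamiliar with the 'surrogateescape' error handler named in
the quoted question, this is a minimal sketch of how it behaves in
Python3. The sample bytes are an assumption chosen to contain one
invalid UTF-8 byte:

```python
raw = b"ok-\xff-name"   # 0xFF can never appear in valid UTF-8

# Strict decoding, the default, rejects the input.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With the handler, the bad byte is smuggled through as a lone surrogate.
text = raw.decode("utf-8", errors="surrogateescape")
escaped = text[3]       # U+DCFF stands in for byte 0xFF

# Encoding with the same handler restores the original bytes.
roundtrip = text.encode("utf-8", errors="surrogateescape")
```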

