Grapheme clusters, a.k.a. real characters
Marko Rauhamaa
marko at pacujo.net
Fri Jul 14 09:31:33 EDT 2017
Steve D'Aprano <steve+python at pearwood.info>:
> These are only a *few* of the *easy* questions that need to be
> answered before we can even consider your question:
>
>> So the question is, should we have a third type for text. Or should
>> the semantics of strings be changed to be based on characters?
Sure, but if they can't be answered, what good is there in having
strings (as opposed to bytes)? What problem do strings solve? What
operation depends on (or is made simpler by) having strings (instead of
bytes)?
We are not even talking about some exotic languages, but the problem is
right there in the middle of Latin-1. We can't even say what
    len("è")
should return. And we may experience:
    >>> ord("è")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: ord() expected a character, but string of length 2 found
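A minimal sketch of how that traceback can arise, assuming the "è" reached
the interpreter in decomposed form (an "e" followed by U+0300 COMBINING
GRAVE ACCENT) rather than as the single code point U+00E8. The variable
names are illustrative only; unicodedata.normalize() is from the standard
library.

```python
import unicodedata

# The same visible character in its two canonical forms.
composed = unicodedata.normalize("NFC", "e\u0300")   # one code point: U+00E8
decomposed = unicodedata.normalize("NFD", "\u00e8")  # two: "e" + U+0300

# len() counts code points, not what a reader perceives as characters.
print(len(composed), len(decomposed))

# The two strings render identically but do not compare equal.
print(composed == decomposed)

# ord() accepts the composed form...
print(hex(ord(composed)))

# ...and rejects the decomposed one with a TypeError.
try:
    ord(decomposed)
    ord_failed = False
except TypeError as exc:
    ord_failed = True
    print("TypeError:", exc)
```

Which of the two forms a given program sees depends on the input source
(keyboard, terminal, file), so the answer to len("è") is not fixed by the
language alone.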
Of course, UTF-8 in a bytes object doesn't make the situation any
better, but does it make it any worse?
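For comparison, a small sketch of what the same character looks like as
UTF-8 in a bytes object, under the same assumption that it may arrive either
composed (NFC) or decomposed (NFD). The names are illustrative only.

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "\u00e8")  # U+00E8
nfd = unicodedata.normalize("NFD", "\u00e8")  # "e" + U+0300

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

# Code-point length versus byte length for each form.
print(len(nfc), len(nfc_bytes), nfc_bytes)
print(len(nfd), len(nfd_bytes), nfd_bytes)
```

Neither byte length is 1, and the code-point length is 1 for only one of the
two forms, so neither representation counts user-perceived characters
directly.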
As it stands, we have
è --[encode>-- Unicode --[reencode>-- UTF-8
Why is one encoding format better than the other?
Marko