Grapheme clusters, a.k.a. real characters
Marko Rauhamaa
marko at pacujo.net
Tue Jul 18 14:31:21 EDT 2017
Chris Angelico <rosuav at gmail.com>:
> On Wed, Jul 19, 2017 at 3:01 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> Yes. Also, not every letter can be normalized to a single codepoint so
>> NFC is not a way out. For example,
>>
>> re.match("^[q̈]$", "q̈")
>>
>> returns None regardless of normalization.
>
> In what language or context would you actually want to do this?
I could have picked more realistic examples, such as Classical Greek or
Hebrew.
However, someone might actually use even "q̈" in a real setting. First of
all, it *is* a legal character. Secondly, people sometimes combine
characters in an ad-hoc fashion. Thirdly, remember the case of
Esperanto, which blessed the world with the letters
ĉ ĝ ĥ ĵ ŝ ŭ
Esperanto's venerable history eventually earned those letters code
points of their own in Unicode. However, around the year 2000, it was
still commonplace to use all sorts of tricks to type them on the
Internet:
ch gh hh jh sh u
^c ^g ^h ^j ^s ^u
cx gx hx jx sx ux
For all we know, someone somewhere might be cooking up a language that
depends on "q̈".
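To make the regular expression example quoted above concrete, here is a
minimal sketch that uses only the standard re and unicodedata modules.
The helper names class_matches and cluster_matches are made up for
illustration. The first builds a character class from the normalized
text and matches the same text against it; the second matches the text
as a whole sequence. The letter "ä" is included for contrast, since NFC
can compose it into a single code point:

```python
import re
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")
Q_DIAERESIS = "q\u0308"  # q + COMBINING DIAERESIS; no precomposed form exists
A_DIAERESIS = "a\u0308"  # a + COMBINING DIAERESIS; NFC composes it to U+00E4


def class_matches(text, form):
    """Normalize text, wrap it in [...] and match it against itself."""
    normalized = unicodedata.normalize(form, text)
    # Inside a class every code point is a separate alternative, so the
    # pattern can only ever match a string that is one code point long.
    return re.match("^[" + normalized + "]$", normalized) is not None


def cluster_matches(text):
    """Match the text as a whole sequence instead of as a class."""
    return re.match("^(?:" + re.escape(text) + ")$", text) is not None


for form in FORMS:
    print(form, class_matches(A_DIAERESIS, form), class_matches(Q_DIAERESIS, form))
print("alternation", cluster_matches(Q_DIAERESIS))
```

The "ä" case shows that normalization helps only when a precomposed code
point happens to exist. The alternation is just one possible workaround
for a known, fixed set of clusters; the re module has no general notion
of a grapheme cluster.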
Marko