Grapheme clusters, a.k.a. real characters
Chris Angelico
rosuav at gmail.com
Wed Jul 19 11:59:11 EDT 2017
On Thu, Jul 20, 2017 at 1:45 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
> So let's assume we will expand str to accommodate the requirements of
> grapheme clusters.
>
> All existing code would still produce only traditional strings. The only
> way to introduce the new "super code points" is by invoking the
> str.canonical() method:
>
> text = "hyvää yötä".canonical()
>
> In this case text would still be a fully traditional string because both
> ä and ö are represented by a single code point in NFC. However:
>
> >>> q = unicodedata.normalize("NFC", "aq̈u")
> >>> len(q)
> 4
> >>> text = q.canonical()
> >>> len(text)
> 3
> >>> text[0]
> "a"
> >>> text[1]
> "q̈"
> >>> text[2]
> "u"
> >>> q2 = unicodedata.normalize("NFC", text)
> >>> len(q2)
> 4
> >>> text.encode()
> b'aq\xcc\x88u'
> >>> q.encode()
> b'aq\xcc\x88u'
Ahh, I see what you're looking at. This is fundamentally very similar
to what was suggested a few hundred posts ago: a function in the
unicodedata module which yields a string's combined characters as
units. So you only see this when you actually want it, and the process
of creating it is a form of iterating over the string.
This could easily be done, as a class or function in unicodedata,
without any language-level support. It might even already exist on
PyPI.
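As a rough illustration of what such a helper could look like, here is a minimal sketch that groups each base character with the combining marks that follow it. This is an assumption of the shape such a function might take, not a proposal for the actual unicodedata API, and it is deliberately simplified: full grapheme-cluster segmentation per UAX #29 also has to handle ZWJ emoji sequences, Hangul jamo, regional-indicator pairs, and more.

```python
import unicodedata

def grapheme_clusters(s):
    """Yield combining-character sequences of s as single units.

    Simplified sketch: a cluster is a base character plus any
    combining marks (Unicode combining class != 0) that follow it.
    Real UAX #29 segmentation covers many more cases.
    """
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch) == 0:
            # A new base character starts the next cluster.
            yield cluster
            cluster = ch
        else:
            # First character, or a combining mark: extend the cluster.
            cluster += ch
    if cluster:
        yield cluster

# "aq\u0308u" is a + q + COMBINING DIAERESIS + u: four code points,
# but three user-perceived characters.
print(list(grapheme_clusters("aq\u0308u")))  # ['a', 'q̈', 'u']
```

With something like this, iterating "by real character" is just `for c in grapheme_clusters(s)`, with no new string type and no language-level change needed.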
ChrisA