Grapheme clusters, a.k.a. real characters
Steve D'Aprano
steve+python at pearwood.info
Tue Jul 18 22:07:56 EDT 2017
On Wed, 19 Jul 2017 12:09 am, Random832 wrote:
> On Fri, Jul 14, 2017, at 08:33, Chris Angelico wrote:
>> What do you mean about regular expressions? You can use REs with
>> normalized strings. And if you have any valid definition of "real
>> character", you can use it equally on an NFC-normalized or
>> NFD-normalized string than any other. They're just strings, you know.
>
> I don't understand how normalization is supposed to help with this. It's
> not like there aren't valid combinations that do not have a
> corresponding single NFC codepoint (to say nothing of the situation with
> e.g. Indic languages).
Normalisation helps. Suppose you want to search for é, for example. A naive
regular expression engine will only find the exact representation you or your
editor happened to use:
U+00E9 LATIN SMALL LETTER E WITH ACUTE
or
U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT
but not both. By normalising, you ensure that both the text you are searching
and the regex you are searching for are in the same state: either composed to
the single code point U+00E9 or decomposed to the two code points U+0065 U+0301,
but never one in one state and the other in the other.
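As a rough sketch of what I mean, using only the standard library: the helper
name nfc_search is my own invention for illustration, and it assumes the
pattern is plain literal text (normalising a pattern that contains regex
syntax such as character classes needs more care than this).

```python
import re
import unicodedata

composed = "caf\u00e9"      # é as the single code point U+00E9
decomposed = "cafe\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

# Without normalisation, the composed pattern fails to find the decomposed text.
naive_hit = re.search(composed, decomposed) is not None


def nfc_search(pattern, text):
    """Hypothetical helper: put pattern and text into NFC, then search."""
    pattern = unicodedata.normalize("NFC", pattern)
    text = unicodedata.normalize("NFC", text)
    return re.search(pattern, text)


normalised_hit = nfc_search(composed, decomposed) is not None
print(naive_hit, normalised_hit)
```

Using "NFD" in both calls instead should work equally well; what matters is
that the pattern and the text end up in the same form.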
For characters that have no canonical composed form, there's no
problem: you will always be searching for a decomposed character using a base
character followed by combining characters, so there is no discrepancy and it
will just work.
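A small illustration, assuming (as far as I know) that Unicode has no
precomposed "q with acute", so the sequence q + U+0301 stays as two code
points whichever normalisation form you ask for:

```python
import unicodedata

# q followed by U+0301 COMBINING ACUTE ACCENT; assumed to have no
# single-code-point composed equivalent.
s = "q\u0301"

nfc = unicodedata.normalize("NFC", s)
nfd = unicodedata.normalize("NFD", s)

# Both forms leave the sequence alone, so a search for it behaves the same
# regardless of which form the text was normalised to.
print(len(nfc), len(nfd), nfc == nfd)
```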
> In principle probably a viable solution for regex would be to add
> character classes for base and combining characters, and then
> "[[:base:]][[:combining:]]*" can be used as a building block if
> necessary.
I don't know what that means.
Any code point (except for combining characters themselves) can be used as the
base, and the various kinds of combining characters fall into one of three
Unicode general categories:
Mn (Mark, nonspacing)
Mc (Mark, spacing combining)
Me (Mark, enclosing)
If we're talking about combining accents and diacritics, the one we want is Mn.
But generally, we're not after "any old diacritic", we're after a specific one,
on a specific base.
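If you want to check which category a given mark falls in, or to group a base
with the marks that follow it, something like the following sketch might do.
The function name base_plus_marks is made up for this example, and this is
only the "base followed by combining marks" idea, not the full grapheme
cluster rules from the Unicode standard.

```python
import unicodedata


def base_plus_marks(text):
    """Group each code point with any combining marks (Mn, Mc, Me) after it."""
    clusters = []
    for ch in text:
        # All three mark categories start with "M".
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            # A base, or a stray mark at the very start of the text.
            clusters.append(ch)
    return clusters


acute_category = unicodedata.category("\u0301")   # COMBINING ACUTE ACCENT
clusters = base_plus_marks("cafe\u0301")
print(acute_category, len(clusters))
```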
--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.