Grapheme clusters, a.k.a. real characters
Marko Rauhamaa
marko at pacujo.net
Tue Jul 18 14:56:06 EDT 2017
Chris Angelico <rosuav at gmail.com>:
> On Wed, Jul 19, 2017 at 4:31 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> Chris Angelico <rosuav at gmail.com>:
>>
>>> On Wed, Jul 19, 2017 at 3:01 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
>>>> Yes. Also, not every letter can be normalized to a single codepoint so
>>>> NFC is not a way out. For example,
>>>>
>>>> re.match("^[q̈]$", "q̈")
>>>>
>>>> returns None regardless of normalization.
> [...]
>
> What I *think* you're asking for is for square brackets in a regex to
> count combining characters with their preceding base character.

Yes. My example tries to match a single character against a single
character.
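
To make that concrete, here is a minimal sketch, assuming Python 3 and only the standard re and unicodedata modules. The helper name class_match is made up for illustration. It shows that "q" followed by U+0308 stays two code points under every normalization form, so a bracket expression, which consumes exactly one code point, can never match it:

```python
import re
import unicodedata

Q_UMLAUT = "q\u0308"  # q + COMBINING DIAERESIS; Unicode has no precomposed form
A_UMLAUT = "a\u0308"  # a + COMBINING DIAERESIS; NFC composes this to U+00E4


def class_match(text, form):
    """Normalize text, then match it against a character class built from itself.

    Hypothetical helper, for illustration only.
    """
    normalized = unicodedata.normalize(form, text)
    # The class lists the code points of the normalized text, but a
    # character class still matches only ONE code point of the subject.
    return re.match("^[" + normalized + "]$", normalized)


FORMS = ("NFC", "NFD", "NFKC", "NFKD")
q_results = {form: class_match(Q_UMLAUT, form) for form in FORMS}

for form, result in q_results.items():
    print(form, result)
```

By contrast, class_match(A_UMLAUT, "NFC") does match, because NFC folds that pair into the single code point U+00E4. Normalization only helps for letters that happen to have a precomposed form.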

> That would make a lot of sense, and would actually be a reasonable
> feature to request. (Probably as an option, in case there's a backward
> compatibility issue.)

There's the flag re.IGNORECASE. In the same vein, it might be useful to
have a flag re.IGNOREDIACRITICS (which does not exist today), under which

    re.match("^[abc]$", "ä", re.IGNOREDIACRITICS)

would match regardless of normalization.
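
Until something like that exists, the effect can be approximated outside the regex engine. The following is only a rough sketch under stated assumptions: the function names strip_diacritics and match_ignoring_diacritics are invented here, and "diacritic" is taken to mean any code point with a nonzero canonical combining class after NFD decomposition.

```python
import re
import unicodedata


def strip_diacritics(text):
    """Decompose text (NFD) and drop code points with a nonzero combining class.

    Hypothetical helper; this is an approximation, not a definition of
    what a diacritic is.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_ignoring_diacritics(pattern, text, flags=0):
    """Emulate the proposed flag by stripping marks from pattern and subject."""
    return re.match(strip_diacritics(pattern), strip_diacritics(text), flags)


# Precomposed and decomposed spellings of a-umlaut behave the same way.
print(match_ignoring_diacritics("^[abc]$", "\u00e4"))
print(match_ignoring_diacritics("^[abc]$", "a\u0308"))
```

Note that the returned match object refers to the stripped string, so its offsets do not line up with the original text. Letters such as "ø" that have no canonical decomposition pass through unchanged.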

Marko