Grapheme clusters, a.k.a. real characters
Marko Rauhamaa
marko at pacujo.net
Wed Jul 19 05:53:14 EDT 2017
Chris Angelico <rosuav at gmail.com>:
> To be quite honest, I wouldn't care about that possibility. If I could
> design regex semantics purely from an idealistic POV, I would say that
> [xyzã], regardless of its encoding, will match any of the four
> characters "x", "y", "z", "ã".
>
> Earlier I posted a suggestion that a folding function be used when
> searching (for instance, it can case fold, NFKC normalize, etc).
> Unfortunately, this makes positional matching extremely tricky; if
> normalization changes the number of code points in the string, you
> have some fiddly work to do to try to find back the match location in
> the original (pre-folding) string. That technique works well for
> simple lookups (eg "find me all documents whose titles contain <this
> string>"), but a regex does more than that. As such, I am in favour of
> the regex engine defining a "character" as a base with all subsequent
> combining, so a single dot will match the entire combined character,
> and square bracketed expressions have the same meaning whether you're
> NFC or NFD normalized, or not normalized. However, that's the ideal
> situation, and I'm not sure (a) whether it's even practical to do
> that, and (b) how bad it would be in terms of backward compatibility.
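To make the quoted point concrete, here is a small sketch (standard library only) of how the same visible character "ã" behaves differently under NFC and NFD. The variable names are illustrative only.

```python
import re
import unicodedata

# "ã" written as a base letter plus COMBINING TILDE (two code points).
decomposed = "a\u0303"

nfc = unicodedata.normalize("NFC", decomposed)  # single code point U+00E3
nfd = unicodedata.normalize("NFD", nfc)         # back to two code points

# Normalization changes the number of code points, so match offsets
# found in one form do not map directly onto the other.
print(len(nfc), len(nfd))

# A character class written with the precomposed "ã" only matches NFC text.
pattern = re.compile("[xyz\u00e3]")
print(pattern.search(nfc) is not None)
print(pattern.search(nfd) is not None)

# A single dot matches one code point, not the whole combined character.
print(re.fullmatch(".", nfd) is not None)
```

With today's re module, the character class and the dot both operate on code points, which is why the NFD form slips through.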
Here's a proposal:
* introduce a builtin (predefined) class Text
* conceptually, a Text object is a sequence of "real" characters
* you can access each "real" character by its position in O(1)
* the "real" character is defined to be an integer computed as follows
(in pseudo-Python):
    string = the NFC normal form of the real character as a string
    rc = 0
    shift = 0
    for codepoint in string:
        rc |= ord(codepoint) << shift
        shift += 6
    return rc
* t[n] evaluates to an integer
* the Text constructor takes a string or an integer
* str(Text) evaluates to the NFC encoding of the Text object
* Text.encode(...) works like str(Text).encode(...)
* regular expressions work with Text objects
* file system functions work with Text objects
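A rough, runnable sketch of the Text idea follows. It is not a definitive implementation and rests on several assumptions. First, it packs each code point into 21 bits rather than 6: code points go up to U+10FFFF, so with a shift of 6 the values overlap (for example "q" plus U+0303 would yield 0xC0F1, which is already a valid code point). Second, the helper names real_char and clusters are hypothetical. Third, clusters only approximates a "real" character as a base followed by code points with a nonzero canonical combining class; it is not full Unicode grapheme segmentation.

```python
import unicodedata

BITS = 21                 # assumption: enough bits for any code point
MASK = (1 << BITS) - 1


def real_char(cluster):
    """Pack the NFC form of one cluster into a single integer."""
    rc = 0
    shift = 0
    for codepoint in unicodedata.normalize("NFC", cluster):
        rc |= ord(codepoint) << shift
        shift += BITS
    return rc


def clusters(s):
    """Approximate split: a base plus any following combining marks."""
    out = []
    for ch in unicodedata.normalize("NFC", s):
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out


class Text:
    """Sequence of "real" characters, each stored as an integer."""

    def __init__(self, value):
        if isinstance(value, int):
            self._chars = [value]
        else:
            self._chars = [real_char(c) for c in clusters(value)]

    def __len__(self):
        return len(self._chars)

    def __getitem__(self, n):
        # O(1) positional access; evaluates to an integer.
        return self._chars[n]

    def __str__(self):
        # Unpack every integer back into its NFC code points.
        parts = []
        for rc in self._chars:
            while True:
                parts.append(chr(rc & MASK))
                rc >>= BITS
                if not rc:
                    break
        return "".join(parts)

    def encode(self, *args, **kwargs):
        return str(self).encode(*args, **kwargs)


t = Text("xa\u0303y")     # "x", "a" + combining tilde, "y"
u = Text("q\u0303")       # no precomposed form exists for this one
```

Here len(t) counts three "real" characters even though the input had four code points, and u[0] is an integer above 0x10FFFF, so it cannot be confused with a plain code point.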
Instead of introducing Text, all of this could also be done within the
str class itself:
* conceptually, an str object is a sequence of integers representing
Unicode code points *or* "real" characters
* ord(s) returns the code point or the integer (rc) from the
algorithm above
* chr(n) takes a valid code point or an rc value as defined above
* s.canonical() returns a string that has merged all multi-code-point
characters into single "real" characters
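The str-based variant can only be approximated from outside the interpreter, since a real str cannot hold values above U+10FFFF. The sketch below therefore uses free functions with hypothetical names (real_ord, real_chr, canonical), has canonical return a list of integers instead of a string, and assumes 21 bits per code point so that packed values never collide with ordinary code points. Cluster detection is again a rough base-plus-combining-marks approximation.

```python
import unicodedata

BITS = 21                 # assumption: one 21-bit slot per code point
MASK = (1 << BITS) - 1


def real_ord(ch):
    """Stand-in for the extended ord(): one code point or one cluster."""
    rc = 0
    for i, cp in enumerate(unicodedata.normalize("NFC", ch)):
        rc |= ord(cp) << (i * BITS)
    return rc


def real_chr(n):
    """Stand-in for the extended chr(): inverse of real_ord()."""
    if n == 0:
        return "\x00"
    out = []
    while n:
        out.append(chr(n & MASK))
        n >>= BITS
    return "".join(out)


def canonical(s):
    """Merge multi-code-point characters into single integers."""
    values = []
    cluster = ""
    for cp in unicodedata.normalize("NFC", s):
        if cluster and unicodedata.combining(cp):
            cluster += cp
        else:
            if cluster:
                values.append(real_ord(cluster))
            cluster = cp
    if cluster:
        values.append(real_ord(cluster))
    return values


merged = canonical("xq\u0303")
```

For single code points real_ord agrees with the builtin ord, so existing code that never sees a multi-code-point character would observe no difference.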
Each approach has its upsides and downsides.
Marko