[Python-Dev] PEP 393 Summer of Code Project

Hagen Fürstenau hagen at zhuliguan.net
Thu Sep 1 17:30:10 CEST 2011


> Ok, I thought there was also a form normalized (denormalized?) to
> decomposed form. But I'll take your word.

If I understood the example correctly, he needs a mixed form, with some
characters decomposed and some composed (depending on which one looks
better in the given font). I agree that this sounds more like a font
problem, but it's a widespread font problem and it may be necessary to
address it in an application.

But this is only one example of why an application-specific concept of
graphemes, different from the Unicode-defined normalized forms, can be
useful. I think the very concept of a grapheme is context-, language-,
and culture-specific. For example, in Chinese Pinyin it would be very
natural to write tone marks with combining diacritics (i.e. in
decomposed form). But then you have the vowel "ü", and it would be
strange to decompose it into a "u" and a combining diaeresis. So
conceptually the most sensible representation of "lǜ" would be neither
the composed nor the decomposed normal form, and depending on its needs
an application might want to represent it in the mixed form (composing
the diaeresis with the "u", but leaving the grave accent separate).
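To make this concrete, here is a small sketch using the stdlib's unicodedata module: the mixed form of "lǜ" described above is canonically equivalent to both normal forms, yet is itself neither NFC nor NFD.

```python
import unicodedata

s = "l\u01DC"  # "lǜ": 'l' + U+01DC (u with diaeresis and grave)

nfc = unicodedata.normalize("NFC", s)
nfd = unicodedata.normalize("NFD", s)

# Fully composed form: two code points.
assert [hex(ord(c)) for c in nfc] == ["0x6c", "0x1dc"]
# Fully decomposed form: 'l' + 'u' + combining diaeresis + combining grave.
assert [hex(ord(c)) for c in nfd] == ["0x6c", "0x75", "0x308", "0x300"]

# The mixed form: "ü" (U+00FC) kept composed, the grave left combining.
mixed = "l\u00FC\u0300"
# Canonically equivalent to the normal forms...
assert unicodedata.normalize("NFC", mixed) == nfc
assert unicodedata.normalize("NFD", mixed) == nfd
# ...but identical to neither of them as a code-point sequence.
assert mixed not in (nfc, nfd)
```

So a program that wants exactly this mixed representation cannot get it from `unicodedata.normalize` alone; it has to construct it itself.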

There must be many more examples where the conceptual context determines
the right composition, like "ñ", which in Spanish is certainly a
grapheme, but in mathematics might be better represented as "n" plus a
combining tilde. The
bottom line is that, while an array of Unicode code points is certainly
a generally useful data type (and PEP 393 is a great improvement in this
regard), an array of graphemes carries many subtleties and may not be
nearly as universal. Support in the spirit of unicodedata's
normalization function etc. is certainly a good thing, but we shouldn't
assume that everyone will want Python to do their graphemes for them.
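For the "ñ" case, a short illustration of what the stdlib already offers: the two representations are distinct code-point sequences that normalization converts between, but nothing in `unicodedata` decides which one a given application should treat as "the" grapheme.

```python
import unicodedata

composed = "\u00F1"     # "ñ" as a single code point (the NFC form)
decomposed = "n\u0303"  # "n" + combining tilde (the NFD form)

# Canonically equivalent, but different code-point sequences:
assert composed != decomposed
assert unicodedata.normalize("NFD", composed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == composed

# unicodedata.combining() gives the canonical combining class, which an
# application could use to group a base character with its marks itself.
assert unicodedata.combining("\u0303") > 0  # combining mark
assert unicodedata.combining("n") == 0      # base character
```

Which of the two sequences counts as one grapheme for display, cursor movement, or searching remains the application's decision, which is the point above.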

- Hagen
