[Python-Dev] Re: String module

Martin v. Loewis martin@v.loewis.de
30 May 2002 08:43:48 +0200


Guido van Rossum <guido@python.org> writes:

> > This reminds me that I often miss, in the standard `ctype.h' and related,
> > a function that would un-combine a character into its base character and
> > its diacritic, and the complementary re-combining function.
[...]
> I bet the Unicode standard has a standard way to do this.  

This is called 'Unicode normalization forms'. Each "pre-combined"
character can also be represented as a base character followed by a
"combining diacritic". There are two symmetric normalization forms: NFC
favours pre-combined characters, NFD favours combining characters.
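
As a rough illustration of the un-combine/re-combine round trip, here
is a minimal sketch. It assumes a Python whose unicodedata module
provides normalize(); the variable names are mine and purely
illustrative.

```python
import unicodedata

# LATIN SMALL LETTER E WITH ACUTE, a "pre-combined" character
precombined = "\u00e9"

# NFD splits it into a base character plus a combining diacritic
decomposed = unicodedata.normalize("NFD", precombined)
base, diacritic = decomposed

# NFC puts the two code points back together
recombined = unicodedata.normalize("NFC", decomposed)

print(len(precombined), len(decomposed))   # one code point vs. two
print(unicodedata.name(base))
print(unicodedata.name(diacritic))
print(recombined == precombined)
```

The combining mark has a non-zero canonical combining class
(unicodedata.combining), which is how it can be told apart from a
base character.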

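There is also a "compatibility decomposition" (K), where e.g. LATIN
SMALL LIGATURE FI decomposes to the letters "f" and "i". (ANGSTROM
SIGN, by contrast, already maps to LATIN CAPITAL LETTER A WITH RING
ABOVE under the canonical forms.)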
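
To see what the K forms add on top of the canonical ones, a small
sketch along these lines may help. It again assumes
unicodedata.normalize() is available; the sample characters are my
own choice.

```python
import unicodedata

ligature = "\ufb01"   # LATIN SMALL LIGATURE FI
angstrom = "\u212b"   # ANGSTROM SIGN

# Canonical decomposition leaves the ligature alone ...
canonical = unicodedata.normalize("NFD", ligature)
# ... while compatibility decomposition splits it into two letters.
compat = unicodedata.normalize("NFKD", ligature)

# ANGSTROM SIGN ends up as LATIN CAPITAL LETTER A WITH RING ABOVE
# under NFKC; the same mapping is also applied by plain NFC.
angstrom_nfkc = unicodedata.normalize("NFKC", angstrom)
angstrom_nfc = unicodedata.normalize("NFC", angstrom)

print(repr(canonical), repr(compat))
print(unicodedata.name(angstrom_nfkc))
```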

> Maybe we can implement that, and then project the same interface on
> 8-bit characters?

Not really. Needing to know the character set is one issue; the other
issue is that the stand-alone diacritic characters in ASCII are *not*
combining. We could certainly provide a mapping between the Unicode
combining diacritics and the stand-alone diacritics, say as a codec,
but that would be quite special-purpose.
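
A sketch of what such a mapping could look like, written as a plain
function rather than a codec. The table SPACING and the function
uncombine() are hypothetical names of mine, the table covers only
three marks, and unicodedata.normalize() is assumed to be available.

```python
import unicodedata

# Hypothetical table: combining mark -> stand-alone ASCII counterpart.
SPACING = {
    "\u0300": "`",   # COMBINING GRAVE ACCENT -> GRAVE ACCENT
    "\u0302": "^",   # COMBINING CIRCUMFLEX ACCENT -> CIRCUMFLEX ACCENT
    "\u0303": "~",   # COMBINING TILDE -> TILDE
}

def uncombine(char):
    """Split one character into (base, stand-alone diacritics)."""
    decomposed = unicodedata.normalize("NFD", char)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Marks without an entry in the table are passed through unchanged.
    marks = "".join(SPACING.get(c, c)
                    for c in decomposed if unicodedata.combining(c))
    return base, marks

print(uncombine("\u00ea"))   # LATIN SMALL LETTER E WITH CIRCUMFLEX
print(uncombine("\u00f1"))   # LATIN SMALL LETTER N WITH TILDE

# The ASCII circumflex is not a combining character; U+0302 is.
print(unicodedata.combining("^"), unicodedata.combining("\u0302"))
```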

Providing a good normalization library is necessary, though, since
many other algorithms (both from W3C and IETF) require Unicode
normalization as part of the processing (usually to NFKC).
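
For example, comparing two strings after NFKC normalization might
look like the sketch below. It is not an implementation of any
particular W3C or IETF algorithm; equal_nfkc() is a hypothetical
helper, and unicodedata.normalize() is assumed to be available.

```python
import unicodedata

def equal_nfkc(a, b):
    """Compare two strings after normalizing both to NFKC."""
    return (unicodedata.normalize("NFKC", a)
            == unicodedata.normalize("NFKC", b))

# ANGSTROM SIGN + pre-combined o-diaeresis vs. fully decomposed text
print(equal_nfkc("\u212bngstr\u00f6m", "A\u030angstro\u0308m"))
# Ligature vs. plain letters
print(equal_nfkc("\ufb01le", "file"))
```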

Regards,
Martin