[Python-Dev] PEP 277 (unicode filenames): please review

13 Aug 2002 22:02:07 +0200

Guido van Rossum <guido@python.org> writes:

> > It could be that Apple is decomposing the filenames before comparing
> > them. Either way works.
> 
> Hm, that sucks (either way) -- because you get unnormalized Unicode
> out of directory listings, which is harder to turn into local
> encodings.

Notice that, most likely, Apple *does* normalize them - they just use
Normalization Form D (NFD), which favours decomposition over
precomposed characters - this is what Apple apparently calls
"canonical".

That choice is not surprising - NFD is "more logical", as precomposed
characters are available only arbitrarily (e.g. the WITH TILDE
combinations exist for a, i, e, n, o, u, v, y, but not for, say, x).
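
To illustrate the point about precomposed characters, here is a minimal
sketch. It assumes a Python that ships unicodedata.normalize (the
standard library did not have it when this was written); the helper
name compose and the dictionary has_precomposed are made up for the
example.

```python
import unicodedata

COMBINING_TILDE = "\u0303"  # U+0303 COMBINING TILDE

def compose(base):
    """Return NFC of base + combining tilde (hypothetical helper)."""
    return unicodedata.normalize("NFC", base + COMBINING_TILDE)

# A base letter has a precomposed WITH TILDE form exactly when NFC
# collapses the two code points into a single one.
has_precomposed = {c: len(compose(c)) == 1 for c in "aienouvyx"}

# NFD goes the other way: the precomposed letter is split again.
decomposed = unicodedata.normalize("NFD", "\u00f1")  # n WITH TILDE
```

With this, has_precomposed is true for every letter listed above and
false for "x", and decomposed is the two-code-point sequence "n"
followed by U+0303.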

The Unicode FAQ
(http://www.unicode.org/unicode/faq/normalization.html) says

Q: Which forms of normalization should I support?

A: The choice of which to use depends on the particular program or
system.  The most commonly supported form is NFC, since it is more
compatible with strings converted from legacy encodings. This is also
the choice for the web, as per the recommendations in "Character Model
for the World Wide Web" from the W3C. The other normalization forms
are useful for other domains.

So I guess Python should at least provide NFC - precisely because of
the legacy encodings.
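
As a rough sketch of why NFC matters for legacy encodings, again
assuming a Python with unicodedata.normalize: a decomposed filename
cannot be encoded to Latin-1 directly, but its NFC form can. The
function to_legacy and the sample filename are hypothetical, and
Latin-1 is just one example of a legacy encoding.

```python
import unicodedata

def to_legacy(name, encoding="latin-1"):
    """Compose to NFC first, then encode (hypothetical helper)."""
    return unicodedata.normalize("NFC", name).encode(encoding)

# A filename as a decomposing file system might return it:
# "n" followed by U+0303 COMBINING TILDE.
decomposed_name = "man\u0303ana.txt"

# Encoding the decomposed form directly fails, because U+0303 has
# no Latin-1 equivalent.
try:
    decomposed_name.encode("latin-1")
    direct_ok = True
except UnicodeEncodeError:
    direct_ok = False

# After NFC the tilde is part of U+00F1, which Latin-1 does have.
legacy_bytes = to_legacy(decomposed_name)
```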

Regards,
Martin