[Python-Dev] PEP 277 (unicode filenames): please review

14 Aug 2002 08:33:13 +0200

Jack Jansen <Jack.Jansen@oratrix.com> writes:

> If I understand the unicode standard (according to unicode.org)
> correctly this means that MacOS stores filenames in NFD normalized
> form, with all combining characters split out, and this is the
> preferred normalized form. Am I correct here?

You are correct that this is likely the form that OS X uses on disk
and in its APIs. It is not really the preferred form, though: the W3C
favours and advocates NFC, precisely because it is easier to transform
into legacy encodings (as you just observed).

> But, even if NFC is the preferred normalized form (the documents I saw
> hinted that this may have been the case in previous Unicode
> standards:-): both NFC and NFD renditions of this string are legal
> unicode, aren't they? And if they are then both should be converted to
> the same latin-1 string, shouldn't they?

Yes, and yes.

> Do I misunderstand something, or is this a bug (limitation?) in the
> unicode->latin-1 decoder?

It's a limitation, in all codecs: none of them normalizes its input
before encoding. Contributions of normalization code are welcome. Since
this is hard work, it is unlikely to be fixed in Python 2.3, unless
somebody has a really good incentive for fixing it.
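
To illustrate the limitation, here is a minimal sketch. It assumes a
Python whose unicodedata module provides normalize(); the helper name
encode_latin1_normalized is hypothetical and not part of any codec. It
composes the string to NFC by hand before encoding, which is the step
the codecs do not perform themselves:

```python
import unicodedata


def encode_latin1_normalized(text):
    """Hypothetical helper: compose to NFC, then encode to Latin-1."""
    return unicodedata.normalize("NFC", text).encode("latin-1")


nfc = "\u00e9"    # e-acute as a single precomposed code point
nfd = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Both renditions are canonically equivalent.
assert unicodedata.normalize("NFD", nfc) == nfd

# Encoding the decomposed form directly fails, because the combining
# accent (U+0301) has no Latin-1 code of its own.
try:
    nfd.encode("latin-1")
except UnicodeEncodeError:
    nfd_encodes_directly = False
else:
    nfd_encodes_directly = True

# After explicit normalization both forms give the same Latin-1 bytes.
same_bytes = encode_latin1_normalized(nfc) == encode_latin1_normalized(nfd)
```

This only covers characters that have a precomposed form in the target
encoding; anything else still raises an encoding error.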

Regards,
Martin