[Python-Dev] PEP 277 (unicode filenames): please review

Martin v. Loewis martin@v.loewis.de
14 Aug 2002 08:23:42 +0200


Skip Montanaro <skip@pobox.com> writes:

> What's the current behavior?  If my program receives an input in utf-8
> (let's say it comes from a form on a website), what form will it be in, or
> can't I tell?  

In general, you cannot tell in advance - it will depend on the data
source.

W3C advocates "early normalization" towards "NFC", meaning that on the
Internet, you should always see NFC data - unless you are the primary
data source, e.g. by reading from a terminal, or after decoding some
legacy encoding. Most Python codecs already produce NFC, so
normalization to NFC would be required only for user input, and - as
it turns out - when reading file names on OS X.
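
As a rough sketch of what early normalization at the input boundary
could look like: the snippet below assumes a Python 3 interpreter whose
unicodedata module provides normalize(), which is not something Python
applies on its own. The helper name to_nfc and the sample strings are
made up for illustration.

```python
import unicodedata

def to_nfc(text):
    # Hypothetical helper: bring text from a primary source into NFC.
    return unicodedata.normalize("NFC", text)

# Decoding a legacy encoding such as Latin-1 yields the precomposed
# character U+00E9, which is already in NFC.
from_codec = b"caf\xe9".decode("latin-1")

# User input or an OS X file name may arrive decomposed instead:
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "cafe\u0301"

print(to_nfc(decomposed) == from_codec)
```

Normalizing once, where the data enters the program, means the rest of
the code can compare strings directly.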

> Is it possible I will get spurious inequalities today if I compare
> two different unicode objects which were created from different
> sources and in different normal forms?

If they are in different normal forms, you *will* reliably get
inequalities. In the real world, those inequalities will be spurious,
since both objects denote the same text.
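
To make the effect concrete, here is a small illustration. It again
assumes Python 3 and the availability of unicodedata.normalize(); the
variable names are arbitrary.

```python
import unicodedata

# The same character, "e with acute accent", in two normal forms.
nfc = unicodedata.normalize("NFC", "\u00e9")  # one code point: U+00E9
nfd = unicodedata.normalize("NFD", "\u00e9")  # two: U+0065, U+0301

# Comparison is done code point by code point, so the two differ.
differ_raw = (nfc != nfd)

# Once both sides are in the same normal form, they compare equal.
same_after = (unicodedata.normalize("NFC", nfd) == nfc)

print(differ_raw, same_after)
```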

> What about a string and a unicode object?  Where can I read all
> about it (Python and unicode normalization)?

Python does no normalization, so there is nothing to read. For
Unicode, you may want to start with the Normalization FAQ:

http://www.unicode.org/unicode/faq/normalization.html

Regards,
Martin