[issue10952] Don't normalize module names to NFKC?

STINNER Victor report at bugs.python.org
Thu Jan 20 12:07:24 CET 2011


STINNER Victor <victor.stinner at haypocalc.com> added the comment:

> b) what if the file system implementation mangles file names.
> 
> I'd use the same approach as with case-insensitive lookups: verify
> that the file we read is really the one we want.

Only Mac OS X with the HFS+ filesystem normalizes filenames (to a variant
of NFD). But such normalization is a good thing! I don't think that we
have to do anything about it.

---
The user creates a café.py file, with the name typed on the keyboard in
NFD: cafe\u0301 (this is very unlikely, all operating systems prefer NFC
for the keyboard, but it's just to give an example). Mac OS X normalizes
the filename to NFD: cafe\u0301.py is created in the filesystem.

Then (s)he tries to import the café module: (s)he writes "import café"
with his/her NFD keyboard. Python normalizes café to NFKC (caf\xe9) and
then tries to read caf\xe9.py. Mac OS X normalizes the filename to NFD
(cafe\u0301.py), and this file exists, so it works as expected.
---
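
The two normalization steps in this scenario can be reproduced with the
unicodedata module. This is only a sketch: plain NFD stands in for what
HFS+ does (it really uses a variant of NFD), and the variable names are
made up for the illustration.

```python
import unicodedata

# Name as typed on the hypothetical NFD keyboard: 'e' + COMBINING ACUTE ACCENT
typed_nfd = "cafe\u0301"

# The parser normalizes identifiers (and so module names) to NFKC
module_name = unicodedata.normalize("NFKC", typed_nfd)

# Assumption: plain NFD approximates the normalization done by HFS+
on_disk = unicodedata.normalize("NFD", module_name + ".py")

print(ascii(module_name))  # 'caf\xe9'
print(ascii(on_disk))      # 'cafe\u0301.py'
```

The name looked up on disk ends up in the same form as the name the file
was created with, which is why the import succeeds in this case.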

I suppose that any filesystem normalization is good, because it avoids
surprising behaviours (eg. having two files cafe\u0301 and caf\xe9 with
names rendered exactly the same on screen). We should maybe patch
Windows, Mac OS, Linux & co to normalize to NFKC :-)

> a) how can users make sure that they name the files correctly?
>
>  For a), wrt. "I'm not able to write U+03BC with my keyboard", I say
> "tough luck - don't use that character in a module name, then".
> Somebody with a Greek keyboard will have no problems doing that. 

Even if I try to agree with "don't use that character in a module name",
it can be surprising for an English speaker who would like to use a
µTorrent (U+00B5) module name in his/her project. (S)he can create
µTorrent.py with his/her non-Greek keyboard (\xb5Torrent.py), but then
import µTorrent (import \xb5Torrent) fails with "ImportError: No module
named µTorrent". The error message is actually "ImportError: No module
named \u03BCTorrent": the identifier is normalized, but remember that µ
(U+00B5) and μ (U+03BC) are rendered exactly the same by most fonts.
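
The mismatch between the typed name and the normalized name is easy to
check with unicodedata (a small sketch, the variable names are only for
illustration):

```python
import unicodedata

# MICRO SIGN (U+00B5), as typed on a non-Greek keyboard
micro = "\xb5Torrent"

# NFKC maps MICRO SIGN to GREEK SMALL LETTER MU (U+03BC)
normalized = unicodedata.normalize("NFKC", micro)

print(ascii(micro))                      # '\xb5Torrent'
print(ascii(normalized))                 # '\u03bcTorrent'
print(normalized == micro)               # False
print(unicodedata.name(normalized[0]))   # GREEK SMALL LETTER MU
```

So the file created as \xb5Torrent.py has a different name than the one
Python looks for, unless the filesystem normalizes both to the same form.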

We should at least document this surprising behaviour in the import
documentation. Something like:

<< WARNING: Non-ASCII characters in module names are normalized to NFKC
by the Python parser ([PEP 3131]). For example, import µTorrent (µ:
U+00B5) is normalized to import μTorrent (μ: U+03BC): Python will try to
open "\u03BCTorrent.py" (or "\u03BCTorrent/__init__.py"), and not
"\xB5Torrent.py" (or "\xB5Torrent/__init__.py"). >>

> This is really the same as any other non-ASCII character which you are
> unable to type: it just means that you can't conveniently enter the
> respective Python identifier. Just try importing "саша", for example.
> Get a different keyboard.

I disagree. For identifiers in the source code, it works (transparently)
as expected.

A Greek developer starts a project using the µTorrent (\u03BCTorrent)
identifier in the source code (a variable name, not a module name). An
English developer writes a patch using µTorrent written as \xB5Torrent:
both forms are accepted by Python, and it works.

"exec"))
it works
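
A short sketch of this kind of check (an illustration, not necessarily
the snippet used in this message; the names source, namespace and result
are made up): one line assigns the variable using the Greek letter, the
next line reads it using the micro sign.

```python
namespace = {}

# Line 1 uses GREEK SMALL LETTER MU (U+03BC), line 2 uses MICRO SIGN (U+00B5)
source = "\u03bcTorrent = 'it works'\nresult = \xb5Torrent\n"

# The parser normalizes both identifiers to NFKC, so they name the same variable
exec(compile(source, "<string>", "exec"), namespace)

print(namespace["result"])  # it works

# Only the normalized (U+03BC) spelling is stored in the namespace
print(ascii(sorted(k for k in namespace if k.endswith("Torrent"))))  # ['\u03bcTorrent']
```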

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue10952>
_______________________________________

