[Python-Dev] Import and unicode: part two

Wed Jan 19 13:34:02 CET 2011

Hi,

I patched Python 3.2 to support modules with non-ASCII paths (*). It
works well on all operating systems, but the task is not completely done:

 (a) Python 3 doesn't support non-ASCII module names 
 (b) Python 3 doesn't support unencodable characters in the module path

I would like to know whether we need to support both. Terry J. Reedy wrote
(issue #10828): "I think bugs in core syntax should have high priority.
I appreciate your work toward fixing it."

I wrote a patch (issue #3080) fixing both points. If you agree that both
issues should be fixed, I will fix them in Python 3.3.

(a) is issue #10828, reported recently (January 2011): "import
gui_jämföra" doesn't work with a locale encoding other than UTF-8
(so it doesn't work on Windows).
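
A minimal sketch of the kind of mismatch behind (a). It assumes (this is
an assumption, not something verified against the import code) that the
module name is carried as UTF-8 bytes while the file system layer expects
the locale encoding; cp1252 is picked only as an example of a non-UTF-8
locale encoding.

```python
# Module name taken from issue #10828.
name = "gui_jämföra"

# Assumption: one side of the import machinery produces UTF-8 bytes...
utf8_bytes = name.encode("utf-8")

# ...while the other side works with the locale encoding (cp1252 here).
cp1252_bytes = name.encode("cp1252")

# The two byte strings differ, so a byte-for-byte file name lookup fails.
same_bytes = (utf8_bytes == cp1252_bytes)

# Reading the UTF-8 bytes as cp1252 gives a different (garbled) name.
misread = utf8_bytes.decode("cp1252")

print(same_bytes)
print(ascii(misread))
```

With a pure ASCII name both encodings give the same bytes, which is why
only non-ASCII module names are affected.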

(b) is specific to Windows: FAT32 and NTFS filesystems store filenames
in Unicode, but Python encodes paths to the ANSI code page (which covers
only a very small subset of Unicode). If a character cannot be encoded to
the code page, you cannot load the module. E.g. put a Japanese character
in a directory name on a Windows system using the cp1252 (English) code
page. I don't think that (b) has been reported by a user yet; it's more
of a theoretical problem.
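
The encoding failure behind (b) can be reproduced on any platform with
the cp1252 codec. This is only a sketch: the helper name is_encodable()
and the example paths are made up for illustration, and the codec is
named explicitly instead of relying on the Windows-only "mbcs" codec.

```python
def is_encodable(path, encoding):
    """Return True if every character of path exists in encoding."""
    try:
        path.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# Hypothetical sys.path entries, for illustration only.
western_path = "C:\\café\\lib"
japanese_path = "C:\\モジュール\\lib"

print(is_encodable(western_path, "cp1252"))
print(is_encodable(japanese_path, "cp1252"))
print(is_encodable(japanese_path, "utf-8"))
```

A path that fails this check cannot be passed through a bytes API that
uses the code page, whatever the file system itself is able to store.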

My patch is huge, but it simplifies the code. We don't need to
regularly convert from/to UTF-8 anymore. And for the functions using
PyUnicodeObject objects (and not a Py_UNICODE* buffer): PyUnicodeObject
stores the string length (it avoids calls to strlen()) and
PyUnicode_FromFormat() doesn't need a buffer size (no risk of buffer
overflow). I suppose that it makes Python faster, but I haven't measured
it.

(*) Python 3.2 doesn't support non-ASCII in the module *name*, only in
the path (sys.path).

Victor Stinner