[issue10828] Cannot use nonascii utf8 in names of files imported from

Sun Jan 9 03:40:25 CET 2011

STINNER Victor <victor.stinner at haypocalc.com> added the comment:

> ANSI code page: cp1252 ...os.fsencode('ä') => b'\xe4'

Hum, I ran your example with a debugger, and ok, I now remember the whole thing.

I fixed Python to support non-ASCII characters (... only non-ASCII characters encodable to the ANSI code page for Windows) in the *search path*, not in the module name.

The import machinery encodes each search path to the filesystem encoding, but it encodes the module name to UTF-8. Concatenating two byte strings encoded with different encodings doesn't work (it leads to mojibake).
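A minimal sketch of that mismatch, assuming cp1252 as the ANSI code page (as in the quoted report). The directory name, the variable names, and the hand-written concatenation are hypothetical; this is not the actual import code, only an illustration of why mixing the two encodings produces a wrong file name:

```python
# Assumption: cp1252 stands in for the Windows ANSI code page.
ansi_code_page = "cp1252"

search_dir = "d\u00e9mo"    # 'démo', a hypothetical search path entry
module_name = "\u00e4"      # 'ä', the module name from the report

# The search path is encoded to the filesystem encoding...
path_bytes = search_dir.encode(ansi_code_page)   # b'd\xe9mo'
# ...but the module name is encoded to UTF-8.
name_bytes = module_name.encode("utf-8")         # b'\xc3\xa4'

# Concatenating the two gives a byte string with no single valid encoding.
mixed = path_bytes + b"\\" + name_bytes + b".py"

# Interpreted in the ANSI code page, the module name turns into mojibake.
as_ansi = mixed.decode(ansi_code_page)

# Interpreted as UTF-8, the directory part is not even decodable.
try:
    mixed.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

expected = search_dir + "\\" + module_name + ".py"
print(repr(as_ansi), repr(expected), utf8_ok)
```

Neither interpretation gives back the intended path, so the file lookup fails.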

To fix this problem, there are two solutions:

 a) encode the module name to the filesystem encoding
 b) manipulate paths as unicode strings; to access the filesystem: use the wide character (unicode) API of Windows and encode paths to the filesystem encoding on UNIX/BSD

It is easier to implement (a) than (b), but (a) only supports paths and module names that are encodable to the ANSI code page.

(b) gives you full Unicode support because it never *encodes* paths to the filesystem encoding, although it may *decode* paths from the filesystem encoding. Encoding a path raises a UnicodeEncodeError on the first character not encodable to the ANSI code page, whereas decoding a path never fails (except if the user manually changed their code page to a rare ANSI code page like UTF-8).
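The asymmetry between encoding and decoding can be sketched as follows, again assuming cp1252 as the ANSI code page. The names are hypothetical, and Python's cp1252 codec is only a stand-in for the conversion Windows itself performs:

```python
# Assumption: cp1252 stands in for the Windows ANSI code page.
ansi_code_page = "cp1252"

name_ok = "\u00e4"     # 'ä', encodable to cp1252
name_bad = "\u65e5"    # a CJK character, not encodable to cp1252

# Solution (a): encoding works only for names the code page can represent.
encoded_ok = name_ok.encode(ansi_code_page)

try:
    name_bad.encode(ansi_code_page)
    encode_failed = False
except UnicodeEncodeError:
    # Raised on the first character outside the ANSI code page.
    encode_failed = True

# Decoding bytes that came from the filesystem gives back the text.
decoded = encoded_ok.decode(ansi_code_page)

print(encoded_ok, encode_failed, repr(decoded))
```

With solution (b) the second name never needs to be encoded on Windows, because the path stays a unicode string all the way to the wide character API.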

I implemented (b) in my import_unicode SVN branch, but as I wrote, I still have some work to merge this branch into py3k, and anyway I will wait for Python 3.3.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue10828>
_______________________________________