unicode filenames

Sun Mar 2 06:58:36 EST 2003

Andrew Dalke <adalke at mindspring.com> writes:

> Okay, so it seems like no one knows how to handle unicode filenames
> under Unix.  Perhaps the following is the proper behaviour?

"Unix" is too broad a term here. Different *installations* of the very
same software product may use different means to represent non-ASCII
characters in file names (even different directories in the same
installation); how to interpret them is purely a matter of convention.
Python is somewhat at a loss in guessing the "right" thing.
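
To illustrate why the interpretation is only a convention, here is a
minimal sketch in present-day Python 3. The helper name interpret() and
the sample file name are made up for the example; the point is merely
that the same bytes on disk give different text depending on which
codeset you assume.

```python
def interpret(raw_name, codeset):
    """Decode the raw bytes of a file name under an assumed codeset.

    Hypothetical helper: bytes that are invalid in the assumed codeset
    are replaced with U+FFFD instead of raising an error.
    """
    return raw_name.decode(codeset, errors="replace")


# A made-up file name as it might be stored on disk: "caf", then the
# single byte 0xE9, then ".txt".
raw = b"caf\xe9.txt"

# Under ISO-8859-1 the byte 0xE9 is e-acute.
as_latin1 = interpret(raw, "iso-8859-1")

# Under UTF-8 a lone 0xE9 is an incomplete sequence, so it is replaced.
as_utf8 = interpret(raw, "utf-8")

if __name__ == "__main__":
    print(repr(as_latin1))
    print(repr(as_utf8))
```

Nothing in the file name itself says which of the two readings was
intended.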

The emerging convention is that the locale's codeset determines the
encoding of file names. This convention is used in a number of Linux
distributions, and other Unices.

>    1) there is a default filesystem encoding, which is initialized
>        to None if os.path.supports_unicode_file is True, otherwise
>        it's set to sys.getdefaultencoding()

Since Python 2.2 (I believe), invoking locale.setlocale will set the
file system default encoding to what the system's nl_langinfo(CODESET)
returns - provided the system has both nl_langinfo and CODESET.
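
A rough sketch of how to look at these values, written for present-day
Python 3 rather than 2.2. The helper locale_codeset() is a made-up name;
locale.setlocale, locale.nl_langinfo, locale.CODESET and
sys.getfilesystemencoding are the real library names. I believe current
Python 3 releases fix the file system encoding at interpreter startup,
so treat the second value as something to inspect, not as something the
setlocale call is guaranteed to change.

```python
import locale
import sys


def locale_codeset():
    """Return the codeset of the user's locale, or None if unavailable.

    Hypothetical helper. nl_langinfo and CODESET do not exist on every
    platform, hence the hasattr checks.
    """
    try:
        # Adopt the locale settings from the environment (LANG, LC_*).
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale this system does not support;
        # keep whatever locale is currently active.
        pass
    if hasattr(locale, "nl_langinfo") and hasattr(locale, "CODESET"):
        return locale.nl_langinfo(locale.CODESET)
    return None


if __name__ == "__main__":
    print("locale codeset:     ", locale_codeset())
    print("filesystem encoding:", sys.getfilesystemencoding())
```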

>    2) there is a registration system which is used to define encodings
>        used for different mount locations.  If a filename/dirname is
>        not covered, sue the default filesystem encoding

Ok, I'll sue :-)

Such a scenario should not be supported. The encoding should be
uniform in all components of a path, and it is the system
administrator's task to make sure this is the case.

> If this makes sense, should it be added to Python's core?

Not in the way you have described it. Because Unix is tricky in this
respect (and NT+ is much more advanced), the existing PEP deliberately
targets NT+ only, leaving Unix for further study.

Regards,
Martin