[Python-Dev] Unicode strings as filenames

Martin v. Loewis martin@v.loewis.de
Fri, 4 Jan 2002 00:34:25 +0100


>     >> What's the correct way to deal with filenames in a Unicode
>     >> environment?  Consider this:
>     >>
>     >> >>> import site
>     >> >>> site.encoding
>     >> 'latin-1'
> 
>     Martin> Setting site.encoding is certainly the wrong thing to do. How
>     Martin> can you know all users of your system use latin-1?
> 
> Why is setting site.encoding appropriate to your environment at the time you
> install Python wrong?  I can't know that all users of my system (whatever
> the definition of "my system" is) will use latin-1.  Somewhere along the way
> I have to make some assumptions, however.

Well, then accept the assumption that almost everybody will use an
ASCII superset. That may still be wrong in the case of EBCDIC users,
but those are rare on Unix.

However, on our typical Unix system, three different encodings are in
use: ISO-8859-1 (for tradition), ISO-8859-15 (because it has the
Euro), and UTF-8 (because it removes all the limitations). Notice that
all of our users speak German, and we still could not set a meaningful
site.encoding except for 'ascii'.
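
To illustrate why no single value fits, here is a minimal sketch in
present-day Python 3 (not the Python of this thread). The byte string
is a made-up file name, and decode_candidates is a hypothetical
helper: the same bytes give a different name, or none at all,
depending on which of the three encodings is assumed.

```python
def decode_candidates(raw, codecs=("iso-8859-1", "iso-8859-15", "utf-8")):
    """Decode the same bytes under each codec; None marks a failure."""
    results = {}
    for codec in codecs:
        try:
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            results[codec] = None
    return results


# Made-up file name bytes, as a directory listing might return them.
raw_name = b"Preis_\xa4.txt"
for codec, text in decode_candidates(raw_name).items():
    print(codec, ascii(text))
```

Byte 0xA4 is the generic currency sign in ISO-8859-1, the Euro sign in
ISO-8859-15, and not a valid sequence on its own in UTF-8.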

>     On any given computer I assume the people who install Python will set
>     site.encoding appropriate to their environment.

That is probably wrong. Most users will install precompiled packages,
and thus site.py will have the value that the package held, which will
be 'ascii' for most packages.

>     The example I used was latin-1 simply because the folks I'm working with
>     are in Austria and they came up with the example.  I assume the best
>     default encoding for them is latin-1.

Well, latin-1 does not have a Euro sign, which may be more and more of
a problem.
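
A quick way to check this, sketched in present-day Python 3
(can_encode is a hypothetical helper name, not a library function):

```python
def can_encode(text, codec):
    """Return True if every character of text exists in codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


euro = "\u20ac"  # EURO SIGN
for codec in ("latin-1", "iso-8859-15", "utf-8"):
    print(codec, can_encode(euro, codec))
```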

>     The application writers themselves will have no problem restricting
>     internal filenames to be ascii.  I assume if users want to save files of
>     their own, they will choose characters from the Unicode character set
>     they use most frequently.

That is a meaningful assumption. However, it is one that you have to
make in your application, not one that you should expect users to make
in their Python installations.
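
As a rough sketch of what making that assumption in the application
could look like (present-day Python 3; APP_FILENAME_ENCODING and
encoded_path are hypothetical names, and iso-8859-15 is only an
example value): the application encodes the name itself instead of
relying on an installation-wide default.

```python
import os

# Hypothetical application-level setting; the value is only an example.
APP_FILENAME_ENCODING = "iso-8859-15"


def encoded_path(directory, name, encoding=APP_FILENAME_ENCODING):
    """Return a bytes path with name encoded explicitly by the application."""
    # os.fsencode passes bytes through and encodes a str directory name.
    return os.path.join(os.fsencode(directory), name.encode(encoding))


print(encoded_path("data", "B\u00fccher.txt"))
```

The resulting bytes path can be passed to open() as it is.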

> The above setlocale call prints
>
> 'LC_CTYPE=en_US;LC_NUMERIC=en_US;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en;LC_NAME=en;LC_ADDRESS=en;LC_TELEPHONE=en;LC_MEASUREMENT=en;LC_IDENTIFICATION=en'

You may want to extend your system to support the same configuration
that your users have, i.e. you might want to install an Austrian
locale on your system, and set LANG to de_AT. If your system also sets
all the LC_ variables for you, I recommend unsetting them; setting
LANG is then enough. (To override all other LC_ variables, setting
LC_ALL to de_AT should also work.)
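
The reason for unsetting them is the lookup order: for each category,
LC_ALL is consulted first, then the category's own variable, then
LANG. A minimal sketch of that order in Python (effective_locale is a
hypothetical helper that only mimics the lookup; it does not call the
C library):

```python
def effective_locale(category, environ):
    """Mimic the POSIX lookup order for one locale category."""
    for var in ("LC_ALL", category, "LANG"):
        value = environ.get(var)
        if value:  # unset and empty both mean "not specified"
            return value
    return "C"  # nothing set: the default locale applies


# LANG alone is enough ...
print(effective_locale("LC_CTYPE", {"LANG": "de_AT"}))
# ... unless a category variable is still set and shadows it.
print(effective_locale("LC_CTYPE", {"LANG": "de_AT", "LC_CTYPE": "en_US"}))
```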

> I can't get to the machines in Austria right now to see how their locales
> are set, though I suspect they haven't fiddled their LC_* environment,
> because they are having the problems I described.

Even if they set the environment variables, they'd still have the
problem, because your application doesn't call setlocale.
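
A minimal sketch of what the application would have to do, written
for present-day Python 3 rather than the Python of this thread
(locale_encoding is a hypothetical helper name):

```python
import locale


def locale_encoding():
    """Adopt the user's locale, then report the encoding it implies."""
    try:
        # "" means: take the locale from the LC_ALL / LC_* / LANG variables.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The requested locale is not installed; keep the current one.
        pass
    # False: do not let getpreferredencoding call setlocale on its own.
    return locale.getpreferredencoding(False) or "ascii"


print(locale_encoding())
```

Without the setlocale call the process stays in the "C" locale,
whatever the environment variables say.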

I do expect that they have set LANG to de_AT, or de_AT.ISO-8859-1.

Perhaps they also have this problem because they use Python 2.1 or
earlier.

> This suggests to me that the Python docs need some introductory material on
> this topic.  It appears to me that the two people in the Python
> community who live and breathe this stuff are you, Martin, and Marc-André.
> For most of the rest of us, especially if we've never consciously written
> code for consumption outside an ascii environment, the whole thing just
> looks like a quagmire.

Well, I'd happily review any introductory material somebody else
writes :-)

Regards,
Martin