[Python-Dev] Unicode strings as filenames
Skip Montanaro
skip@pobox.com (Skip Montanaro)
Thu, 3 Jan 2002 17:11:10 -0600
>>>>> "Martin" == Martin v Loewis <martin@v.loewis.de> writes:
>> What's the correct way to deal with filenames in a Unicode
>> environment? Consider this:
>>
>> >>> import site
>> >>> site.encoding
>> 'latin-1'
Martin> Setting site.encoding is certainly the wrong thing to do. How
Martin> can you know all users of your system use latin-1?
Why is it wrong to set site.encoding to suit your environment at the time
you install Python? I can't know that all users of my system (whatever
the definition of "my system" is) will use latin-1. Somewhere along the
way I have to make some assumptions, however.
    * On any given computer I assume the people who install Python will
      set site.encoding appropriate to their environment.
    * The example I used was latin-1 simply because the folks I'm working
      with are in Austria and they came up with the example. I assume
      the best default encoding for them is latin-1.
    * The application writers themselves will have no problem restricting
      internal filenames to be ASCII. I assume that if users want to save
      files of their own, they will choose characters from the Unicode
      character set they use most frequently.
So, my example used latin-1. I could just as easily have chosen something
else.
Martin> On my system, the following works fine
Martin> >>> import locale ; locale.setlocale(locale.LC_ALL,"")
Martin> 'LC_CTYPE=de_DE;LC_NUMERIC=de_DE;LC_TIME=de_DE;LC_COLLATE=C;LC_MONETARY=de_DE;LC_MESSAGES=de_DE;LC_PAPER=de_DE;LC_NAME=de_DE;LC_ADDRESS=de_DE;LC_TELEPHONE=de_DE;LC_MEASUREMENT=de_DE;LC_IDENTIFICATION=de_DE'
Martin> >>> a = "abc\xe4\xfc\xdf.txt"
Martin> >>> u = unicode(a, "latin-1")
Martin> >>> open(u, "w")
Martin> <open file 'abcäüß.txt', mode 'w' at 0x8173e88>
Martin> On Unix, your best bet for file names is to trust the user's
Martin> locale settings. If you do that, open will accept Unicode
Martin> objects.
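As a rough illustration of "trust the user's locale settings", the sketch below shows one way to turn a Unicode file name into the bytes the locale implies. It is written for present-day Python 3 rather than the Python 2 discussed in this thread, and `locale_encoding` and `encode_filename` are hypothetical helper names, not part of any library.

```python
import locale


def locale_encoding():
    """Return the encoding implied by the user's locale settings."""
    try:
        # Adopt the locale from the LC_* / LANG environment variables.
        # Note: this changes the process-wide locale as a side effect.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment names a locale this system does not support;
        # keep whatever locale is currently active.
        pass
    return locale.getpreferredencoding(False)


def encode_filename(name, encoding=None):
    """Encode a Unicode file name to bytes (hypothetical helper).

    If no encoding is given, the locale's preferred encoding is used.
    """
    if encoding is None:
        encoding = locale_encoding()
    # Raises UnicodeEncodeError if the encoding cannot represent the name.
    return name.encode(encoding)
```

With a latin-1 locale, `encode_filename("abc\xe4\xfc\xdf.txt")` would yield the same bytes as the string `a` in the example above. A name the locale cannot represent raises `UnicodeEncodeError` rather than silently producing a mangled file name.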
Martin> What is your locale?
On my system, your setlocale call above prints
    'LC_CTYPE=en_US;LC_NUMERIC=en_US;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en;LC_NAME=en;LC_ADDRESS=en;LC_TELEPHONE=en;LC_MEASUREMENT=en;LC_IDENTIFICATION=en'
I can't get to the machines in Austria right now to see how their locales
are set, though I suspect they haven't fiddled with their LC_* environment,
because they are having the problems I described.
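One way to check whether the LC_* environment has been touched at all is to look at the variables directly. The sketch below (Python 3; `effective_ctype` is a hypothetical name) applies the POSIX precedence for the character-type category: LC_ALL first, then LC_CTYPE, then LANG, with unset or empty variables skipped.

```python
import os


def effective_ctype(environ=None):
    """Return the locale name that governs character handling.

    Hypothetical helper: POSIX consults LC_ALL first, then LC_CTYPE,
    then LANG; unset or empty variables are skipped.
    """
    if environ is None:
        environ = os.environ
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name, "")
        if value:
            return value
    # Nothing set: the default is implementation-defined, typically "C".
    return "C"
```

Calling `effective_ctype()` with no argument inspects the current process environment. A result of "C" would be consistent with a machine where only ASCII file names work as expected.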
>> Is that the correct approach? Apparently Python's file object
>> doesn't do this under the covers. Should it?
Martin> No. There is no established convention, on Unix, how to do
Martin> non-ASCII file names. If anything, following the user's locale
Martin> setting is the most reasonable thing to do; this should be in
Martin> synch with how the user's terminal displays characters. The Python
Martin> installation's default encoding is almost useless, and shouldn't
Martin> be changed.
Martin> On Windows, things are much better, since there is a notion of
Martin> Unicode file names in the system.
This suggests to me that the Python docs need some introductory material on
this topic. It appears to me that the two people in the Python community
who live and breathe this stuff are you, Martin, and Marc-André. For most
of the rest of us, especially if we've never consciously written code for
consumption outside an ASCII environment, the whole thing just looks like
a quagmire.
Skip