On Monday 29 September 2008 18:45:28, Georg Brandl wrote:
> If I had to choose, I'd still argue for the modified UTF-8 as filesystem
> encoding (if it were UTF-8 otherwise), despite possible surprises when a
> such-encoded filename escapes from Python.
If I understand this solution correctly, the idea is to change the default filesystem encoding, right? E.g. if your filesystem is UTF-8, use ISO-8859-1 to make sure that the conversion to unicode will never fail. Let's try with some ugly entries on my UTF-8 filesystem:

$ find .
./têste
./ô
./a?b
./dossié
./dossié/abc
./dir?name
./dir?name/xyz

Python 3 using encoding=ISO-8859-1:
>>> import os
>>> os.listdir(b'.')
[b't\xc3\xaaste', b'\xc3\xb4', b'a\xffb', b'dossi\xc3\xa9', b'dir\xffname']
>>> files = os.listdir('.'); files
['têste', 'ô', 'aÿb', 'dossié', 'dirÿname']
>>> open(files[0]).close()
>>> os.listdir(files[-1])
['xyz']
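To make the two properties of this scheme concrete, here is a minimal sketch (not part of the session above; the byte literals are reused from it) showing that ISO-8859-1 decoding can never fail and round-trips the on-disk bytes exactly, but that the resulting strings are mojibake which clashes with real unicode text:

```python
# 1) Lossless wrapping: ISO-8859-1 maps every byte 0x00-0xFF to the
#    code point of the same value, so decode() cannot raise and
#    encode() restores the original bytes.
raw = b'dossi\xc3\xa9'                  # UTF-8 bytes of 'dossié'
name = raw.decode('ISO-8859-1')         # never fails, gives mojibake
assert name.encode('ISO-8859-1') == raw

bad = b'dir\xffname'                    # not decodable as UTF-8
assert bad.decode('ISO-8859-1').encode('ISO-8859-1') == bad

# 2) Mixing with real unicode breaks: joining a wrapped directory name
#    with a genuine text component yields a path whose encoded form
#    mixes two charsets (UTF-8 bytes before the '/', ISO-8859-1 after).
path = name + '/' + 'fichié'            # second component is real text
assert path.encode('ISO-8859-1') == b'dossi\xc3\xa9/fichi\xe9'
```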
Ok, I have unicode filenames and I'm able to open a file and list a directory. The problem is now to display the filenames correctly. For me, "unicode" means "text (characters) encoded in the correct charset"; here, unicode is just storage for *bytes* in a custom charset.

How can we mix <custom unicode (bytes decoded as ISO-8859-1)> with <real unicode>? E.g. os.path.join('dossié', 'fichié'): the first argument is bytes wrapped in ISO-8859-1 whereas the second argument is real unicode. The result is something like:

str(b'dossi\xc3\xa9', 'ISO-8859-1') + '/' + 'fichi\xe9'

whereas the correct (unicode) result should be 'dossié/fichié', i.e.:

as bytes in ISO-8859-1: b'dossi\xe9/fichi\xe9'
as bytes in UTF-8: b'dossi\xc3\xa9/fichi\xc3\xa9'

Changing the default filesystem encoding to store bytes in unicode is like introducing a new Python type: <fake unicode for filename hacks>.

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/