[Python-3000] New proposition for Python3 bytes filename issue

Victor Stinner victor.stinner at haypocalc.com
Tue Sep 30 02:02:38 CEST 2008


Le Monday 29 September 2008 18:45:28 Georg Brandl, vous avez écrit :
> If I had to choose, I'd still argue for the modified UTF-8 as filesystem
> encoding (if it were UTF-8 otherwise), despite possible surprises when a
> such-encoded filename escapes from Python.

If I understand correctly this solution. The idea is to change the default 
file system encoding, right? Eg. if your filesystem is UTF-8, use ISO-8859-1 
to make sure that UTF-8 conversion will never fail.

Let's try with an ugly directory on my UTF-8 file system:
$ find
.
./têste
./ô
./a?b
./dossié
./dossié/abc
./dir?name
./dir?name/xyz

Python3 using encoding=ISO-8859-1:
>>> import os; os.listdir(b'.')
[b't\xc3\xaaste', b'\xc3\xb4', b'a\xffb', b'dossi\xc3\xa9', b'dir\xffname']
>>> files=os.listdir('.'); files
['têste', 'ô', 'aÿb', 'dossié', 'dirÿname']
>>> open(files[0]).close()
>>> os.listdir(files[-1])
['xyz']

Ok, I have unicode filenames and I'm able to open a file and list a directory. 
The problem is now to display correctly the filenames.

For me "unicode" sounds like "text (characters) encoded in the correct 
charset". In this case, unicode is just a storage for *bytes* in a custom 
charset.

How can we mix <custom unicode (bytes encoded in ISO-8859-1)> with <real 
unicode>? Eg. os.path.join('dossié', "fichié") : first argument is encoded 
in ISO-8859-1 whereas the second argument is encoding in Unicode. It's 
something like that:
   str(b'dossi\xc3\xa9', 'ISO-8859-1') + '/' + 'fichi\xe9'

Whereas the correct (unicode) result should be: 
   'dossié/fichié'
as bytes in ISO-8859-1:
   b'dossi\xc3\xa9/fichi\xc3\xa9'
as bytes in UTF-8:
   b'dossi\xe9/fichi\xe9'

Change the default file system encoding to store bytes in Unicode is like 
introducing a new Python type: <fake Unicode for filename hacks>.

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/


More information about the Python-3000 mailing list