On Monday 29 September 2008 18:45:28, Georg Brandl wrote:
> If I had to choose, I'd still argue for the modified UTF-8 as filesystem
> encoding (if it were UTF-8 otherwise), despite possible surprises when a
> such-encoded filename escapes from Python.
If I understand this solution correctly, the idea is to change the default filesystem encoding, right? E.g. if your filesystem is UTF-8, use ISO-8859-1 to make sure that the conversion to unicode will never fail. Let's try with some ugly entries on my UTF-8 filesystem:

$ find .
./têste
./ô
./a?b
./dossié
./dossié/abc
./dir?name
./dir?name/xyz

Python 3 using encoding=ISO-8859-1:
>>> import os
>>> os.listdir(b'.')
[b't\xc3\xaaste', b'\xc3\xb4', b'a\xffb', b'dossi\xc3\xa9', b'dir\xffname']
>>> files = os.listdir('.'); files
['têste', 'ô', 'aÿb', 'dossié', 'dirÿname']
>>> open(files[0]).close()
>>> os.listdir(files[-1])
['xyz']
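To make the two properties of this scheme concrete, here is a minimal sketch (not part of the session above; the byte literals are reused from it) showing that ISO-8859-1 decoding can never fail and round-trips the on-disk bytes exactly, but that the resulting strings are mojibake which clashes with real unicode text:

```python
# 1) Lossless wrapping: ISO-8859-1 maps every byte 0x00-0xFF to the
#    code point of the same value, so decode() cannot raise and
#    encode() restores the original bytes.
raw = b'dossi\xc3\xa9'                  # UTF-8 bytes of 'dossié'
name = raw.decode('ISO-8859-1')         # never fails, gives mojibake
assert name.encode('ISO-8859-1') == raw

bad = b'dir\xffname'                    # not decodable as UTF-8
assert bad.decode('ISO-8859-1').encode('ISO-8859-1') == bad

# 2) Mixing with real unicode breaks: joining a wrapped directory name
#    with a genuine text component yields a path whose encoded form
#    mixes two charsets (UTF-8 bytes before the '/', ISO-8859-1 after).
path = name + '/' + 'fichié'            # second component is real text
assert path.encode('ISO-8859-1') == b'dossi\xc3\xa9/fichi\xe9'
```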
Ok, I have unicode filenames and I'm able to open a file and list a directory. The problem is now to display the filenames correctly. For me, "unicode" means "text (characters) encoded in the correct charset"; here, unicode is just storage for *bytes* in a custom charset.

How can we mix <custom unicode (bytes decoded as ISO-8859-1)> with <real unicode>? E.g. os.path.join('dossié', 'fichié'): the first argument is bytes wrapped in ISO-8859-1 whereas the second argument is real unicode. The result is something like:

str(b'dossi\xc3\xa9', 'ISO-8859-1') + '/' + 'fichi\xe9'

whereas the correct (unicode) result should be 'dossié/fichié', i.e.:

as bytes in ISO-8859-1: b'dossi\xe9/fichi\xe9'
as bytes in UTF-8: b'dossi\xc3\xa9/fichi\xc3\xa9'

Changing the default filesystem encoding to store bytes in unicode is like introducing a new Python type: <fake unicode for filename hacks>.

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/