[Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue
Victor Stinner
victor.stinner at haypocalc.com
Tue Sep 30 02:02:38 CEST 2008
On Monday 29 September 2008 18:45:28, Georg Brandl wrote:
> If I had to choose, I'd still argue for the modified UTF-8 as filesystem
> encoding (if it were UTF-8 otherwise), despite possible surprises when a
> such-encoded filename escapes from Python.
If I understand this solution correctly, the idea is to change the default
file system encoding, right? E.g. if your file system is UTF-8, use ISO-8859-1
to make sure that converting a filename from bytes to unicode never fails.
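A minimal sketch of why ISO-8859-1 cannot fail where UTF-8 can: every byte
value maps to exactly one character, so decoding always succeeds and
re-encoding gives back the original bytes. The byte string below is only an
illustrative name containing an invalid UTF-8 byte.

```python
# b'\xff' can never appear in valid UTF-8, so this name is undecodable there.
raw = b'a\xffb'

try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 maps each byte 0..255 to the code point with the same number,
# so the decode always works and the round trip is lossless.
as_latin1 = raw.decode('iso-8859-1')
roundtrip = as_latin1.encode('iso-8859-1')

print(utf8_ok)            # False
print(repr(as_latin1))    # 'aÿb'
print(roundtrip == raw)   # True
```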
Let's try with an ugly directory on my UTF-8 file system:
$ find
.
./têste
./ô
./a?b
./dossié
./dossié/abc
./dir?name
./dir?name/xyz
Python3 using encoding=ISO-8859-1:
>>> import os; os.listdir(b'.')
[b't\xc3\xaaste', b'\xc3\xb4', b'a\xffb', b'dossi\xc3\xa9', b'dir\xffname']
>>> files=os.listdir('.'); files
['têste', 'ô', 'aÿb', 'dossié', 'dirÿname']
>>> open(files[0]).close()
>>> os.listdir(files[-1])
['xyz']
OK, I have unicode filenames, and I'm able to open a file and list a directory.
The problem now is displaying the filenames correctly.
To me, "unicode" means "text (characters) encoded in the correct
charset". In this case, unicode is just storage for *bytes* in a custom
charset.
How can we mix <custom unicode (bytes encoded in ISO-8859-1)> with <real
unicode>? E.g. os.path.join('dossié', "fichié"): the first argument is encoded
in ISO-8859-1 whereas the second argument is real Unicode. It's
something like this:
str(b'dossi\xc3\xa9', 'ISO-8859-1') + '/' + 'fichi\xe9'
Whereas the correct (unicode) result should be:
'dossié/fichié'
as bytes in UTF-8:
b'dossi\xc3\xa9/fichi\xc3\xa9'
as bytes in ISO-8859-1:
b'dossi\xe9/fichi\xe9'
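The mismatch can be reproduced without touching the file system. This is only
a sketch: the names fake, real, mixed and expected are mine, and fake stands
in for what os.listdir() would return under the ISO-8859-1 trick.

```python
# UTF-8 bytes from disk, decoded one byte per character (the "custom unicode").
fake = str(b'dossi\xc3\xa9', 'ISO-8859-1')
# Genuine Unicode text, e.g. typed by the user.
real = 'fichi\xe9'

mixed = fake + '/' + real
expected = 'dossi\xe9/fichi\xe9'   # 'dossié/fichié'

print(mixed == expected)            # False
# Encoding back with the file system encoding yields a half-UTF-8,
# half-ISO-8859-1 byte string:
print(mixed.encode('ISO-8859-1'))   # b'dossi\xc3\xa9/fichi\xe9'
# What a UTF-8 file system should have received:
print(expected.encode('utf-8'))     # b'dossi\xc3\xa9/fichi\xc3\xa9'
```

The second half of the mixed name, b'fichi\xe9', is not valid UTF-8, so the
resulting path is exactly the kind of filename the trick was meant to work
around.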
Changing the default file system encoding to store bytes in Unicode is like
introducing a new Python type: <fake Unicode for filename hacks>.
--
Victor Stinner aka haypo
http://www.haypocalc.com/blog/