[Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

Georg Brandl g.brandl at gmx.net
Tue Sep 30 19:28:23 CEST 2008


Guido van Rossum wrote:

>> With the filenames decoded as UTF-8, your files named têste, ô, dossié will
>> be displayed and handled correctly. The others are *invalid* in the filesystem
>> encoding UTF-8 and therefore would be represented by something like
>>
>> u'dir\uXXffname' where XX is some private use Unicode namespace. It won't look
>> pretty when printed, but then, what do other applications do? They e.g. display
>> a question mark as you show above, which is not better in terms of readability.
>>
>> But it will work when given to a filename-handling function. Valid filenames
>> can be compared to Unicode strings.
>>
>> A real-world example: OpenOffice can't open files with invalid bytes in their
>> name. They are displayed in the "Open file" dialog, but trying to open fails.
>> This regularly drives me crazy. Let's not make Python fail this way too,
>> or, even worse, not even display those filenames.
> 
> How can it *regularly* drive you crazy when "the majority of file names
> [...] encoded correctly" (as you assert above)?

Because Office files are a) often named with long, seemingly descriptive
filenames, which invariably means umlauts in German, and b) often sent around
between systems, creating encoding problems.

Having seen how much controversy returning an invalid Unicode string sparks,
and given that it really isn't obvious to the newbie either, I think I now agree
that dropping undecodable filenames from a listdir() call that returns Unicode
filenames is the best solution. I'm a little uneasy about having one function
return both bytes and Unicode, because I thought we had left that kind of
str/unicode mixing behind in 2.x, but of course I can live with it.
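
To make the "drop what cannot be decoded" idea concrete, here is a minimal
sketch in Python 3 syntax. It is only an illustration, not the implementation
under discussion: the helpers decodable_names() and listdir_text() are
hypothetical names, and the sketch assumes UTF-8 as the filesystem encoding.
The one thing it relies on from the standard library is that os.listdir()
returns bytes names when given a bytes path.

```python
import os


def decodable_names(raw_names, encoding="utf-8"):
    """Hypothetical helper: keep only the names that decode cleanly."""
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Invalid in the assumed filesystem encoding: silently dropped.
            continue
    return names


def listdir_text(path, encoding="utf-8"):
    """Hypothetical helper: a listdir() that returns only decodable names.

    A bytes path makes os.listdir() return the raw bytes names, which
    are then filtered by decodable_names().
    """
    return decodable_names(os.listdir(os.fsencode(path)), encoding)
```

A caller who needs every entry, including the undecodable ones, would pass a
bytes path to os.listdir() directly and work with the raw names.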

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.


