On Mon, Sep 29, 2008 at 11:22 PM, Georg Brandl <g.brandl@gmx.net> wrote:
No, that was not what I meant (although it is another possibility). As I wrote, Martin's proposal that I support here is using the modified UTF-8 codec that successfully roundtrips otherwise invalid UTF-8 data.
I thought that the "successful roundtripping" pretty much stopped as soon as the unicode data is exported to somewhere else -- doesn't it contain invalid surrogate sequences? In general, I'm very reluctant to use utf-8b given that it doesn't seem to be well documented as a standard anywhere. Providing some minimal APIs that can process raw-bytes filenames still makes more sense -- it is mostly analogous to our treatment of text files, where the underlying binary data is also accessible.
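To make the two points above concrete, here is a rough sketch. It assumes Python 3's "surrogateescape" error handler as a stand-in for the utf-8b idea (that handler was added later and is not what this thread had available), and the file name and helper names are made up for illustration. It shows that a string carrying escaped bytes cannot be exported through a strict UTF-8 encoder, and that passing a bytes path to os.listdir() gives back the raw-bytes names.

```python
import os

# Hypothetical file name; the byte 0xff can never occur in valid UTF-8.
raw = b"dir\xffname"

# Assumption: "surrogateescape" plays the role of utf-8b here. Each
# undecodable byte becomes a lone surrogate in U+DC80..U+DCFF.
text = raw.decode("utf-8", "surrogateescape")


def can_export(s):
    """Return True if s survives a strict UTF-8 encode (e.g. to a socket or a file)."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Lone surrogates are rejected by the strict codec.
        return False


def list_raw(path_bytes):
    """Raw-bytes alternative: a bytes path makes os.listdir() return bytes names."""
    return os.listdir(path_bytes)
```

With this, can_export(text) is False, while any properly encoded name passes; list_raw(os.fsencode(some_dir)) never has to decode anything.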
You seem to forget that (disregarding OSX here, since it already enforces UTF-8) the majority of file names on Posix systems will be encoded correctly.
Apparently under certain circumstances (external FS mounted) OSX can also have non-UTF-8 filenames. [...]
With the filenames decoded by UTF-8, your files named têste, ô, dossié will be displayed and handled correctly. The others are *invalid* in the filesystem encoding UTF-8 and therefore would be represented by something like
u'dir\uXXffname' where XX is some private use Unicode namespace. It won't look pretty when printed, but then, what do other applications do? They e.g. display a question mark as you show above, which is not better in terms of readability.
But it will work when given to a filename-handling function. Valid filenames can be compared to Unicode strings.
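A minimal sketch of that roundtrip claim, again assuming Python 3's later "surrogateescape" error handler in place of the private-use mapping described above (so the escaped character is a lone surrogate rather than \uXXff). The helper names and sample file names are hypothetical.

```python
def decode_name(raw):
    # Assumption: surrogateescape stands in for the modified UTF-8 codec.
    return raw.decode("utf-8", "surrogateescape")


def encode_name(name):
    # Reverses decode_name(), restoring the original bytes exactly.
    return name.encode("utf-8", "surrogateescape")


# "teste" with an e-circumflex, correctly encoded as UTF-8.
valid = "t\u00easte".encode("utf-8")
# A name that is invalid in UTF-8 because of the 0xff byte.
invalid = b"dir\xffname"

# Valid names decode to ordinary strings and compare equal to them.
assert decode_name(valid) == "t\u00easte"

# Invalid names do not print nicely, but they roundtrip byte for byte,
# so they can still be handed back to a filename-handling function.
assert encode_name(decode_name(invalid)) == invalid
```

The decoded form of the invalid name is only meaningful to code that encodes it back with the same handler, which is the trade-off under discussion here.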
A real-world example: OpenOffice can't open files with invalid bytes in their name. They are displayed in the "Open file" dialog, but trying to open them fails. This regularly drives me crazy. Let's not make Python fail this way too, or, even worse, not even display those filenames.
How can it *regularly* drive you crazy when "the majority of file names [...] encoded correctly" (as you assert above)?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)