[Python-3000] New proposition for Python3 bytes filename issue

Guido van Rossum guido at python.org
Tue Sep 30 15:59:42 CEST 2008


On Mon, Sep 29, 2008 at 11:22 PM, Georg Brandl <g.brandl at gmx.net> wrote:
> No, that was not what I meant (although it is another possibility). As I wrote,
> Martin's proposal that I support here is using the modified UTF-8 codec that
> successfully roundtrips otherwise invalid UTF-8 data.

I thought that the "successful roundtripping" pretty much stopped as
soon as the unicode data is exported to somewhere else -- doesn't it
contain invalid surrogate sequences?
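
A minimal sketch of this concern. It assumes the "surrogateescape" error
handler, which was added to Python 3 after this thread, as a stand-in for
the utf-8b codec under discussion; the byte string and the helper name
can_export are made up for illustration.

```python
# Assumption: "surrogateescape" stands in for the utf-8b codec here.
raw = b"dir\xffname"  # hypothetical filename; 0xFF is never valid UTF-8

# Decoding maps the undecodable byte 0xFF to the lone surrogate U+DCFF.
text = raw.decode("utf-8", "surrogateescape")

# Inside Python the round trip works with the same error handler.
round_tripped = text.encode("utf-8", "surrogateescape")


def can_export(s):
    """Return True if s can be encoded as strict, standard UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# A strict encoder, as used by most consumers outside Python,
# rejects the lone surrogate.
exported_ok = can_export(text)
```

Under these assumptions the bytes survive only as long as both ends agree
on the special error handler; any strict UTF-8 consumer rejects the string.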

In general, I'm very reluctant to use utf-8b given that it doesn't
seem to be well documented as a standard anywhere. Providing some
minimal APIs that can process raw-bytes filenames still makes more
sense -- it is mostly analogous to our treatment of text files, where
the underlying binary data is also accessible.
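
A rough sketch of what such a raw-bytes API looks like in practice, using
calls that exist in Python 3 (os.listdir accepts a bytes path and then
returns bytes names; os.fsencode is available from 3.2). The helper
list_raw and the file name are hypothetical, and this is an illustration
rather than the specific API proposed in this thread.

```python
import os
import tempfile


def list_raw(directory):
    """List directory entries as bytes by passing a bytes path."""
    return sorted(os.listdir(os.fsencode(directory)))


with tempfile.TemporaryDirectory() as tmp:
    # Build the path from bytes so no filename decoding is involved.
    raw_path = os.path.join(os.fsencode(tmp), b"example.txt")

    # Text-mode access: Python encodes the contents for us.
    with open(raw_path, "w", encoding="utf-8") as f:
        f.write("t\u00easte")

    # Bytes in, bytes out: names come back undecoded.
    names = list_raw(tmp)

    # The same file's underlying binary data stays accessible.
    with open(raw_path, "rb") as f:
        data = f.read()
```

The name used here is plain ASCII so the sketch runs on any platform; on a
Posix system the same calls are expected to work for names that are not
valid in the filesystem encoding.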

> You seem to forget that (disregarding OSX here, since it already enforces
> UTF-8) the majority of file names on Posix systems will be encoded correctly.

Apparently under certain circumstances (external FS mounted) OSX can
also have non-UTF-8 filenames.

[...]

> With the filenames decoded by UTF-8, your files named têste, ô, dossié will
> be displayed and handled correctly. The others are *invalid* in the filesystem
> encoding UTF-8 and therefore would be represented by something like
>
> u'dir\uXXffname' where XX is some private use Unicode namespace. It won't look
> pretty when printed, but then, what do other applications do? They e.g. display
> a question mark as you show above, which is not better in terms of readability.
>
> But it will work when given to a filename-handling function. Valid filenames
> can be compared to Unicode strings.
>
> A real-world example: OpenOffice can't open files with invalid bytes in their
> name. They are displayed in the "Open file" dialog, but trying to open fails.
> This regularly drives me crazy. Let's not make Python not work this way too,
> or, even worse, not even display those filenames.

How can it *regularly* drive you crazy when "the majority of file names
[...] encoded correctly" (as you assert above)?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

