Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

29 Sep 2008

      ...
Victor Stinner wrote:
(Thanks Victor for moving this to the list. Having a discussion in the
tracker is really painful, I find.)
...
...
POSIX OS
--------
The default behaviour should be to use unicode and raise an error if
conversion to unicode fails. It should also be possible to work in bytes, by
passing bytes arguments or, for getcwd, an optional argument.
- listdir(unicode) -> unicode and raise an error on invalid filename
I know I keep flipflopping on this one, but the more I think about it
the more I believe it is better to drop those names than to raise an
exception. Otherwise a "naive" program that happens to use
os.listdir() can be rendered completely useless by a single non-UTF-8
filename. Consider the use of os.listdir() by the glob module. If I am
globbing for *.py, why should the presence of a file named b'\xff'
cause it to fail?

Robust programs using os.listdir() should use the bytes->bytes version.
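
A minimal sketch of the "drop those names" behaviour, for illustration only.
The helpers decodable_names() and glob_names() are hypothetical, not part of
any proposal or of the os module, and UTF-8 is assumed as the filesystem
encoding. The input is the kind of list a bytes->bytes listdir would return.

```python
import fnmatch

def decodable_names(raw_names, encoding="utf-8"):
    """Hypothetical helper: decode bytes filenames, silently dropping
    any name that is not valid in the given encoding."""
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # The "naive" caller never sees this entry.
            continue
    return names

def glob_names(raw_names, pattern):
    """Hypothetical helper: glob-style filtering over the decodable names."""
    return fnmatch.filter(decodable_names(raw_names), pattern)

# A directory listing containing one undecodable name.
listing = [b"a.py", b"\xff", b"b.txt"]
matches = glob_names(listing, "*.py")
```

With this behaviour the glob for *.py still succeeds; a program that needs to
see b'\xff' works on the bytes listing directly.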
...
...
- listdir(bytes) -> bytes
- getcwd() -> unicode
- getcwd(bytes=True) -> bytes
- open(): accept bytes or unicode
os.path.*() should accept operations on bytes filenames, but maybe not on
bytes+unicode arguments. os.path.join('directory', b'filename'): raise an
error (or use *implicit* conversion to bytes)?
(Yeah, it should be all bytes or all strings.)

On Mon, Sep 29, 2008 at 9:45 AM, Georg Brandl <g.brandl@gmx.net> wrote:
...
This approach (changing all path-handling functions to accept either bytes
or string, but not both) is doomed in my eyes. First, there are lots of them,
second, they are not only in os.path but in many modules and also in user
code, and third, I see no clean way of implementing them in the specified way.
(Just try to do it with os.path.join as an example; I couldn't find the
good way to write it, only the bad and the ugly...)
It doesn't have to be supported for all operations -- just enough to
be able to access all the system calls, and do the most basic pathname
manipulations (split and join -- almost everything else can be built
out of those).
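
To make the "all bytes or all strings" rule concrete, here is a rough sketch
of a POSIX-style join that accepts either type but refuses to mix them. The
name join_same_type() is hypothetical and this is not how os.path.join is or
was implemented; it only shows one possible shape.

```python
def join_same_type(first, *rest):
    """Hypothetical sketch: join POSIX path components that are either
    all str or all bytes; mixing the two raises TypeError."""
    if not isinstance(first, (str, bytes)):
        raise TypeError("path components must be str or bytes")
    kind = type(first)
    for part in rest:
        if type(part) is not kind:
            raise TypeError("cannot mix str and bytes path components")
    # Pick the separator that matches the component type.
    sep = "/" if kind is str else b"/"
    path = first
    for part in rest:
        if part.startswith(sep):
            path = part            # an absolute component resets the path
        elif not path or path.endswith(sep):
            path += part
        else:
            path += sep + part
    return path

text_path = join_same_type("directory", "filename")
bytes_path = join_same_type(b"directory", b"filename")
```

The only type-dependent piece is the choice of separator; the mixed case
join_same_type('directory', b'filename') raises TypeError rather than
converting implicitly.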
...
If I had to choose, I'd still argue for the modified UTF-8 as filesystem
encoding (if it were UTF-8 otherwise), despite possible surprises when a
such-encoded filename escapes from Python.
I'm having a hard time finding info about UTF-8b. Does anyone have a
decent link?
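
As a rough illustration of the idea behind UTF-8b (undecodable bytes are
carried through as lone surrogates so the original bytes can be recovered),
the sketch below uses the "surrogateescape" error handler that later Python 3
releases provide. That this handler matches every detail of UTF-8b as
discussed here is an assumption, and escapes_cleanly() is a hypothetical
helper.

```python
raw = b"caf\xff.py"

# The invalid byte 0xff becomes the lone surrogate U+DCFF.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
back = name.encode("utf-8", errors="surrogateescape")

def escapes_cleanly(text):
    """Hypothetical helper: can this string leave Python as strict UTF-8?"""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True

leaks = not escapes_cleanly(name)
```

The last line shows the "possible surprise": a name decoded this way cannot
be written out as strict UTF-8 once it escapes from Python.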

I noticed that OSX has a different approach yet. I believe it insists
on valid UTF-8 filenames. It may even require some normalization but I
don't know if the kernel enforces this. I tried to create a file named
b'\xff' and it came out as %ff. Then "rm %ff" worked. So I think it
may be replacing all bad UTF-8 sequences with their % encoding.
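
A sketch that mimics the behaviour observed above, purely as an illustration:
it is a guess at the visible effect, not a description of what the OSX kernel
actually does. The handler name "percent_sketch" and the function
percent_escape_invalid() are hypothetical.

```python
import codecs

def _percent_escape(err):
    """Decode error handler: replace the offending bytes with %xx."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    replacement = "".join("%%%02x" % byte for byte in bad)
    # Resume decoding right after the bad bytes.
    return (replacement, err.end)

codecs.register_error("percent_sketch", _percent_escape)

def percent_escape_invalid(raw):
    """Hypothetical helper: decode as UTF-8, %-escaping invalid bytes."""
    return raw.decode("utf-8", errors="percent_sketch")

escaped = percent_escape_invalid(b"\xff")
```

Valid UTF-8 passes through unchanged; only the bytes the decoder rejects are
rewritten.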

The "set filesystem encoding to be Latin-1" approach has a certain
charm as well, but clearly would be a mistake on OSX, and probably on
other systems too (whenever the user doesn't think in Latin-1).

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)