Victor Stinner wrote:
(Thanks Victor for moving this to the list. Having a discussion in the tracker is really painful, I find.)
POSIX OS
--------
The default behaviour should be to use unicode and raise an error if conversion to unicode fails. It should also be possible to use bytes using bytes arguments and optional arguments (for getcwd).
- listdir(unicode) -> unicode and raise an error on invalid filename
I know I keep flipflopping on this one, but the more I think about it the more I believe it is better to drop those names than to raise an exception. Otherwise a "naive" program that happens to use os.listdir() can be rendered completely useless by a single non-UTF-8 filename. Consider the use of os.listdir() by the glob module. If I am globbing for *.py, why should the presence of a file named b'\xff' cause it to fail? Robust programs using os.listdir() should use the bytes->bytes version.
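To make the two behaviours concrete, here is a minimal sketch, assuming the directory entries arrive as raw bytes and the filesystem encoding is UTF-8. The helper names `decode_names_strict` and `decode_names_lenient` are hypothetical and not part of the os module; they only illustrate "raise on a bad name" versus "drop the bad name".

```python
import fnmatch

def decode_names_strict(raw_names, encoding="utf-8"):
    """Decode every name; one undecodable name raises UnicodeDecodeError."""
    return [name.decode(encoding) for name in raw_names]

def decode_names_lenient(raw_names, encoding="utf-8"):
    """Decode the names that can be decoded and silently drop the rest."""
    decoded = []
    for name in raw_names:
        try:
            decoded.append(name.decode(encoding))
        except UnicodeDecodeError:
            continue  # skip names that are not valid in this encoding
    return decoded

# Hypothetical directory contents, as the OS would hand them back.
raw = [b"setup.py", b"\xff", b"README"]

# Globbing over the lenient result is unaffected by the b'\xff' entry.
py_files = fnmatch.filter(decode_names_lenient(raw), "*.py")
```

With the strict variant the same glob would fail before any pattern matching happens, which is the failure mode described above.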
- listdir(bytes) -> bytes
- getcwd() -> unicode
- getcwd(bytes=True) -> bytes
- open(): accept bytes or unicode
os.path.*() should accept operations on bytes filenames, but maybe not on bytes+unicode arguments. os.path.join('directory', b'filename'): raise an error (or use *implicit* conversion to bytes)?
(Yeah, it should be all bytes or all strings.)

On Mon, Sep 29, 2008 at 9:45 AM, Georg Brandl <g.brandl@gmx.net> wrote:
This approach (changing all path-handling functions to accept either bytes or string, but not both) is doomed in my eyes. First, there are lots of them, second, they are not only in os.path but in many modules and also in user code, and third, I see no clean way of implementing them in the specified way. (Just try to do it with os.path.join as an example; I couldn't find the good way to write it, only the bad and the ugly...)
It doesn't have to be supported for all operations -- just enough to be able to access all the system calls, and do the most basic pathname manipulations (split and join -- almost everything else can be built out of those).
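As a rough illustration of the "all bytes or all strings" rule for join, here is a minimal sketch for POSIX-style paths only. The function name `join_same_type` is hypothetical; this is not how os.path.join is implemented, just one way the type check could look.

```python
def join_same_type(first, *rest):
    """Join POSIX path components that are all str or all bytes."""
    if not isinstance(first, (str, bytes)):
        raise TypeError("path components must be str or bytes")
    kind = type(first)
    sep = "/" if kind is str else b"/"
    path = first
    for part in rest:
        # Refuse mixed arguments instead of converting implicitly.
        if type(part) is not kind:
            raise TypeError("cannot mix str and bytes path components")
        if part.startswith(sep):
            path = part  # an absolute component discards what came before
        elif not path or path.endswith(sep):
            path += part
        else:
            path += sep + part
    return path
```

Calling `join_same_type('directory', b'filename')` raises TypeError, while the all-str and all-bytes calls return a result of the same type as their arguments.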
If I had to choose, I'd still argue for the modified UTF-8 as filesystem encoding (if it were UTF-8 otherwise), despite possible surprises when a filename encoded this way escapes from Python.
I'm having a hard time finding info about UTF-8b. Does anyone have a decent link?

I noticed that OSX has a different approach yet. I believe it insists on valid UTF-8 filenames. It may even require some normalization but I don't know if the kernel enforces this. I tried to create a file named b'\xff' and it came out as %ff. Then "rm %ff" worked. So I think it may be replacing all bad UTF-8 sequences with their % encoding.

The "set filesystem encoding to be Latin-1" approach has a certain charm as well, but clearly would be a mistake on OSX, and probably on other systems too (whenever the user doesn't think in Latin-1).

--
--Guido van Rossum (home page: http://www.python.org/~guido/)