MacOSX fully supports unicode filenames (utf-8 is used throughout), and I'm tempted to set Py_FileSystemDefaultEncoding to "utf8" for OSX. Jack pointed me to a long thread about unicode filenames that took place on python-dev last year, but I can't deduce from it whether there are any disadvantages of setting Py_FileSystemDefaultEncoding.
Setting it seems to work wonderfully. However, I'm a bit surprised that os.listdir() doesn't return unicode strings. Is that because it would break too much code?
BTW, if I try to create a file with an 8-bit filename which is _not_ valid utf-8, I get a strange error:
    >>> f = open("\xff", "w")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    IOError: invalid mode: w
This exception is thrown when errno is EINVAL, which apparently can also mean that the filename arg is bad. Not sure if we can fix this.
Just
MacOSX fully supports unicode filenames (utf-8 is used throughout), and I'm tempted to set Py_FileSystemDefaultEncoding to "utf8" for OSX. Jack pointed me to a long thread about unicode filenames that took place on python-dev last year, but I can't deduce from it whether there are any disadvantages of setting Py_FileSystemDefaultEncoding.
Setting it seems to work wonderfully. However, I'm a bit surprised that os.listdir() doesn't return unicode strings. Is that because it would break too much code?
I think that's shallow: the special-casing of unicode_file_names() only exists in the Windows branch of the code.
BTW, if I try to create a file with an 8-bit filename which is _not_ valid utf-8, I get a strange error:
    >>> f = open("\xff", "w")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    IOError: invalid mode: w
This exception is thrown when errno is EINVAL, which apparently can also mean that the filename arg is bad. Not sure if we can fix this.
I think we should (maybe we already do) check the mode string more carefully ourselves, and not rely on undocumented correlations between error returns.
--Guido van Rossum (home page: http://www.python.org/~guido/)
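The suggestion above, checking the mode string before the OS ever sees it, can be sketched roughly as follows. This is only an illustration written for a current Python 3 interpreter, not the check that CPython performs; `check_mode` and `open_checked` are hypothetical names, and the accepted character set is an assumption.

```python
def check_mode(mode):
    """Hypothetical helper: reject a bad mode string before calling the OS."""
    if not isinstance(mode, str) or not mode:
        raise ValueError("mode must be a non-empty string")
    # The first character selects the basic operation: read, write or append.
    if mode[0] not in ("r", "w", "a"):
        raise ValueError("invalid mode: %r" % (mode,))
    # Assumption: only these modifiers may follow, each at most once.
    rest = mode[1:]
    for ch in rest:
        if ch not in ("b", "t", "+"):
            raise ValueError("invalid mode: %r" % (mode,))
    if len(set(rest)) != len(rest) or ("b" in rest and "t" in rest):
        raise ValueError("invalid mode: %r" % (mode,))
    return mode


def open_checked(path, mode="r"):
    """Validate the mode first, so a later EINVAL cannot be blamed on it."""
    check_mode(mode)
    return open(path, mode)
```

With the mode validated up front, an EINVAL coming back from the OS no longer has to be reported as "invalid mode"; it can be attributed to the filename or another argument instead.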
Guido van Rossum wrote:
Setting it seems to work wonderfully. However, I'm a bit surprised that os.listdir() doesn't return unicode strings. Is that because it would break too much code?
I think that's shallow: the special-casing of unicode_file_names() only exists in the Windows branch of the code.
I've uploaded a tentative patch to posixmodule.c that makes os.listdir() return unicode strings if Py_FileSystemDefaultEncoding is set:
http://python.org/sf/683592
I'm not at all sure there's no danger in doing this, but I thought perhaps an actual patch makes discussing this easier.
Just
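To make the idea concrete, here is a rough Python-level sketch of what "return unicode strings when a filesystem encoding is known" could look like. It is not the patch linked above (which changes C code in posixmodule.c); `decode_names` is a hypothetical helper, and keeping undecodable names as raw bytes is an assumption made for this illustration.

```python
import sys


def decode_names(names, encoding=None):
    """Hypothetical helper: decode byte filenames, keeping undecodable ones as bytes."""
    if encoding is None:
        # Assumption: fall back to the interpreter's filesystem encoding.
        encoding = sys.getfilesystemencoding()
    result = []
    for name in names:
        if isinstance(name, bytes):
            try:
                name = name.decode(encoding)
            except UnicodeDecodeError:
                # Leave the name untouched so it can still be passed back to the OS.
                pass
        result.append(name)
    return result
```

The fallback illustrates one possible danger: callers may receive a mix of decoded and raw names whenever a directory contains a name that is not valid in the filesystem encoding.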
Just van Rossum wrote:
[...] BTW, if I try to create a file with an 8-bit filename which is _not_ valid utf-8, I get a strange error:
    >>> f = open("\xff", "w")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    IOError: invalid mode: w
This exception is thrown when errno is EINVAL, which apparently can also mean that the filename arg is bad. Not sure if we can fix this.
But when the system default encoding (i.e. sys.getdefaultencoding()) and the file system encoding are different, I'd say the filename has to be transcoded from the system default encoding to the filesystem encoding before it is used.
Bye, Walter Dörwald
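The transcoding step proposed above amounts to a decode followed by an encode. A minimal sketch, assuming the two encodings are passed in explicitly (`transcode_filename` is a hypothetical name, and this is not how open() behaves):

```python
def transcode_filename(raw, source_encoding, fs_encoding):
    """Hypothetical helper: re-encode a byte filename for the filesystem."""
    # Decode with the encoding the caller's bytes are assumed to be in,
    # then encode with the encoding the filesystem expects.
    return raw.decode(source_encoding).encode(fs_encoding)


# Example: a Latin-1 encoded name converted for a UTF-8 filesystem.
latin1_name = b"caf\xe9"
utf8_name = transcode_filename(latin1_name, "latin-1", "utf-8")
```

Note that the decode step raises UnicodeDecodeError when the bytes are not valid in the source encoding, so this approach rejects names that the pass-through behaviour would hand to the OS unchanged.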
Walter Dörwald wrote:
Just van Rossum wrote:
[...] BTW, if I try to create a file with an 8-bit filename which is _not_ valid utf-8, I get a strange error:
    >>> f = open("\xff", "w")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    IOError: invalid mode: w
This exception is thrown when errno is EINVAL, which apparently can also mean that the filename arg is bad. Not sure if we can fix this.
But when the system default encoding (i.e. sys.getdefaultencoding()) and the file system encoding are different, I'd say the filename has to be transcoded from the system default encoding to the filesystem encoding before it is used.
In most places (probably all, unless there's a bug) Py_FileSystemDefaultEncoding only has relevance for unicode strings: 8-bit strings are passed to the underlying calls unaltered. So the above traceback is the result of the _OS_ refusing to name a file "\xff", which is natural as this particular OS (OSX) uses UTF-8 as the native file system encoding and "\xff" is not valid UTF-8. (I was actually pleasantly surprised the OS actually _cares_ ;-)
Just
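The claim that "\xff" is not valid UTF-8 is easy to check from Python itself. The sketch below uses a current Python 3 interpreter, where 8-bit strings are `bytes` objects; `is_valid_utf8` is a hypothetical helper name.

```python
def is_valid_utf8(raw):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The byte 0xFF never appears in well-formed UTF-8.
bad = is_valid_utf8(b"\xff")
# U+00FF encoded as UTF-8 is the two-byte sequence C3 BF.
good = is_valid_utf8("\u00ff".encode("utf-8"))
```

A check like this could be run before calling open() to produce a clearer error than the EINVAL-derived message quoted in the thread.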
Just van Rossum wrote:
Walter Dörwald wrote:
[...]
But when the system default encoding (i.e. sys.getdefaultencoding()) and the file system encoding are different, I'd say the filename has to be transcoded from the system default encoding to the filesystem encoding before it is used.
In most places (probably all, unless there's a bug) Py_FileSystemDefaultEncoding only has relevance for unicode strings: 8-bit strings are passed to the underlying calls unaltered.
That's exactly the problem. Strings passed to open() must always be UTF-8 encoded, so open() is essentially a unicode API. Passing 8-bit strings to that function should always go through that unicode API, i.e. they should be treated as any other 8-bit string in the unicode context. This means it must be decoded from the default encoding.
So the above traceback is the result of the _OS_ refusing to name a file "\xff", which is natural as this particular OS (OSX) uses UTF-8 as the native file system encoding and "\xff" is not valid UTF-8. (I was actually pleasantly surprised the OS actually _cares_ ;-)
Bye, Walter Dörwald
Walter Dörwald wrote:
But when the system default encoding (i.e. sys.getdefaultencoding()) and the file system encoding are different, I'd say the filename has to be transcoded from the system default encoding to the filesystem encoding before it is used.
In most places (probably all, unless there's a bug) Py_FileSystemDefaultEncoding only has relevance for unicode strings: 8-bit strings are passed to the underlying calls unaltered.
That's exactly the problem. Strings passed to open() must always be UTF-8 encoded, so open() is essentially a unicode API.
(On platforms on which utf-8 is the file system encoding, yes.)
Passing 8-bit strings to that function should always go through that unicode API, i.e. they should be treated as any other 8-bit string in the unicode context. This means it must be decoded from the default encoding.
Well, that's not how it currently works and changing that will break code. I'm not sure about the rationale of the current semantics, but I assume it has to do with compatibility with non-unicode-aware code.
Just
I'm not sure I have followed this completely, but:
(On platforms on which utf-8 is the file system encoding, yes.)
Passing 8-bit strings to that function should always go through that unicode API, i.e. they should be treated as any other 8-bit string in the unicode context. This means it must be decoded from the default encoding.
The problem is that some file system related functions will return strings *already in* the "file system encoding" - ie, on Windows, some functions will return mbcs encoded filenames. Thus, there is a round-trip problem - if you get a filename from os.listdir(), you could not pass it to open() without lots of head-scratching. The default file system encoding allows you to assume that 8 bit strings passed to open are pre-encoded strings - ie, are likely to have previously come directly from another API function.
IIRC, the current rules on Windows are:
* Pass a Unicode filename, and Python calls the Unicode versions of the API.
* Pass a string, and it is assumed the string is *already* in the default file system encoding, so the string is not re-encoded.
Mark.
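The round-trip concern can be illustrated with a current Python 3 interpreter, where the analogous rule is that os.listdir() returns the same type it was given and byte names are handed to the OS unaltered. This is only a sketch of present-day behaviour, not of the 2003 code under discussion; `listing_round_trip` and the file name `example.txt` are made up for the example.

```python
import os
import tempfile


def listing_round_trip():
    """List a directory as text and as bytes, then reopen a file by its byte name."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "example.txt"), "w") as f:
            f.write("hello")
        # A text path yields text names.
        text_names = os.listdir(tmp)
        # A byte path yields names already in the filesystem encoding.
        byte_dir = os.fsencode(tmp)
        byte_names = os.listdir(byte_dir)
        # The byte name goes straight back to open() without re-encoding.
        with open(os.path.join(byte_dir, byte_names[0])) as f:
            content = f.read()
    return text_names, byte_names, content
```

Because the byte name is passed through untouched, whatever os.listdir() returned can always be reopened, which is the property Mark describes.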
participants (4): Guido van Rossum, Just van Rossum, Mark Hammond, Walter Dörwald