Unicode strings as filenames

What's the correct way to deal with filenames in a Unicode environment? Consider this:

>>> import site
>>> site.encoding
'latin-1'
>>> a = "abc\xe4\xfc\xdf.txt"
>>> u = unicode(a, "latin-1")
>>> uu = u.encode("utf-8")
>>> open(a, "w")
<open file 'abcäüß.txt', mode 'w' at 0x823c2a0>
>>> open(u, "w")
<open file 'abcäüß.txt', mode 'w' at 0x823a1e8>
>>> open(uu, "w")
<open file 'abcäüÃ.txt', mode 'w' at 0x81d6160>

If I change my site's default encoding back to ascii, the second open fails:

>>> import site
>>> site.encoding
'ascii'
>>> a = "abc\xe4\xfc\xdf.txt"
>>> u = unicode(a, "latin-1")
>>> uu = u.encode("utf-8")
>>> open(a, "w")
<open file 'abcäüß.txt', mode 'w' at 0x822b448>
>>> open(u, "w")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
>>> open(uu, "w")
<open file 'abcäüÃ.txt', mode 'w' at 0x822d260>

as I expect it should. The third open is a problem as well, even though it succeeds with either encoding. (Why doesn't it fail when the default encoding is ascii?)

My thought is that before using a plain string or a unicode string as a filename, it should first be coerced using the default encoding, something like:

import types
import site

if type(fname) == types.StringType:
    fname = unicode(fname, site.encoding)
elif type(fname) == types.UnicodeType:
    fname = fname.encode(site.encoding)
else:
    raise TypeError, ("unrecognized type for filename: %s" % type(fname))

Is that the correct approach? Apparently Python's file object doesn't do this under the covers. Should it?

Thx,

Skip

Skip:
On Windows NT/2K/XP the right thing to do is to use the wide char open functions such as

_CRTIMP FILE * __cdecl _wfopen(const wchar_t *, const wchar_t *);
_CRTIMP int __cdecl _wopen(const wchar_t *, int, ...);

There may also be techniques for doing this on Windows 9x, as the file system stores Unicode file names, but I have never looked into this.

Neil

Skip> What's the correct way to deal with filenames in a Unicode
Skip> environment? Consider this:
Skip> [Attempts to use encoding]

Neil> On Windows NT/2K/XP the right thing to do is to use the wide char
Neil> open function such as
Neil>
Neil> _CRTIMP FILE * __cdecl _wfopen(const wchar_t *, const wchar_t *);
Neil> _CRTIMP int __cdecl _wopen(const wchar_t *, int, ...);
Neil>
Neil> There may also be techniques for doing this on Windows 9x as the
Neil> file system stores Unicode file names but I have never looked into
Neil> this.

How is this exposed (if at all) to Python programmers? I happen to be developing on Linux, but the eventual delivery platform will be Windows. Is there no way to handle this in a cross-platform way?

Skip

Skip:
How is this exposed (if at all) to Python programmers?
Currently not exposed AFAICT except through calldll.
Cross-platform is tricky, as the file systems used on Linux have narrow string file names. Some higher level software (such as the forthcoming version of GTK+/GNOME) assumes file names are encoded in UTF-8, but this is a somewhat dangerous assumption.

The problem on Windows is that there are files you cannot open by performing encoding operations on the Unicode names. They do have narrow generated names, but these are mangled, look like Z8F22~1.HTM, and so are hard to discover.

Neil

I agree. However:

- Mark decided to take a different route, using fopen all the time, but encoding Unicode strings with the "mbcs" encoding, which calls WideCharToMultiByte with CP_ACP. AFAICT, this is correct as well (although it invokes an unneeded conversion of the string, since fopen, eventually, will convert the string back to Unicode - probably inside CreateFileExA - at least on WinNT). In any case, passing Unicode objects to open() works just fine, at least as long as they can be encoded in the ANSI code page. If you want to open a Chinese file name on a Russian Windows installation, you lose.

- Skip was likely asking about a Unix installation, in which case all of this is irrelevant.
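A minimal sketch of Mark's "mbcs" route described above (Windows-only - the "mbcs" codec does not exist on other platforms; the explicit encode merely mimics what open() does internally with a Unicode argument):

u = u"abc\xe4\xfc\xdf.txt"
narrow = u.encode("mbcs")  # narrowed through the ANSI code page (CP_ACP)
f = open(narrow, "w")      # same effect as open(u, "w") on Windows
f.close()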
There may also be techniques for doing this on Windows 9x as the file system stores Unicode file names but I have never looked into this.
To my knowledge, VFAT32 doesn't - only NTFS does (which is not available on W9x). Regards, Martin

Martin:
I want to be able to open all files on my English W2K install, and can with many applications, even if some have Chinese names and some have Russian.

The big advance W2K made over NT was to have only one real version of the OS instead of multiple language versions. There is a system default language as well as local defaults, but with just a few clicks my machine can be used as a Japanese machine - although, as the keyboard keys don't grow Japanese characters, it is a bit harder to use. You do buy localised versions of W2K and XP, but they differ in packaging and defaults - the underlying code is identical, which was not the case for NT or 9x.

Locales are a really poor choice for people who need to operate in multiple languages, and much software is moving to allowing concurrent use of multiple languages through the use of Unicode. The term 'multinationalization' (m17n) is sometimes used in Japan to talk about systems that try to avoid restrictions on character set and language.
I have a file called u"C:\\z\u0439\u0446.html" on my W2K FAT partition which displays correctly in the explorer and can be opened in, for example, notepad. This leads to the interesting situation of being able to see a file using glob but not then use it:
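(The session showing this was not preserved; the following reconstruction is hypothetical, assuming an English-locale system where the Cyrillic characters have no mapping in the ANSI code page, so the narrow directory listing substitutes '?' for them.)

>>> import glob
>>> glob.glob("C:\\z*.html")
['C:\\z??.html']
>>> open(glob.glob("C:\\z*.html")[0])  # no file actually has the mangled name
Traceback (most recent call last):
  ...
IOError: ...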
Neil

I understand all that, but I can't agree with all your conclusions.
On Windows, locales and Unicode don't contradict each other. You can create files through the locale's code page, and they still end up on disk correctly. This is a much better situation than you have on Unix. In any case, there is no alternative. Locales may be good or bad - you must follow system conventions, if you want to write usable software.
Oops, you are right - the long file name is in Unicode. It is only when you do not have a long file name that the short one is interpreted in OEM encoding.
I agree this is unfortunate; patches are welcome. Please notice that the strategy of using wchar_t API on Windows has explicitly been considered and rejected, for the complexity of the code changes involved. So anybody proposing a patch would need to make it both useful, and easy to maintain. With these constraints, the current implementation is the best thing Mark could come up with. Software always has limitations, which are removed only if somebody is bothered so much as to change the software. Regards, Martin

Martin:
Sure, I'm just putting my point of view, which appears to be different from most in that many developers just use a single locale. If I had a larger supply of time then I'd eventually work on this, but there are other tasks that currently look like having more impact.

The system provided scripting languages support wide character file names. In VBScript:

Set fso = CreateObject("Scripting.FileSystemObject")
crlf = chr(13) & chr(10)
For Each f1 in fso.GetFolder("C:\").Files
    if instr(1, f1.name, ".htm") > 0 then
        s = s & f1.Path & crlf
        if left(f1.name, 1) = "z" then
            fo = fso.OpenTextFile(f1.Path).ReadAll()
            s = s & fo & crlf
        end if
    end if
Next
MsgBox s

And Python with the win32 extensions can do the same using the FileSystemObject:

# encode used here just to make things print as a quick demo
import win32com.client
fso = win32com.client.Dispatch("Scripting.FileSystemObject")
s = ""
fol = fso.GetFolder("C:\\")
for f1 in fol.Files:
    if f1.name.find(".htm") > 0:
        s += f1.Path.encode("UTF-8") + "\r\n"
        if f1.name[0] == u"z":
            fo = fso.OpenTextFile(f1.Path).ReadAll()
            s += fo.encode("UTF-8") + "\r\n"
print s

Neil

The system provided scripting languages support wide character file names.
Please understand that Python also supports wide character file names. It just doesn't allow all the possible values that the system would allow.
For Each f1 in fso.GetFolder("C:\").Files
That, of course, is another important difference: here you get the directory contents as wide strings. Changing os.listdir to return Unicode objects would be possible, but would likely introduce a number of incompatibilities. Your script (e.g. the Python variant) is prepared for .Files returning Unicode objects. Making the same change in Python on all functions that return file names (i.e. listdir, glob, etc.) is difficult - most likely, you'll have to make the return type a choice of the application.

Regards,
Martin
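One way to make the return type "a choice of the application", sketched as a hypothetical wrapper (the helper name, the unicode_names flag, and the encoding guess are all made up for illustration):

import os

def listdir_choice(path, unicode_names=0, encoding="mbcs"):
    # Plain strings by default, so existing callers keep working;
    # Unicode only when the application explicitly asks for it.
    names = os.listdir(path)
    if unicode_names:
        # decode the narrow results with a guessed file system encoding
        names = [unicode(n, encoding) for n in names]
    return names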

[Skip wants open() to handle Unicode on all platforms]

As Martin and Neil have already explained, the handling of national characters in file names is not standardized at all across platforms (not even across file systems on one platform, e.g. on Linux). The only option I see to make this situation less painful is to write a filename subsystem which implements two generic APIs:

1. file open using strings and Unicode
2. file listing using either Unicode or strings with a predefined encoding in the output list

Since this subsystem would be fairly complicated, I'd suggest that someone write a PEP on the topic and then the various experts try to come up with implementations which work on at least some systems, and a fallback implementation which gets used if no other implementation fits.

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/

I think this "pretty much" works in Python 2.2 already. It uses the "mbcs" encoding on Windows, and the locale's encoding on Unix if locale.setlocale has been called (and the C library is good enough). That might be still wrong if the file system expects UTF-8, or a fixed encoding (e.g. on an NTFS or VFAT partition mounted on Linux), but I don't think there is anything that can be done about this: It would be a misconfigured system if then the user doesn't also use an UTF-8 locale.
2. file listing using either Unicode or strings with a predefined encoding in the output list
That is something that certainly needs to be done. Having a PEP on that would be useful. Regards, Martin

"Martin v. Loewis" wrote:
We'd still need to support other OSes as well, though, and I don't think that putting all this code into fileobject.c is a good idea -- after all, opening files is needed by some other parts of Python as well and may also be useful for extensions. I'd suggest implementing something similar to the DLL loading code, which is also implemented as a subsystem in Python.
Yep.

--
Marc-Andre Lemburg, eGenix.com Software GmbH

The stuff isn't in fileobject.c. Py_FileSystemDefaultEncoding is defined in bltinmodule.c. Also, on other OSes: you can pass Unicode objects to open() on all systems. If Py_FileSystemDefaultEncoding is NULL, it will fall back to site.encoding. Of course, if the system has an open function that expects wchar_t*, we might want to use that instead of going through a codec. Offhand, Win32 seems to be the only system where this might work, and even there, it won't work on Win95.
I'd suggest implementing something similar to the DLL loading code, which is also implemented as a subsystem in Python.
I'd say this is over-designed. It is not that there are ten alternative approaches to doing encodings in file names, and we only support two of them, but it is rather that there are only two, and we support all three of them :-) Also, it is more difficult than threads: for threads, there is a fixed set of API features that need to be represented. Doing Py_UNICODE* opening alone is easy, but look at the number of posixmodule functions that all expect file names of some sort. Regards, Martin

Martin v. Loewis wrote:
That's the global, sure but the code using it is scattered across fileobject.c and the posix module. I think it would be a good idea to put all this file naming code into some Python/fileapi.c file which then also provides C APIs for extensions to use. These APIs should then take the file name as PyObject* rather than char* to enable them to handle Unicode directly.
I expect this to become a standard in the next few years.
Doesn't that support the idea of having a small subsystem in Python which exposes the Unicode-aware APIs to Python and its extensions?

--
Marc-Andre Lemburg, eGenix.com Software GmbH

What do you gain by that? Most of the posixmodule functions that take filenames are direct wrappers around the system call. Using another level of indirection is only useful if the fileapi.c functions are used in different places. Notice that each function (open, access, stat, etc.) is used exactly *once* currently, so putting this all into a single place just makes the code more complex. The extension module argument is a red herring: I don't think there are many extension modules out there which want to call access(2) but would like to do so using a PyObject* as the first argument and numbers as the other arguments.
I doubt that. Posix people (including developers of various posixish systems) have frequently rejected that idea in recent years. Even for the most recent system in this respect (OS X), we hear that they still open files with a char*, where char is a byte - the only advancement is that there is a guarantee that those bytes are UTF-8. It turns out that this is all you need: with that guarantee, there is no need for an additional set of APIs. UTF-8 was originally invented precisely to represent file names (it was called FSS-UTF at that time); it is likely that more systems will follow this convention. If so, a global per-system file system encoding is all that's needed. The only problem is that on Windows, MS has already decided that the narrow APIs use the ANSI code page, so they cannot change it to UTF-8 now; that's why Windows will need special casing if people are unhappy with the "mbcs" approach (which some apparently are).
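On a system that guarantees this UTF-8 convention, one encode is indeed all that is needed before calling the ordinary narrow open() (a minimal sketch, assuming such a platform):

name = u"abc\xe4\xfc\xdf.txt"
f = open(name.encode("utf-8"), "w")  # the bytes that reach the OS are UTF-8
f.close()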
No. It is a lot of work, and an additional layer of indirection, with no apparent advantage. Feel free to write a PEP, though. Regards, Martin

Explored the possibility of detecting Unicode arguments to open and using _wfopen on Windows NT. This led to trying to store Unicode strings in the f_name and f_mode fields of the file object, which started to escalate into complexity, making Mark's mbcs choice more understandable.

Another approach is to use utf-8 as the Py_FileSystemDefaultEncoding and then convert to and from it in each file system access function. The core file open function from fileobject.c, changed to work with utf-8, is at the end of this message, with the important lines in the #ifdef MS_WIN32 section. Along with that change goes a change in Py_FileSystemDefaultEncoding to be "utf-8" rather than "mbcs".

This change works for me on Windows 2000 and allows access to all files no matter what the current code page is set to. On Windows 9x (not yet tested), the _wfopen call should fail, causing a fallback to fopen. Possibly the OS should be detected instead, and _wfopen not attempted on 9x. On 9x, mbcs may be a better choice of encoding, although it may also be possible to ask the file system to find the wide character file name and return the mangled short name that can then be used by fopen.

The best approach to me seems to be to make Py_FileSystemDefaultEncoding settable by the user, at least allowing the choice between 'utf-8' and 'mbcs', with a default of 'utf-8' on NT and 'mbcs' on 9x. This approach can be extended to other file system calls with, for example, os.listdir and glob.glob, upon detecting a utf-8 default encoding, using wide character system calls and converting to utf-8.

Please criticise any stylistic or correctness issues in the code as it is my first modification to the Python sources.

Neil

static PyObject *
open_the_file(PyFileObject *f, char *name, char *mode)
{
    assert(f != NULL);
    assert(PyFile_Check(f));
    assert(name != NULL);
    assert(mode != NULL);
    assert(f->f_fp == NULL);

    /* rexec.py can't stop a user from getting the file() constructor --
       all they have to do is get *any* file object f, and then do
       type(f).  Here we prevent them from doing damage with it. */
    if (PyEval_GetRestricted()) {
        PyErr_SetString(PyExc_IOError,
            "file() constructor not accessible in restricted mode");
        return NULL;
    }
    errno = 0;
#ifdef HAVE_FOPENRF
    if (*mode == '*') {
        FILE *fopenRF();
        f->f_fp = fopenRF(name, mode+1);
    }
    else
#endif
    {
        Py_BEGIN_ALLOW_THREADS
#ifdef MS_WIN32
        if (strcmp(Py_FileSystemDefaultEncoding, "utf-8") == 0) {
            PyObject *wname;
            PyObject *wmode;
            wname = PyUnicode_DecodeUTF8(name, strlen(name), "strict");
            wmode = PyUnicode_DecodeUTF8(mode, strlen(mode), "strict");
            if (wname && wmode) {
                f->f_fp = _wfopen(PyUnicode_AS_UNICODE(wname),
                                  PyUnicode_AS_UNICODE(wmode));
            }
            Py_XDECREF(wname);
            Py_XDECREF(wmode);
        }
        if (NULL == f->f_fp) {
            f->f_fp = fopen(name, mode);
        }
#else
        f->f_fp = fopen(name, mode);
#endif
        Py_END_ALLOW_THREADS
    }
    if (f->f_fp == NULL) {
#ifdef NO_FOPEN_ERRNO
        /* Metrowerks only, which does not always set errno */
        if (errno == 0) {
            PyObject *v;
            v = Py_BuildValue("(is)", 0, "Cannot open file");
            if (v != NULL) {
                PyErr_SetObject(PyExc_IOError, v);
                Py_DECREF(v);
            }
            return NULL;
        }
#endif
        if (errno == EINVAL)
            PyErr_Format(PyExc_IOError, "invalid argument: %s", mode);
        else
            PyErr_SetFromErrnoWithFilename(PyExc_IOError, name);
        f = NULL;
    }
    return (PyObject *)f;
}

Now that you have that change, please try to extend it to posixmodule.c. This is where I gave up. Notice that, with changing Py_FileSystemDefaultEncoding and open() alone, you have worsened the situation: os.stat will now fail on files with non-ASCII names on which it works under the mbcs encoding, because Windows won't find the file (correct me if I'm wrong).
It is not just 9x: if you have ten (*) different APIs to open a file, ten different APIs to stat a file, and so on, and have to select some of them at compile time and some of them at run time, it gets messy very quickly.

(*) I'd expect that other systems may also have proprietary system calls to do these things, using either wchar_t* or a proprietary Unicode type.
By the user, or by the application? How can the application make a more educated guess than Python proper? Alternatively, how can the user (or her Administrator) know what value to put in there? On Windows, probably neither is a good idea; if the file system default encoding is used in the future, fixing it at mbcs is the best I can think of.
Please criticise any stylistic or correctness issues in the code as it is my first modification to the Python sources.
The code looks fine. I'd encourage you to continue on that topic; just expect that it will need many more rounds for completion. Regards, Martin

Martin v. Loewis:
Now that you have that change, please try to extend it to posixmodule.c. This is where I gave up.
OK. os.open, os.stat, and os.listdir now work. Placed temporarily at http://pythoncard.sourceforge.net/posixmodule.c

os.stat is ugly because the posix_do_stat function is parameterised over a stat function pointer, but it is always _stati64 on Windows, so the patch just assumes _wstati64 is right.

os.listdir returns Unicode objects rather than strings. This makes glob.glob work as well, so my earlier script that finds the *.html files and opens them works. Unfortunately, I expect most callers of glob() will be expecting narrow strings.
If you give it a file name encoded in the current code page then it may fail where it did not before. Neil
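A hypothetical session showing what the patched posixmodule enables on Windows 2000 (output reconstructed for illustration, using the file named in the earlier message):

>>> import glob
>>> glob.glob(u"C:\\z*.html")
[u'C:\\z\u0439\u0446.html']
>>> f = open(glob.glob(u"C:\\z*.html")[0])  # now succeeds, via _wfopen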

Looks good. The posix_do_stat changes contain an error; you have put Python API calls inside the BEGIN_ALLOW_THREADS block. That is wrong: you must always hold the interpreter lock when calling Python API. Also, when calling _wstati64, you might want to assert that the function pointer is _stati64. Likewise, the code inside posix_open should hold the interpreter lock.
That is not that much of a problem; we could try to define an API where it is the caller's choice. However, the size of your changes is really disturbing here. There used to be already four versions of listing a directory; now you've added a fifth one. And it isn't even clear whether this code works on W9x, is it? There must be a way to fold the different Windows versions into a single one; perhaps it is acceptable to drop Win16 support.

I think three different versions should be offered to the end user:

- path is plain string, result is list of plain strings
- path is Unicode string, result is list of Unicode strings
- path is Unicode string, result is list of plain strings

Perhaps one could argue that the third version isn't really needed: anybody passing Unicode strings to listdir should be expected to get them back also. That would leave us with two functional features on Windows.

I envision a fragment that looks like this:

#ifdef windows
    if (argument is unicode string) {
#define strings wide
#include "listdir_win.h"
#undef strings
    }
    else {
        convert argument to string
#define strings narrow
#include "listdir_win.h"
#undef strings
    }
#endif

If you provide a similar listdir_posix and listdir_os2, it should be possible to get a uniform implementation.
I was actually talking about stat as a function that you haven't touched, yet. Now, os.rename will fail if you pass two Unicode strings referring to non-ASCII file names. posix_1str and posix_2str are like the stat implementation, except that you cannot know statically what the function pointer is. Regards, Martin

Marc-Andre Lemburg:
I started work on this in C++ for my SciTE editor a couple of months ago but the design started to include stuff like 'are these two paths pointing at one file', converting between OpenVMS and Unix paths, and handling URLs (at least using ftp and http). My brain threatened to explode if it got any more complex so it got moved to the 'future niceness' pile. Neil

Neil Hodgson wrote:
I believe that we could do well with the following assumptions:

a) strings passed to open() use whatever encoding is needed by the file system
b) Unicode passed to open() is converted to whatever the file system needs by the open() API

This doesn't cover all the possibilities, but goes a long way. Joining paths between file systems should really be left to the os.path APIs.

--
Marc-Andre Lemburg, eGenix.com Software GmbH
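Written out as code, the two assumptions might look like this (a hypothetical helper; the name and the fs_encoding parameter are illustrative only, not a proposed API):

def open_fs(name, mode="r", fs_encoding="utf-8"):
    # (a) byte strings are taken to be in the file system's encoding already
    # (b) Unicode is converted to that encoding inside the open() wrapper
    if isinstance(name, unicode):
        name = name.encode(fs_encoding)
    return open(name, mode)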

Setting site.encoding is certainly the wrong thing to do. How can you know all users of your system use latin-1?
On my system, the following works fine
On Unix, your best bet for file names is to trust the user's locale settings. If you do that, open will accept Unicode objects. What is your locale?
Is that the correct approach? Apparently Python's file object doesn't do this under the covers. Should it?
No. There is no established convention, on Unix, for how to do non-ASCII file names. If anything, following the user's locale setting is the most reasonable thing to do; this should be in sync with how the user's terminal displays characters. The Python installation's default encoding is almost useless, and shouldn't be changed.

On Windows, things are much better, since there is a notion of Unicode file names in the system.

Regards,
Martin
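Following the user's locale, as recommended here, might look like this on Unix (a sketch; locale.nl_langinfo and CODESET are only available there):

import locale
locale.setlocale(locale.LC_ALL, "")            # adopt the user's settings
encoding = locale.nl_langinfo(locale.CODESET)  # e.g. 'ISO-8859-1' or 'UTF-8'
f = open(u"abc\xe4\xfc\xdf.txt".encode(encoding), "w")
f.close()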

"Martin" == Martin v Loewis <martin@v.loewis.de> writes:
>> What's the correct way to deal with filenames in a Unicode
>> environment? Consider this:
>>
>> >>> import site
>> >>> site.encoding
>> 'latin-1'

Martin> Setting site.encoding is certainly the wrong thing to do. How
Martin> can you know all users of your system use latin-1?

Why is setting site.encoding appropriate to your environment at the time you install Python wrong? I can't know that all users of my system (whatever the definition of "my system" is) will use latin-1. Somewhere along the way I have to make some assumptions, however. On any given computer I assume the people who install Python will set site.encoding appropriate to their environment.

The example I used was latin-1 simply because the folks I'm working with are in Austria and they came up with the example. I assume the best default encoding for them is latin-1. The application writers themselves will have no problem restricting internal filenames to be ascii. I assume if users want to save files of their own, they will choose characters from the Unicode character set they use most frequently. So, my example used latin-1. I could just as easily have chosen something else.

Martin> On my system, the following works fine
Martin>
Martin> >>> import locale ; locale.setlocale(locale.LC_ALL, "")
Martin> 'LC_CTYPE=de_DE;LC_NUMERIC=de_DE;LC_TIME=de_DE;LC_COLLATE=C;LC_MONETARY=de_DE;LC_MESSAGES=de_DE;LC_PAPER=de_DE;LC_NAME=de_DE;LC_ADDRESS=de_DE;LC_TELEPHONE=de_DE;LC_MEASUREMENT=de_DE;LC_IDENTIFICATION=de_DE'
Martin> >>> a = "abc\xe4\xfc\xdf.txt"
Martin> >>> u = unicode(a, "latin-1")
Martin> >>> open(u, "w")
Martin> <open file 'abcäüß.txt', mode 'w' at 0x8173e88>

Martin> On Unix, your best bet for file names is to trust the user's
Martin> locale settings. If you do that, open will accept Unicode
Martin> objects.

Martin> What is your locale?

The above setlocale call prints

'LC_CTYPE=en_US;LC_NUMERIC=en_US;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en;LC_NAME=en;LC_ADDRESS=en;LC_TELEPHONE=en;LC_MEASUREMENT=en;LC_IDENTIFICATION=en'

I can't get to the machines in Austria right now to see how their locales are set, though I suspect they haven't fiddled their LC_* environment, because they are having the problems I described.

>> Is that the correct approach? Apparently Python's file object
>> doesn't do this under the covers. Should it?

Martin> No. There is no established convention, on Unix, for how to do
Martin> non-ASCII file names. If anything, following the user's locale
Martin> setting is the most reasonable thing to do; this should be in
Martin> sync with how the user's terminal displays characters. The Python
Martin> installation's default encoding is almost useless, and shouldn't
Martin> be changed.

Martin> On Windows, things are much better, since there is a notion of
Martin> Unicode file names in the system.

This suggests to me that the Python docs need some introductory material on this topic. It appears to me that there are two people in the Python community who live and breathe this stuff: you, Martin, and Marc-André. For most of the rest of us, especially if we've never consciously written code for consumption outside an ascii environment, the whole thing just looks like a quagmire.

Skip

Well, then accept the assumption that almost everybody will use an ASCII superset. That may still be wrong for EBCDIC users, but those are rare on Unix. However, on our typical Unix system, three different encodings are in use: ISO-8859-1 (for tradition), ISO-8859-15 (because it has the Euro), and UTF-8 (because it removes all the limitations). Notice that all of our users speak German, and we still could not set a meaningful site.encoding except 'ascii'.
On any given computer I assume the people who install Python will set site.encoding appropriate to their environment.
That is probably wrong. Most users will install precompiled packages, and thus site.py will have the value that the package held, which will be 'ascii' for most packages.
Well, latin-1 does not have a Euro sign, which may be more and more of a problem.
That is a meaningful assumption. However, it is one that you have to make in your application, not one that you should expect users to make in their Python installations.
The above setlocale call prints
'LC_CTYPE=en_US;LC_NUMERIC=en_US;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en;LC_NAME=en;LC_ADDRESS=en;LC_TELEPHONE=en;LC_MEASUREMENT=en;LC_IDENTIFICATION=en'
You may want to extend your system to support the same configuration that your users have, i.e. you might want to install an Austrian locale on your system and set LANG to de_AT. If your system also sets all the LC_ variables for you, I recommend unsetting them - setting LANG is enough (to override all other LC_ variables, setting LC_ALL to de_AT should also work).
Even if they set the environment variables, they'd still have the problem, because your application doesn't call setlocale. I do expect that they have set LANG to de_AT, or de_AT.ISO-8859-1. Perhaps they also have this problem because they use Python 2.1 or earlier.
Well, I'd happily review any introductory material somebody else writes :-) Regards, Martin

participants (4):

- M.-A. Lemburg
- Martin v. Loewis
- Neil Hodgson
- Skip Montanaro