Re: [Python-Dev] Unicode strings as filenames
[Replacing the other mail destinations as I didn't do a reply all last time so python-dev dropped off. You may want to resend your last mail to python-dev.]
I don't think we can drop W9x support for Python 2.3, although I'm still waiting for comments on dropping W3.1 support...
I wouldn't want to drop either.
Sounds good to me. I'm moving back towards not using the 'utf-8' system encoding but rather checking for Unicode arguments and handling them explicitly, even at the cost of code expansion.
That is very good. I don't know what is best for the file name; perhaps it is acceptable to encode it with the file system default encoding (even if it ends up having question marks in it). Programs relying on the file name to be correct are broken, IMO.
My thinking now is that there are two modules here, fileobject and posixmodule, which should be handled differently.

posixmodule is just a library with calls and no state. IIRC there used to be multiple modules, one per OS, and the correct one was chosen and called os. I think it is perfectly reasonable for there to be an extra 'ntos' module that works only on NT, treats all arguments as Unicode (coercing up using the current locale when given narrow strings), and always calls the wide APIs. It would contain the same methods (when available) as os. NT-specific code can use it directly, and sufficiently interested portable client code could say something like

    if nt:
        filesys = ntos
    else:
        filesys = os

This hides away all the code bloat from posix code, ensures there are no regressions in posix while developing and debugging ntos, and allows ntos to just convert all arguments into wide strings without worrying about 9x. Maybe call the module osu if there may be implementations on other OS's like OS X. It could have an enquiry attribute in the module:

    if osu.working:
        filesys = osu
    else:
        filesys = os

fileobject is more complex because it holds two strings as state. The mode can probably be assumed to be ASCII so can be left as a narrow string (although it does have to be widened to call _wfopen), but the name is more complex as some client code may just know that it is always a narrow string and thus die if given a file with a wide name.
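The module-selection idea above can be sketched as runnable client code; note that 'ntos' is entirely hypothetical (it never shipped), so an import probe stands in for the 'if nt:' check:

```python
import os

# Hypothetical: 'ntos' does not exist; the ImportError fallback models
# the "if nt: filesys = ntos else: filesys = os" selection described above.
try:
    import ntos
    filesys = ntos
except ImportError:
    filesys = os

# Portable client code then calls filesys.listdir(), filesys.stat(), etc.
print(filesys is os)
```

On any interpreter without the hypothetical module, the probe falls through to os, so the final line prints True.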
Looks very good indeed. When producing patches, you might want to check line endings: currently, your files are a mix of LF only (which was there before) and CRLF.
OK.
In open_the_file, you are still checking for utf-8; that should be removed also. It seems that open_the_file will always get an initialized file object, so passing name does not seem to be necessary: one could look at f_name.
OK. So why are the name and mode passed when they are already available?
I suggest that f_name stays as a byte string for the moment, and open_the_file gets an optional "original name" or "unicode name" argument, whatever is more convenient. If that is given, open_the_file should consider it, else it should fall back to f_name.
If this is done then the Unicode name should also be available as a field of the object, as those mangled "z??.html" strings are totally useless. I'm now leaning towards making f_name wide, but I'd expect some opposition from backwards-compatibility advocates.
In posixmodule, I cannot see the move towards passing Unicode objects directly, either - I guess you were talking about a future plan, above.
Yes, I'm thinking ahead of the coding. Seeing where I'm already going or about to go wrong.
I cannot see the rationale for wfuncNull - wouldn't passing NULL as a function pointer be sufficient as well?
Yes, must get used to thinking in C again. I don't think I have written C for 8 years. WTF can't I declare variables just when I need them <incoherent cursing and mumbling...> Neil
posixmodule is just a library with calls and no state. IIRC there used to be multiple modules, one per OS, and the correct one was chosen and called os. I think it is perfectly reasonable for there to be an extra 'ntos' module that works only on NT, treats all arguments as Unicode (coercing up using the current locale when given narrow strings), and always calls the wide APIs. It would contain the same methods (when available) as os.
I'd be all in favour of bringing ntmodule back into life, especially if that is to become a module that does not need to work on Win9x. Perhaps it can be compiled twice, once into w9x.pyd and once into nt.pyd, or the common code can be shared by means of #include. I'd also be in favour of killing all 16-bit Windows support in Python for 2.3; not sure whether 16-bit DOS needs to stay.
If this is done then the Unicode name should also be available as a field of the object, as those mangled "z??.html" strings are totally useless.
It is not totally useless. Most users will never see the problem, because their file names represent well in mbcs. In cases where you do get replacement characters, it is still useful, since you may roughly recognize what file it is in debugging output (e.g. the file extension will be ASCII-representable in most applications, and perhaps you get a meaningful path in there also).
I'm now leaning towards making f_name wide, but I'd expect some opposition from backwards-compatibility advocates.
I think the major problem is that performing repr on a file should work. If that turns out to use the repr of the string (can't check right now), instead of raising UnicodeErrors, my opposition to putting Unicode objects into file names is not that strong anymore.
Yes, I'm thinking ahead of the coding. Seeing where I'm already going or about to go wrong.
That looks very good indeed. I was worried about using UTF-8 as file system default encoding, because I believe that this encoding should be mandated by the system API, instead of being our choice. Regards, Martin
I'd also be in favour of killing all 16-bit Windows support in Python for 2.3; not sure whether 16-bit DOS needs to stay.
I think both can be killed. Hans Novak has long stopped supporting his DOS version of Python. --Guido van Rossum (home page: http://www.python.org/~guido/)
There is an out-of-bounds error on Windows when using os.listdir("") which could result in indeterminate behaviour. After parsing the args, it does

    ch = namebuf[len-1];

which indexes before the array as len == 0. Possibly change this to

    ch = (len > 0) ? namebuf[len-1] : '\0';

Neil
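The same pitfall can be guarded against defensively at the Python level; this wrapper is purely illustrative (the real fix belongs in posixmodule.c as shown above):

```python
import os

def safe_listdir(path):
    # Treat the empty string as the current directory instead of letting
    # it reach the platform call that misbehaves on "".
    if not path:
        path = os.curdir
    return os.listdir(path)

print(sorted(safe_listdir("")) == sorted(os.listdir(os.curdir)))
```

With the guard in place, an empty path simply lists the current directory, so the comparison prints True.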
Neil, thanks for the bug report, but can you please submit it to SourceForge? We don't regularly scan the archives of python-dev looking for bugs we haven't fixed yet -- but we do use SF as a reminder (and triage) system. --Guido van Rossum
Martin:
I'd be all in favour of bringing ntmodule back into life, especially if that is to become a module that does not need to work on Win9x. Perhaps it can be compiled twice, once into w9x.pyd and once into nt.pyd, or the common code can be shared by means of #include.
I reversed again; posixmodule now detects Unicode arguments and handles them in UCS-2 rather than converting to UTF-8 and back again. This now looks like the right way to me. The total amount of code bloat is about 8K over a 150K file, and this doesn't appear to be too much for me.

A check is made to see if the platform supports Unicode file names, and if it does not then the old conversion to Py_FileSystemDefaultEncoding is done. This means that Windows 9x should work the same as it currently does. This check is exposed as os.unicodefilenames() so that client code can decide whether to use Unicode.

For other OSs that can support Unicode file names, additional cases can be added into posixmodule. The other platforms (OS X for example) may not provide these functions as taking UCS-2 arguments but instead UTF-8 arguments. They should still work similarly to the NT code but encode into UTF-8 before making system calls.

The basic idea is that if you use a Unicode string for a file or path name in a call, then returned information is in Unicode strings.
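Client code using the proposed check might look like the following sketch; os.unicodefilenames() is the API proposed in this thread and never shipped under that name, so the sketch probes for it with getattr rather than assuming it exists:

```python
import os
import sys

def prefer_unicode_paths():
    # os.unicodefilenames is the proposed (hypothetical) enquiry function;
    # on interpreters without it, fall back to byte-string behaviour.
    probe = getattr(os, "unicodefilenames", None)
    return bool(probe()) if probe is not None else False

name = u"caf\u00e9.txt"
if not prefer_unicode_paths():
    # Old-style fallback: narrow the name via the file system encoding,
    # accepting possible replacement characters.
    name = name.encode(sys.getfilesystemencoding(), "replace")
print(prefer_unicode_paths())
```

On an interpreter without the proposed function, the probe returns False and the code falls back to byte-string names.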
I'm feeling more like making f_name be wide now but I'd expect some opposition now from backwards compatibility advocates.
This is now done.
I think the major problem is that performing repr on a file should work. If that turns out to use the repr of the string (can't check right now), instead of raising UnicodeErrors, my opposition to putting Unicode objects into file names is not that strong anymore.
Changed the repr to display Unicode names using escapes so it does not raise errors. _getfullpathname, which is available from nt and is used in ntpath, now accepts a Unicode argument and returns a Unicode path. Haven't checked ntpath to see if it will work with Unicode.

New code at http://scintilla.sourceforge.net/winunichanges.zip After waiting a while for comments, I'll package this up as a patch. Neil
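The escaping behaviour can be approximated in pure Python; this is a sketch of the idea, not the actual fileobject.c change:

```python
# Escape non-ASCII characters so a file name is always printable,
# roughly what the changed repr does instead of raising UnicodeError.
name = u"z\u00fc.html"
printable = name.encode("unicode_escape").decode("ascii")
print(printable)  # z\xfc.html
```

The name survives display intact (as an escape) rather than being mangled into "z??.html" by a lossy narrow encoding.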
I reversed again; posixmodule now detects Unicode arguments and handles them in UCS-2 rather than converting to UTF-8 and back again. This now looks like the right way to me. The total amount of code bloat is about 8K over a 150K file, and this doesn't appear to be too much for me.
I agree. We still should keep "mbcs", so extension modules that don't want to go through the troubles of special-casing Windows will be able to get it right most of the time.
A check is made to see if the platform supports Unicode file names and if it does not then the old conversion to Py_FileSystemDefaultEncoding is done. This means that Windows 9x should work the same as it currently does. This check is exposed as os.unicodefilenames() so that client code can decide whether to use Unicode.
That has unclear semantics for me. It sounds like "if true, you can pass Unicode strings to open etc." However, then it should return 1 on all systems, since you always can - the default encoding may apply, and restrict file names to ASCII. Or, it may mean "if true, you can pass all Unicode strings to open". This is not true, either, because there are always reserved characters (such as the path delimiter).
For other OSs that can support Unicode file names, additional cases can be added into posixmodule. The other platforms (OS X for example) may not provide these functions as taking UCS-2 arguments but instead UTF-8 arguments. They should still work similarly to the NT code but encode into UTF-8 before making system calls.
I think this is not needed. Instead, setting the file system encoding to UTF-8 should be sufficient.
After waiting a while for comments, I'll package this up as a patch.
Very good. Would you also write the PEP? If not, I will, but that may take some time. Regards, Martin
Martin:
That has unclear semantics for me. It sounds like "if true, you can pass Unicode strings to open etc." However, then it should return 1 on all systems, since you always can - the default encoding may apply, and restrict file names to ASCII. Or, it may mean "if true, you can pass all Unicode strings to open". This is not true, either, because there are always reserved characters (such as the path delimiter).
OK, it means: If true, the underlying system supports file names containing most Unicode characters and any valid file name may be passed to open as a Unicode string. Yes, the "most" is fuzzy but just as with normal strings, the file system gets to put special meaning on delimiters, restrict file name length, and disallow characters such as \u0000.
After waiting a while for comments, I'll package this up as a patch.
Very good. Would you also write the PEP? If not, I will, but that may take some time.
I'll try in the next day or so but may bail if not able to work on it much as I have some backlog from spending time on this rather than other projects. Neil
If true, the underlying system supports file names containing most Unicode characters and any valid file name may be passed to open as a Unicode string.
So what is the value of exposing this to Python? It seems to be Windows-specific, so I doubt it should be generalized. Regards, Martin
Martin:
If true, the underlying system supports file names containing most Unicode characters and any valid file name may be passed to open as a Unicode string.
So what is the value of exposing this to Python? It seems to be Windows-specific, so I doubt it should be generalized.
It differentiates between those systems where open decodes Unicode file names into a particular locale (possibly losing information) and those systems that preserve Unicode file names. The set of systems where this is true could change in the future. A sufficiently motivated Windows 9x user could make it work there, possibly by looking for the long names in the directory data and converting them to short names. When this is false, client code may be prepared to offer a more reasonable error message indicating that the locale may be set incorrectly, or even try multiple locales in order to open a file. Mmm, there is a Japanese character in that file name so I'll try temporarily changing the locale to Japanese to open the file. Neil
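The "try multiple locales" idea can be sketched as a loop over candidate byte encodings; the helper name and encoding list below are illustrative only, not an API from the patch:

```python
import os
import tempfile

# Hypothetical fallback: attempt several byte encodings of a Unicode
# file name until one names an existing file; the encoding list is
# an example, not a fixed set.
def open_with_fallback(unicode_name, encodings=("utf-8", "shift_jis", "latin-1")):
    for enc in encodings:
        try:
            return open(unicode_name.encode(enc), "rb")
        except (UnicodeEncodeError, OSError):
            continue
    raise IOError("no candidate encoding names an existing file: %r" % unicode_name)

# Demo: create a file with a non-ASCII name, then reopen it by trial.
d = tempfile.mkdtemp()
path = os.path.join(d, u"caf\u00e9.txt")
open(path, "wb").close()
f = open_with_fallback(path)
print(f is not None)
f.close()
os.remove(path)
os.rmdir(d)
```

On a system whose file system encoding matches one of the candidates, the trial loop finds the file and the demo prints True.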
[Martin and Neil discussing various ways to add Unicode support to posixmodule]

Guys, this discussion is getting somewhat out of hand. I believe that no-one on python-dev is seriously following this anymore, yet OTOH you are working on a rather important part of the Python file API. I'd suggest to write up the problem and your conclusions as a PEP for everyone to understand before actually starting to check in anything.

One thing I'd like to note (again) is that the code base is getting somewhat confusing in this area. It may be better to rip out the various bits and pieces for each supported platform and put the implementations into separate files -- much like what Greg has done for the DLL import machinery. This will reduce the levels of #ifdefs and make the whole API much more readable and understandable.

-- Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/
I'd suggest to write up the problem and your conclusions as a PEP for everyone to understand before actually starting to check in anything.
We certainly would, if we had achieved any conclusions yet. If you want, we can continue discussion in private. Regards, Martin
"Martin v. Loewis" wrote:
I'd suggest to write up the problem and your conclusions as a PEP for everyone to understand before actually starting to check in anything.
We certainly would, if we had achieved any conclusions yet. If you want, we can continue discussion in private.
No, please keep it on python-dev; at least then the arguments will be kept in the archives. Still, I don't expect anyone here to closely follow the discussion, and with most of the PythonLabs team busy on other tasks, you'll have to find some way to summarize the discussion for them and others to review at some later point in time. PEPs are the right method for this, IMHO. -- Marc-Andre Lemburg
M.-A. Lemburg:
Guys, this discussion is getting somewhat out of hand. I believe that no-one on python-dev is seriously following this anymore, yet OTOH you are working on a rather important part of the Python file API.
I'd suggest to write up the problem and your conclusions as a PEP for everyone to understand before actually starting to check in anything.
OK, PEP 277 is now available from: http://python.sourceforge.net/peps/pep-0277.html Neil
OK, PEP 277 is now available from: http://python.sourceforge.net/peps/pep-0277.html
Looks very good to me, except that the listdir approach (unicode in, unicode out) should apply uniformly to all platforms; I'll provide an add-on patch to your implementation once the PEP is approved. Regards, Martin
"Martin v. Loewis" wrote:
OK, PEP 277 is now available from: http://python.sourceforge.net/peps/pep-0277.html
Looks very good to me, except that the listdir approach (unicode in, unicode out) should apply uniformly to all platforms; I'll provide an add-on patch to your implementation once the PEP is approved.
+1

Some nits:

The restriction when compiling Python in wide mode on Windows should be lifted: The PyUnicode_AsWideChar() API should be used to convert 4-byte Unicode to wchar_t (which is 2-byte on Windows).

Why is "unicodefilenames" a function and not a constant?

I'm still in favour of a file API abstraction layer in Python, but that can be done at some later point (moving the code from the various platform specific modules into a Python/fileapi.c file).

-- Marc-Andre Lemburg
The restriction when compiling Python in wide mode on Windows should be lifted: The PyUnicode_AsWideChar() API should be used to convert 4-byte Unicode to wchar_t (which is 2-byte on Windows).
While I agree that this restriction ought to be removed eventually, I doubt that Python will be usable on Windows with a four-byte Unicode type in any foreseeable future. Just have a look at unicodeobject.c:PyUnicode_DecodeMBCS; it makes the assumption that a Py_UNICODE* is the same thing as a WCHAR*. That means that the "mbcs" encoding goes away on Windows if HAVE_USABLE_WCHAR_T does not hold anymore. Also, I believe most of PythonWin also assumes HAVE_USABLE_WCHAR_T (didn't check, though).
Why is "unicodefilenames" a function and not a constant?
In the Windows binary, you need a run-time check to see whether this is DOS/W9x, or NT/W2k/XP; on DOS, the Unicode API is not available (you still can pass Unicode file names to open and listdir, but they will get converted through the MBCS encoding). So it clearly is not a compile time constant. I'm still not certain what the meaning of this function is. If it means "Unicode file names are only restricted by the file system conventions", then on Unix it may change at run-time, if a user or the application sets a UTF-8 locale, switching from the original "C" locale. Regards, Martin
Martin v. Loewis:
I'm still not certain what the meaning of this function is. If it means "Unicode file names are only restricted by the file system conventions", then on Unix it may change at run-time, if a user or the application sets a UTF-8 locale, switching from the original "C" locale.
The underlying motivation of the function is for code to be able to ask "Is it better to pass Unicode strings to file operations?" For me the main criterion for "better" is whether all files are accessible. It is best to determine this through a test that does not require writing and is not dependent on the user's setup, such as having a "C:" drive. Switching to a UTF-8 locale on Unix will make files inaccessible where their names contain illegal UTF-8 sequences. Neil
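Where a write-based check is acceptable despite Neil's preference for one that avoids writing, the accessibility criterion can be turned into a concrete round-trip probe; this is an illustrative test, not the unicodefilenames() implementation:

```python
import os
import tempfile

def unicode_names_round_trip():
    # Create a file with a non-ASCII Unicode name and check it survives
    # a listdir round trip (illustrative probe only; note the preference
    # above for a check that does not require writing).
    name = u"pr\u00f6be.txt"
    d = tempfile.mkdtemp()
    try:
        open(os.path.join(d, name), "w").close()
        return name in os.listdir(d)
    except (UnicodeError, OSError):
        return False
    finally:
        for entry in os.listdir(d):
            os.remove(os.path.join(d, entry))
        os.rmdir(d)

print(unicode_names_round_trip())
```

On a system whose file name encoding can represent the probe name, the function returns True; on one that cannot, the exception path reports False instead of crashing.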
"Martin v. Loewis" wrote:
The restriction when compiling Python in wide mode on Windows should be lifted: The PyUnicode_AsWideChar() API should be used to convert 4-byte Unicode to wchar_t (which is 2-byte on Windows).
While I agree that this restriction ought to be removed eventually, I doubt that Python will be usable on Windows with a four-byte Unicode type in any foreseeable future.
Perhaps Neil ought to copy your notes to the PEP, so that we don't forget about this issue.
Just have a look at unicodeobject.c:PyUnicode_DecodeMBCS; it makes the assumption that a Py_UNICODE* is the same thing as a WCHAR*. That means that the "mbcs" encoding goes away on Windows if HAVE_USABLE_WCHAR_T does not hold anymore.
Also, I believe most of PythonWin also assumes HAVE_USABLE_WCHAR_T (didn't check, though).
Why is "unicodefilenames" a function and not a constant?
In the Windows binary, you need a run-time check to see whether this is DOS/W9x, or NT/W2k/XP; on DOS, the Unicode API is not available (you still can pass Unicode file names to open and listdir, but they will get converted through the MBCS encoding). So it clearly is not a compile time constant.
I see.
I'm still not certain what the meaning of this function is. If it means "Unicode file names are only restricted by the file system conventions", then on Unix it may change at run-time, if a user or the application sets a UTF-8 locale, switching from the original "C" locale.
Doesn't it mean: "posix functions and file() can accept Unicode file names"? That's what I thought, at least; whether they succeed or not is another question and could well be handled by run-time errors (e.g. on Unix it is not at all clear whether NFS, Samba or some other more exotic file system can handle the encoding chosen by Python or the program). Perhaps we ought to drop that function altogether and let the various file IO functions raise run-time errors instead?! -- Marc-Andre Lemburg
I'm still not certain what the meaning of this function is. If it means "Unicode file names are only restricted by the file system conventions", then on Unix it may change at run-time, if a user or the application sets a UTF-8 locale, switching from the original "C" locale.
Doesn't it mean: "posix functions and file() can accept Unicode file names"?
Neil has given his own interpretation (return true if it is *better* to pass Unicode strings than to pass byte strings). Your property (accepts Unicode) is true on all Python installations since 2.2: if you pass a Unicode string, it will try the file system encoding; if that is NULL, it will try the system encoding. So on all Python systems, open(u"foo.txt","w") currently succeeds everywhere (unless Unicode was completely disabled in the port).
That's what I thought, at least; whether they succeed or not is another question and could well be handled by run-time errors (e.g. on Unix it is not at all clear whether NFS, Samba or some other more exotic file system can handle the encoding chosen by Python or the program).
For NFS, it is clear - file names are null-terminated byte strings (AFAIK). For Samba, I believe it depends on the installation, specifically whether the encoding of Samba matches the one of the user. For more exotic file systems, it is not all that clear.
Perhaps we ought to drop that function altogether and let the various file IO functions raise run-time errors instead?!
That was my suggestion as well. However, Neil points out that, on Windows, passing Unicode is sometimes better: For some files, there is no byte string file name to identify the file (if the file name is not representable in MBCS). OTOH, on Unix, some files cannot be accessed with a Unicode string, if the file name is invalid in the user's encoding. It turns out that only OS X really got it right: For each file, there is both a byte string name, and a Unicode name. Regards, Martin
"Martin v. Loewis" wrote:
[unicodefilenames()] Perhaps we ought to drop that function altogether and let the various file IO functions raise run-time errors instead?!
That was my suggestion as well. However, Neil points out that, on Windows, passing Unicode is sometimes better: For some files, there is no byte string file name to identify the file (if the file name is not representable in MBCS). OTOH, on Unix, some files cannot be accessed with a Unicode string, if the file name is invalid in the user's encoding.
Sounds like the run-time error solution would at least "solve" the issue in terms of making it depend on the used file name and underlying OS or file system. I'd say: let the different file name based APIs try hard enough and then have them bail out if they can't handle the particular case.
It turns out that only OS X really got it right: For each file, there is both a byte string name, and a Unicode name.
I suppose this is due to the fact that Mac file systems store extended attributes (much like what OS/2 does too) along with the file -- that's a really nice way of being able to extend file system semantics on a per-file basis; much better than the Windows Registry or the MIME guess-by-extension mechanisms. Oh well. -- Marc-Andre Lemburg
M.-A. Lemburg, regarding unicodefilenames():
Sounds like the run-time error solution would at least "solve" the issue in terms of making it depend on the used file name and underlying OS or file system.
It is much better to choose a technique that will always work rather than try to recover from a technique that may fail. unicodefilenames() can be dropped in favour of explicit OS and version checks but this is replacing a simple robust check with a more fragile one. unicodefilenames() will allow other environments to declare that client code will be more robust by choosing to use Unicode strings as file name arguments. This could include UTF-8 based systems such as OS X and BeOS, as well as Windows variants like CE. Neil
Neil Hodgson wrote:
M.-A. Lemburg, regarding unicodefilenames():
Sounds like the run-time error solution would at least "solve" the issue in terms of making it depend on the used file name and underlying OS or file system.
It is much better to choose a technique that will always work rather than try to recover from a technique that may fail.
Is it really? The problem is that under some OSes it is possible to work with multiple very different file systems from within a single Python program. In those cases, the unicodefilenames() API wouldn't really help all that much.
unicodefilenames() can be dropped in favour of explicit OS and version checks but this is replacing a simple robust check with a more fragile one.
What kind of checks do you have in mind then? If possible, it should be possible to pass unicodefilenames() a path to check for Unicode capability, since on Unix (and probably Mac OS X as well), the path decides which file system gets the ioctl calls.
unicodefilenames() will allow other environments to declare that client code will be more robust by choosing to use Unicode strings as file name arguments. This could include UTF-8 based systems such as OS X and BeOS, as well as Windows variants like CE.
-- Marc-Andre Lemburg
Is it really? The problem is that under some OSes it is possible to work with multiple very different file systems from within a single Python program. In those cases, the unicodefilenames() API wouldn't really help all that much.
If you are thinking of Unix: It seems unicodefilenames() has to return 0 on Unix, meaning that you need to use byte-oriented file names if you want to access all files (not that you will be able to display all file names to the user, though ... there is nothing we can do to achieve *that*).
unicodefilenames() can be dropped in favour of explicit OS and version checks but this is replacing a simple robust check with a more fragile one.
What kind of checks do you have in mind then? If possible, it should be possible to pass unicodefilenames() a path to check for Unicode capability, since on Unix (and probably Mac OS X as well), the path decides which file system gets the ioctl calls.
I think you are missing the point that unicodefilenames, as defined, does not take any parameters. It says either yay or nay. So it could be replaced in application code with

    if sys.platform == "win32":
        use_unicode_for_filenames = windowsversion in ['nt', 'w2k', 'xp']
    elif sys.platform.startswith("darwin"):
        use_unicode_for_filenames = 1
    else:
        use_unicode_for_filenames = 0

I would not use such code in my applications, nor would I ever use unicodefilenames. Instead, I would just use Unicode file names all the time, and risk that some users have problems with some files. Those users I would tell to fix their systems (i.e. use NT instead of Windows 9x, or use a UTF-8 locale on Unix). Most users will never notice any problem (except for Neil, who likes to put funny file names on his disk :-), so this is a typical 80-20 problem here (or maybe rather 99-1). Regards, Martin
"Martin v. Loewis" wrote:
Is it really? The problem is that under some OSes it is possible to work with multiple very different file systems from within a single Python program. In those cases, the unicodefilenames() API wouldn't really help all that much.
If you are thinking of Unix: It seems unicodefilenames() has to return 0 on Unix, meaning that you need to use byte-oriented file names if you want to access all files (not that you will be able to display all file names to the user, though ... there is nothing we can do to achieve *that*).
Right. I am starting to believe that unicodefilenames() doesn't really provide enough information to make it useful for cross-platform programming.
unicodefilenames() can be dropped in favour of explicit OS and version checks but this is replacing a simple robust check with a more fragile one.
What kind of checks do you have in mind then? If possible, it should be possible to pass unicodefilenames() a path to check for Unicode capability, since on Unix (and probably Mac OS X as well), the path decides which file system gets the ioctl calls.
I think you are missing the point that unicodefilenames, as defined, does not take any parameters. It says either yay or nay. So it could be replaced in application code with
    if sys.platform == "win32":
        use_unicode_for_filenames = windowsversion in ['nt', 'w2k', 'xp']
    elif sys.platform.startswith("darwin"):
        use_unicode_for_filenames = 1
    else:
        use_unicode_for_filenames = 0
Sounds like this would be a good candidate for platform.py which I'll check into CVS soon. With its many platform querying APIs it should easily be possible to add a function which returns the above information based on the platform Python is running on.
I would not use such code in my applications, nor would I ever use unicodefilenames. Instead, I would just use Unicode file names all the time, and risk that some users have problems with some files. Those users I would tell to fix their systems (i.e. use NT instead of Windows 9x, or use a UTF-8 locale on Unix). Most users will never notice any problem (except for Neil, who likes to put funny file names on his disk :-), so this is a typical 80-20 problem here (or maybe rather 99-1).
I doubt that you'll have any luck in trying to convince a user to switch OSes just because Python applications don't cope with native file names. The UTF-8 locale on Unix is also hard to push: e.g. existing latin-1 file names will probably stop working the minute you switch to that locale. (I always leave the setting to "C" and simply don't use locale based file names -- that way I don't run into problems; non-[a-zA-Z0-9\-\._]+ file names are a no-go for cross-platform code if you ask me...) -- Marc-Andre Lemburg
M.-A. Lemburg:
"Martin v. Loewis" wrote:
...

    if sys.platform == "win32":
        use_unicode_for_filenames = windowsversion in ['nt', 'w2k', 'xp']
    elif sys.platform.startswith("darwin"):
        use_unicode_for_filenames = 1
    else:
        use_unicode_for_filenames = 0
Sounds like this would be a good candidate for platform.py which I'll check into CVS soon. With its many platform querying APIs it should easily be possible to add a function which returns the above information based on the platform Python is running on.
OK. I'll remove unicodefilenames() from the PEP and my patch. Neil
Martin v. Loewis:
Most users will never notice any problem (except for Neil, who likes to put funny file names on his disk :-), so this is a typical 80-20 problem here (or maybe rather 99-1).
While Martin is referring to the rarity of having non-native file names on Windows 9x, the problem addressed by PEP 277 is real. Already this year, there have been two enquiries [from Michael Ebert and Guenter Radestock] to comp.lang.python about Unicode file name use on NT. Neil
M.-A. Lemburg:
Is it really? The problem is that under some OSes it is possible to work with multiple very different file systems from within a single Python program. In those cases, the unicodefilenames() API wouldn't really help all that much.
On NT the core file system calls are Unicode based with the narrow string calls being shims on top of this. When mounting non-native file systems, NT may perform name mapping, but that name mapping is 'complete and consistent' in that it is not possible to do anything with the narrow APIs that cannot be achieved with the Unicode APIs.
unicodefilenames() can be dropped in favour of explicit OS and version checks but this is replacing a simple robust check with a more fragile one.
What kind of checks do you have in mind then? If possible, one should be able to pass unicodefilenames() a path to check for Unicode capability, since on Unix (and probably Mac OS X as well), the path decides which file system gets the ioctl calls.
Any platform experts know how this works on MacOS X or BeOS? Do non-native file systems get mapped to Unicode names so that UTF-8 will always work? Neil
On Thursday, January 17, 2002, at 08:31 PM, Neil Hodgson wrote:
What kind of checks do you have in mind then? If possible, one should be able to pass unicodefilenames() a path to check for Unicode capability, since on Unix (and probably Mac OS X as well), the path decides which file system gets the ioctl calls.
Any platform experts know how this works on MacOS X or BeOS? Do non-native file systems get mapped to Unicode names so that UTF-8 will always work?
For Mac OS X: yes, that is how it is supposed to work.
--
- Jack Jansen
Sounds like the run-time error solution would at least "solve" the issue in terms of making it depend on the used file name and underlying OS or file system.
Such a solution is impossible to implement in some cases. E.g. on Windows, if you use the ANSI (*A) APIs to list the directory contents, Windows will *silently* (AFAIK) give you incorrect file names, i.e. it will replace unrepresentable characters with the replacement character (QUESTION MARK). OTOH, on Unix, there is a better approach for listdir and unconvertible names: just return the byte strings to the user.
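The silent mangling Martin describes can be reproduced in pure Python: encoding to a narrow codec with errors="replace" substitutes the same question marks the ANSI layer would. This is an illustration only; the real replacement happens inside Windows, before Python ever sees the names:

```python
# A file name with characters no single narrow code page covers.
name = u"caf\u00e9 \u4e2d\u6587.txt"

# Roughly what a *A API would hand back on a system whose ANSI code
# page cannot represent these characters (using ascii as a stand-in
# for the system code page):
mangled = name.encode("ascii", "replace")
print(mangled)  # b'caf? ??.txt'
```

The crucial point is that no exception is raised: the caller gets a plausible-looking but wrong name back.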
I'd say: let the different file name based APIs try hard enough and then have them bail out if they can't handle the particular case.
That is a good idea. However, in case of the WinNT replacement strategy, the application may still want to know. Passing *in* Unicode objects is no issue at all: If they cannot be converted to a reasonable file name, you clearly get an exception.
It turns out that only OS X really got it right: For each file, there is both a byte string name, and a Unicode name.
I suppose this is due to the fact that Mac file systems store extended attributes (much like what OS/2 does too) along with the file -- that's a really nice way of being able to extend file system semantics on a per-file basis; much better than the Windows Registry or the MIME guess-by-extension mechanisms.
I'd assume it is different: They just *define* that all local file systems they have control over use UTF-8 on disk, at least for BSD ufs; for HFS, it might be that they 'just know' what encoding is used on an HFS partition. I doubt they use extended attributes for this, as they reportedly return UTF-8 even for file systems they've never seen before; this may be either due to static knowledge (e.g. that VFAT is UCS-2LE), or through guessing. It may be that there are also limitations and restrictions, but at least they remove the burden from the application. Regards, Martin
"Martin v. Loewis" wrote:
Sounds like the run-time error solution would at least "solve" the issue in terms of making it depend on the used file name and underlying OS or file system.
Such a solution is impossible to implement in some cases. E.g. on Windows, if you use the ANSI (*A) APIs to list the directory contents, Windows will *silently* (AFAIK) give you incorrect file names, i.e. it will replace unrepresentable characters with the replacement character (QUESTION MARK).
Samba does the same for mounted Windows shares, BTW.
OTOH, on Unix, there is a better approach for listdir and unconvertible names: just return the byte strings to the user.
Sigh.
I'd say: let the different file name based APIs try hard enough and then have them bail out if they can't handle the particular case.
That is a good idea. However, in case of the WinNT replacement strategy, the application may still want to know.
Passing *in* Unicode objects is no issue at all: If they cannot be converted to a reasonable file name, you clearly get an exception.
True and that's good :-)

--
Marc-Andre Lemburg
On Thursday, January 17, 2002, at 12:42 PM, Martin v. Loewis wrote:
I suppose this is due to the fact that Mac file systems store extended attributes (much like what OS/2 does too) along with the file -- that's a really nice way of being able to extend file system semantics on a per-file basis; much better than the Windows Registry or the MIME guess-by-extension mechanisms.
I'd assume it is different: They just *define* that all local file systems they have control over use UTF-8 on disk, at least for BSD ufs; for HFS, it might be that they 'just know' what encoding is used on an HFS partition. I doubt they use extended attributes for this, as they reportedly return UTF-8 even for file systems they've never seen before; this may be either due to static knowledge (e.g. that VFAT is UCS-2LE), or through guessing.
It's actually a whole lot simpler: for filesystems with an encoding that is open to interpretation the user specifies it during mount :-)
--
- Jack Jansen
Also, I believe most of PythonWin also assumes HAVE_USABLE_WCHAR_T (didn't check, though).
FYI, all the win32 extensions use their own Unicode API. These extensions had Unicode before Python did! These wrapper functions are abstract enough that they should be able to withstand any changes to Python's Unicode implementation quite simply - probably at the cost of extra copies and transformations in those wrappers. Mark.
M.-A. Lemburg:
The restriction when compiling Python in wide mode on Windows should be lifted: The PyUnicode_AsWideChar() API should be used to convert 4-byte Unicode to wchar_t (which is 2-byte on Windows).
I'd prefer not to include this as it adds complexity for little benefit but am prepared to do the implementation if it is required. Neil
Neil Hodgson wrote:
M.-A. Lemburg:
The restriction when compiling Python in wide mode on Windows should be lifted: The PyUnicode_AsWideChar() API should be used to convert 4-byte Unicode to wchar_t (which is 2-byte on Windows).
I'd prefer not to include this as it adds complexity for little benefit but am prepared to do the implementation if it is required.
Point taken, but please mention this in the PEP.

--
Marc-Andre Lemburg
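For reference, the lifted restriction would boil down to splitting non-BMP code points into UTF-16 surrogate pairs, which is all a PyUnicode_AsWideChar-style conversion has to do on a platform with 2-byte wchar_t. A pure-Python illustration of the arithmetic (not the C API itself):

```python
ch = 0x10400  # DESERET CAPITAL LETTER LONG I, outside the BMP

# A 4-byte (UCS-4) code point becomes two 2-byte wchar_t units.
offset = ch - 0x10000
hi = 0xD800 + (offset >> 10)    # high (lead) surrogate
lo = 0xDC00 + (offset & 0x3FF)  # low (trail) surrogate

# Python's UTF-16 codec performs exactly the same split.
assert chr(ch).encode("utf-16-le") == bytes(
    [hi & 0xFF, hi >> 8, lo & 0xFF, lo >> 8])
```

The logic is simple, but Neil's point stands: it is extra code on every call path that passes file names to the wide Windows APIs.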
Martin v. Loewis:
OK, PEP 277 is now available from: http://python.sourceforge.net/peps/pep-0277.html
Looks very good to me, except that the listdir approach (unicode in, unicode out) should apply uniformly to all platforms; I'll provide an add-on patch to your implementation once the PEP is approved.
Won't this lead to a less useful result as Py_FileSystemDefaultEncoding will be NULL on, for example, Linux, so if there are names containing non-ASCII characters then it will either raise an exception or stick '?'s in the names. So it would be better to use narrow strings there as that will pass through all file names. You have probably already realised, but Windows 9x will also need a Unicode preserving listdir but it will have to encode using mbcs. Neil
Won't this lead to a less useful result as Py_FileSystemDefaultEncoding will be NULL on, for example, Linux, so if there are names containing non-ASCII characters then it will either raise an exception or stick '?'s in the names. So it would be better to use narrow strings there as that will pass through all file names.
On Linux, if the user has set LANG to a reasonable value, and the Python application has invoked setlocale(), Py_FileSystemDefaultEncoding will not be NULL. It still might happen that an individual file name cannot be decoded from the file system encoding, e.g. if the locale is set to UTF-8, but you have a Latin-1 file name (created by a different user). In that exceptional case, I would neither expect an exception, nor expect replacement characters in the Unicode string, but instead use a byte string *for this specific file name*. Just because there is the rare chance that you cannot meaningfully interpret a certain file name does not mean that all other installations have to suffer.
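Martin's per-name fallback strategy could be sketched as follows. listdir_mixed is a hypothetical name, and modern Python ended up solving this differently, but the decoding logic is the point: Unicode for names that decode cleanly, raw bytes only for the odd exception:

```python
import os

def listdir_mixed(path, encoding="utf-8"):
    # Decode each name from the assumed file system encoding; keep
    # the raw byte string only for names that do not decode cleanly.
    result = []
    for name in os.listdir(os.fsencode(path)):  # bytes in, bytes out
        try:
            result.append(name.decode(encoding))
        except UnicodeDecodeError:
            result.append(name)  # undecodable: hand back the bytes
    return result
```

Callers that only ever create well-encoded names never see a byte string; callers that hit a foreign Latin-1 name still get something they can pass back to open().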
You have probably already realised, but Windows 9x will also need a Unicode preserving listdir but it will have to encode using mbcs.
Exactly. Unfortunately, we cannot do anything to avoid replacement characters here, since it is already Windows who will introduce them. In turn, we know that decoding from "mbcs" will always succeed. Regards, Martin
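The "decoding from mbcs always succeeds" property holds because Windows substitutes a default character for any byte sequence the ANSI code page cannot interpret, so the decode step itself cannot raise. Since the mbcs codec only exists on Windows, a portable stand-in using a complete single-byte codec like latin-1 shows the idea:

```python
# Every possible byte value decodes without error under a complete
# single-byte codec, so the widening step can never fail.
raw = bytes(range(256))
text = raw.decode("latin-1")
assert len(text) == 256
assert text.encode("latin-1") == raw  # lossless round trip
```

For mbcs itself the round trip is not guaranteed to be lossless (that is exactly the replacement-character problem), but the decode direction never raises.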
participants (6)

- Guido van Rossum
- Jack Jansen
- M.-A. Lemburg
- Mark Hammond
- Martin v. Loewis
- Neil Hodgson