Martin v. Löwis wrote:
Guido van Rossum wrote:
Ah, sigh. I didn't know that os.listdir() behaves differently when the argument is Unicode. Does os.listdir(".") really behave differently than os.listdir(u".")? Bah! I don't think that's a very good design (although I see where it comes from). Promoting only those entries that need it seems the right solution.
Unfortunately, this solution is hard to implement (I don't know whether it can be implemented correctly at all; at least on Windows, I see no way to implement it efficiently).
Here are a number of problems/questions:
- On Windows, should listdir use the narrow or the wide API? Obviously the wide API, since it is not Python which returns the question marks, but the Windows API.
Right.
- But then, the wide API gives all results as Unicode. If you want to promote only those entries that need it, it really means that you only want to "demote" those that don't need it. But how can you tell whether an entry needs it? There is no API to find out. You could declare that anything with characters >128 needs it, but that would be an incompatible change: if a character >128 in the system code page is in a file name, listdir currently returns it in the system code page. It would then return a Unicode string. Applications relying on the old behaviour would break.
We will need a Python C API that returns:
* a string if the Unicode value is representable in the default encoding (usually ASCII)
* Unicode if it is not

The file system encoding should be hidden in the OS layer (e.g. posixmodule). Python should only return strings with the default encoding and Unicode otherwise. See my suggestion to Neil about making the transition to this new strategy less painful.
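To make the proposed rule concrete, here is a minimal Python-level sketch. It is only an illustration, not the C API being discussed: the function name `promote_if_needed` is hypothetical, and the sketch is written for Python 3, where `bytes` stands in for the 2.x plain string and `str` stands in for the 2.x Unicode type.

```python
def promote_if_needed(name, default_encoding="ascii"):
    """Return a byte string if `name` fits the default encoding, else the text unchanged.

    Hypothetical helper: `bytes` plays the role of the 2.x plain string,
    `str` plays the role of the 2.x unicode object.
    """
    try:
        # Representable in the default encoding: hand back a plain string.
        return name.encode(default_encoding)
    except UnicodeEncodeError:
        # Not representable: keep it as Unicode.
        return name


# Example directory entries, as the wide API would deliver them (all Unicode).
entries = ["README.txt", "L\u00f6wis.txt"]
results = [promote_if_needed(entry) for entry in entries]
```

With the default encoding left at ASCII, the first entry comes back as a plain byte string and the second stays Unicode, so callers see a mixed list, which is exactly the behaviour the proposal accepts.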
- On Unix, all file names come out as byte strings. Again, how do you know which ones to promote, and using what encoding? Python currently guesses an encoding, but that may or may not be the one intended for the file name.
This is a tough one: AFAIK the file system encoding in Unix was never really specified; in fact, most file systems just stored the names as-is without any encoding information attached to them. Things are moving in the direction of using UTF-8 for filenames, though.

To solve this issue, various applications have come up with ways around the problem, e.g. GTK uses the following strategy to find the encoding (in the given order and adjustable using an environment variable):

1. locale based encoding, if given (UTF-8 on most modern Unixes)
2. UTF-8
3. Latin-1
4. CP1252 (Windows Latin-1 version)

Perhaps we should add similar support to Python?

We should probably use a file system encoding default of Latin-1 on Unix if no other information can be found. That way we will ensure that things don't change on Unix unless explicitly set up by the user (Latin-1 is round-trip safe when converting it to Unicode and back). os.listdir() would then continue to return plain strings and file() will open them just as it does now.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Jul 15 2005)
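The encoding search order listed in the reply above (locale, UTF-8, Latin-1, CP1252) can be sketched roughly as follows. This is an illustration under assumptions, not GTK's or Python's actual implementation: `decode_filename` is a hypothetical helper, and the sketch uses Python 3, where raw file names are `bytes`.

```python
import locale


def decode_filename(raw, candidates=None):
    """Decode a raw file name by trying candidate encodings in order.

    Hypothetical helper. Returns a (text, encoding_used) pair.
    """
    if candidates is None:
        # Order taken from the list in the message above.
        candidates = [
            locale.getpreferredencoding(False),  # locale based encoding
            "utf-8",
            "latin-1",
            "cp1252",
        ]
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Bytes are invalid in this encoding, or the codec is unknown.
            continue
    raise ValueError("no candidate encoding could decode %r" % (raw,))


# Latin-1 maps each of the 256 byte values to one code point,
# so decoding and re-encoding returns the original bytes.
all_bytes = bytes(range(256))
round_tripped = all_bytes.decode("latin-1").encode("latin-1")
```

Because Latin-1 accepts every byte sequence, it never fails as a fallback, which is what makes it a safe default; it also means that the CP1252 entry after it is only reached if the candidate order is changed.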