import glob
filename = u"gröt"
file = open(filename, "w")
file.close()
print glob.glob(u"gr*")
[u'gr\366t']
hmm. </F>
Fredrik Lundh wrote:
Where is the problem? If you pass the output of glob() to open(), you'll get the same file in both cases... even better, you can now even use Chinese in your filenames without the OS having to support Unicode filenames :-)

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Fredrik Lundh writes:
Ummm... since I'm not sure how open() currently reacts to being passed a Unicode filename, or whether there's something special in open() for Windows, and don't know how you think it should react (an exception? fold to UTF-8? fold to Latin1?), I don't see what the particular problem is either.

For the sake of people who haven't followed this debate closely, or who were busy during the earlier lengthy threads and simply deleted most of the messages, please try to be explicit. Ilya Zakharevich on the perl5-porters mailing list often employs the "This code is buggy and if you're too clueless to see how it's broken *I* certainly won't go explaining it to you" strategy, to devastatingly divisive effect, and with little effectiveness in getting the bugs fixed. Let's not go down that road.

--amk
You're asking the file system to "find you a filename". Depending on how you ask, you get two different file names for the same file. They are "==" equal (I think) but are of different length. I agree with /F that it's a little strange.

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only communication coin we can count on.
- http://www.cs.yale.edu/~perlis-alan/quotes.html
I presume that Fredrik's gripe is that the filename has been converted to UTF-8, while the encoding used by Windows to display his directory listing is Latin-1. (Not Microsoft's own 8-bit character set???)

I'd like to solve this problem, but I have some questions: what *IS* the encoding used for filenames on Windows? This may differ per Windows version; perhaps it can differ per drive letter? Or per application or per thread?

On Windows NT, filenames are supposed to be Unicode. (I suppose also on Windows 2000?) How do I open a file with a given Unicode string for its name, in a C program? I suppose there's a Win32 API call for that which has a Unicode variant.

On Windows 95/98, the Unicode variants of the Win32 API calls don't exist. So what is the poor Python runtime to do there?

Can Japanese people use Japanese characters in filenames on Windows 95/98? Let's assume they can. Since the filesystem isn't Unicode aware, the filenames must be encoded. Which encoding is used? Let's assume they use Microsoft's multibyte encoding. If they put such a file on a floppy and ship it to Linköping, what will Fredrik see as the filename? (I.e., is the encoding fixed by the disk volume, or by the operating system?)

Once we have a few answers here, we can solve the problem. Note that sometimes we'll have to refuse a Unicode filename because there's no mapping for some of the characters it contains in the filename encoding used.

Question: how does Fredrik create a file with a Euro character (u'\u20ac') in its name?

--Guido van Rossum (home page: http://www.python.org/~guido/)
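As a rough illustration of the mismatch presumed above (a sketch in present-day Python 3, not the interpreter behaviour being debated in this thread): the same name has different byte forms under UTF-8 and Latin-1, which would account both for the differing lengths and for the \366 in the glob() output.

```python
# Sketch only: compares two byte encodings of the same name.
name = "gr\u00f6t"  # "gröt"

utf8_bytes = name.encode("utf-8")         # the "ö" becomes two bytes
latin1_bytes = name.encode("iso-8859-1")  # the "ö" becomes one byte

print(utf8_bytes)     # b'gr\xc3\xb6t'
print(latin1_bytes)   # b'gr\xf6t'
print(len(utf8_bytes), len(latin1_bytes))  # 5 4

# Octal 366 (as printed by glob above) is the same value as hex F6.
assert 0o366 == 0xF6
```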
On Thu, 27 Apr 2000 11:23:50 -0400, you wrote:
[This is just for inspiration]

JDK "solves" this by running the filename through a CharToByteConverter (a codec) which is set up as the default encoding used for the platform. On my Danish w2k this encoding happens to be called 'Cp1252'.

The codec name is chosen based on the user's language and region, with fallback to Cp1252. The mapping table is:

    "ar", "Cp1256",
    "be", "Cp1251",
    "bg", "Cp1251",
    "cs", "Cp1250",
    "el", "Cp1253",
    "et", "Cp1257",
    "iw", "Cp1255",
    "hu", "Cp1250",
    "ja", "MS932",
    "ko", "MS949",
    "lt", "Cp1257",
    "lv", "Cp1257",
    "mk", "Cp1251",
    "pl", "Cp1250",
    "ro", "Cp1250",
    "ru", "Cp1251",
    "sh", "Cp1250",
    "sk", "Cp1250",
    "sl", "Cp1250",
    "sq", "Cp1250",
    "sr", "Cp1251",
    "th", "MS874",
    "tr", "Cp1254",
    "uk", "Cp1251",
    "zh", "GBK",
    "zh_TW", "MS950",
JDK only uses GetThreadLocale() for the starting thread. It does not appear to check for Windows versions at all.
The JDK does not make use of the Unicode API if it exists on the platform.
JDK silently replaced the offending character with a '?', which caused an exception when attempting to open the file: "The filename, directory name, or volume label syntax is incorrect"
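The silent '?' substitution described above can be imitated with a Python codec's "replace" error handler. This is only a sketch of the effect, not the JDK's code; the choice of a Japanese character and of cp1252 as the target is an assumption made for illustration.

```python
# Sketch only: a lossy filename conversion, assuming a cp1252 platform
# encoding and a character (U+65E5) that cp1252 cannot represent.
name = "\u65e5.txt"

lossy = name.encode("cp1252", errors="replace")
print(lossy)  # b'?.txt'

# '?' is a wildcard character and is not permitted in Windows file names,
# which fits the "syntax is incorrect" error quoted above.
assert b"?" in lossy
```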
Question: how does Fredrik create a file with a Euro character (u'\u20ac') in its name?
import java.io.*;

public class x {
    public static void main(String[] args) throws Exception {
        String filename = "An eurosign \u20ac";
        System.out.println(filename);
        new FileOutputStream(filename).close();
    }
}

The resulting file name contains a euro sign when shown in FileExplorer. The output of the program also contains a euro sign when shown with notepad. But the filename/program output does *not* contain a euro when dir'ed/type'd in my DOS box.

regards,
finn
[Guido asks good questions about how Windows deals w/ Unicode filenames, last Thursday, but gets no answers]
I just thought I'd repeat the questions <wink>. However, I don't think you'll really want the answers -- Windows is a legacy-encrusted mess, and there are always many ways to get a thing done in the end. For example ...
Question: how does Fredrik create a file with a Euro character (u'\u20ac') in its name?
This particular one is shallower than you were hoping: in many of the TrueType fonts (e.g., Courier New but not Courier), Windows extended its Latin-1 encoding by mapping the Euro symbol to the "control character" 0x80. So I can get a Euro symbol into a file name just by typing Alt+0+1+2+8. This is true even on US Win98 (which has no visible Unicode support) -- but was not supported in US Win95.

i've-been-tracking-down-what-appears-to-be-a-hw-bug-on-a-japanese-laptop-at-work-so-can-verify-ms-sure-got-japanese-characters-into-the-filenames-somehow-but-doubt-it's-via-unicode-ly y'rs - tim
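The 0x80 mapping mentioned above is visible in Python's cp1252 codec. A small sketch follows (present-day Python 3; it says nothing about which fonts or Windows versions can display the glyph):

```python
# Sketch only: where the Euro sign lives in cp1252 versus ISO 8859-1.
euro = "\u20ac"

print(euro.encode("cp1252"))  # b'\x80'

# ISO 8859-1 has no Euro sign, so a strict encode fails ...
try:
    euro.encode("iso-8859-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False
print(latin1_has_euro)  # False

# ... and the byte 0x80 read as ISO 8859-1 is a C1 control character.
print(repr(b"\x80".decode("iso-8859-1")))  # '\x80'
```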
Tim Peters wrote:
[Guido asks good questions about how Windows deals w/ Unicode filenames, last Thursday, but gets no answers]
you missed Finn Bock's post on how Java does it.

here's another data point:

Tcl uses a system encoding to convert from unicode to a suitable system API encoding, and uses the following approach to figure out what that one is:

    windows NT/2000:
        unicode (use wide api)

    windows 95/98:
        "cp%d" % GetACP()
        (note that this is "cp1252" in us and western europe, not "iso-8859-1")

    macintosh:
        determine encoding for fontId 0 based on (script, smScriptLanguage) tuple. if that fails, assume "macroman"

    unix:
        figure out the locale from LC_ALL, LC_CTYPE, or LANG. use heuristics to map from the locale to an encoding (see unix/tclUnixInit). if that fails, assume "iso-8859-1"

I propose adding a similar mechanism to Python, along these lines:

    sys.getdefaultencoding() returns the right thing for windows and macintosh, "iso-8859-1" for other platforms.

    sys.setencoding(codec) changes the system encoding. it's used from site.py to set things up properly on unix and other non-unicode platforms.

</F>
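The Tcl-style lookup listed above could be sketched roughly as follows. Everything here is hypothetical: the function name, the platform labels, and the parameters are invented for illustration; the unix branch only reads an explicit ".codeset" suffix rather than Tcl's real heuristics; and the macintosh branch is left out.

```python
def guess_system_encoding(platform, acp=None, environ=None):
    """Hypothetical helper mirroring the platform table above.

    platform: "winnt", "win9x" or "unix" (labels invented for this sketch)
    acp:      the value GetACP() would return, supplied by the caller
    environ:  a mapping of environment variables
    """
    environ = environ or {}
    if platform == "winnt":
        return "unicode"  # wide API: no byte encoding needed
    if platform == "win9x":
        if acp is None:
            raise ValueError("an ANSI code page number is required")
        return "cp%d" % acp
    if platform == "unix":
        # Use the first locale variable that is set, in the order given above.
        for var in ("LC_ALL", "LC_CTYPE", "LANG"):
            value = environ.get(var)
            if value:
                # Simplification: accept only "lang_REGION.codeset[@modifier]".
                if "." in value:
                    codeset = value.split(".", 1)[1].split("@", 1)[0]
                    if codeset:
                        return codeset.lower()
                break
        return "iso-8859-1"  # fallback named in the list above
    raise ValueError("unknown platform label: %r" % (platform,))


print(guess_system_encoding("win9x", acp=1252))
print(guess_system_encoding("unix", environ={"LANG": "sv_SE.ISO8859-1"}))
print(guess_system_encoding("unix", environ={}))
```

Passing the code page and environment in as arguments keeps the sketch testable on any machine; a real implementation would query the operating system instead.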
It's decided by each file system. For FAT file systems, the OEM code page is used. The OEM code page generally used in the United States is code page 437, which is different from the code page Windows uses for display. I had to deal with this in a system where people used fractions (1/4, 1/2 and 3/4) as part of names which had to be converted into valid file names. For example, 1/4 is 0xBC for display but 0xAC when used in a file name.

In Japan, I think different manufacturers used different encodings, with NEC trying to maintain market control with their own encoding.

VFAT stores both Unicode long file names and shortened aliases. However, the Unicode variant is hard to get to from Windows 95/98.

NTFS stores Unicode.
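The fraction example above can be checked against Python's codec tables. A short sketch, assuming cp1252 as the display (ANSI) code page and cp437 as the OEM code page, as in the US setup described:

```python
# Sketch only: byte values of two fraction characters in the ANSI and OEM
# code pages (cp1252 and cp437 are the assumed pair).
fractions = {"1/4": "\u00bc", "1/2": "\u00bd"}

for label, ch in fractions.items():
    ansi = ch.encode("cp1252")[0]
    oem = ch.encode("cp437")[0]
    print("%s: display 0x%02X, file name 0x%02X" % (label, ansi, oem))
```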
On Windows 95/98, the Unicode variants of the Win32 API calls don't exist. So what is the poor Python runtime to do there?
Fail the call. All existing files can be opened because they have short non-Unicode aliases. If a file with a Unicode name cannot be created because the OS doesn't support it, then you should give up. Just as you should give up if you try to save a file with a name that includes a character not allowed by the file system.
Can Japanese people use Japanese characters in filenames on Windows 95/98?
Yes.
If Fredrik is running a non-Japanese version of Windows 9x, he will see some 'random' western characters replacing the Japanese.

Neil
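The 'random' western characters can be reproduced by decoding Japanese bytes with a western code page. In this sketch, cp932 (the "MS932" of the JDK table earlier in the thread) and cp1252 are assumed code pages; which ones a particular Windows 9x machine would actually apply is not settled here.

```python
# Sketch only: a Japanese name written under cp932, read back as cp1252.
japanese = "\u65e5\u672c\u8a9e"  # "日本語"

raw = japanese.encode("cp932")
print(raw.hex())  # 93fa967b8cea

seen_in_the_west = raw.decode("cp1252")
print(seen_in_the_west)  # “ú–{Œê
```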
Neil Hodgson wrote:
It's decided by each file system.
...but the system API translates from the active code page to the encoding used by the file system, right? on my w95 box, GetACP() returns 1252, and GetOEMCP() returns 850. if I create a file with a name containing latin-1 characters, on a FAT drive, it shows up correctly in the file browser (cp1252), and also shows up correctly in the MS-DOS window (under cp850). if I print the same filename to stdout in the same DOS window, I get gibberish.
...if you fail to convert from unicode to the local code page. </F>
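The stdout gibberish reported above is what one would expect if bytes in the ANSI code page reach a console that renders them in the OEM code page. A sketch using the two code pages reported by GetACP() and GetOEMCP() on that machine (1252 and 850); the mechanism itself is an assumption, not something confirmed in the thread:

```python
# Sketch only: the same name under the ANSI and OEM code pages, and what
# ANSI bytes look like when rendered as OEM.
name = "gr\u00f6t"  # "gröt"

ansi_bytes = name.encode("cp1252")
oem_bytes = name.encode("cp850")
print(ansi_bytes)  # b'gr\xf6t'
print(oem_bytes)   # b'gr\x94t'

# ANSI bytes interpreted by a cp850 console:
gibberish = ansi_bytes.decode("cp850")
print(gibberish)   # gr÷t
```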
...but the system API translates from the active code page to the encoding used by the file system, right?
Yes, although I think that wasn't the case with Win16 and there are still some situations in which you have to deal with the differences. Copying a file from the console on Windows 95 to a FAT volume appears to allow use of the OEM character set with no conversion.
Do you have a FAT drive or a VFAT drive? If you format as FAT on 9x or NT you will get a VFAT volume.

Neil
BTW, MS's use of code pages is full of shit. Yesterday I was spell-checking a document that had the name Andre in it (the accent was missing). The popup menu suggested Andr* where the * was an upper case slashed O. I first thought this was because the menu character set might be using a different code page, but no -- it must have been bad in the database, because selecting that entry from the menu actually inserted the slashed O character. So they must have been maintaining their database with a different code page.

Just to indicate that when we sort out the rest of the Unicode debate (which I'm sure we will :-) there will still be surprises on Windows...

--Guido van Rossum (home page: http://www.python.org/~guido/)
participants (9)
- Andrew Kuchling
- bckfnn@worldonline.dk
- Fredrik Lundh
- Fredrik Lundh
- Guido van Rossum
- M.-A. Lemburg
- Neil Hodgson
- Paul Prescod
- Tim Peters