I understand the issue of "default Unicode encoding" is a loaded one, however I believe with the Windows file system we may be able to use a default.

Windows provides 2 versions of many functions that accept "strings" - one that uses "char *" arguments, and another using "wchar *" for Unicode. Interestingly, the "char *" versions of the functions almost always support "mbcs" encoded strings.

To make Python work nicely with the file system, we really should handle Unicode characters somehow. It is not too uncommon to find the "program files" or the "user" directory have Unicode characters in non-English versions of Win2k.

The way I see it, to fix this we have 2 basic choices when a Unicode object is passed as a filename:

* we call the Unicode versions of the CRTL.
* we auto-encode using the "mbcs" encoding, and still call the non-Unicode versions of the CRTL.

The first option has a problem in that determining what Unicode support Windows 95/98 have may be more trouble than it is worth. Sticking to purely ASCII versions of the functions means that the worst thing that can happen is we get a regular file-system error if an mbcs encoded string is passed on a non-Unicode platform.

Does anyone have any objections to this scheme or see any drawbacks in it? If not, I'll knock up a patch...

Mark.
Mark Hammond wrote:
I understand the issue of "default Unicode encoding" is a loaded one, however I believe with the Windows file system we may be able to use a default.
Windows provides 2 versions of many functions that accept "strings" - one that uses "char *" arguments, and another using "wchar *" for Unicode. Interestingly, the "char *" versions of the functions almost always support "mbcs" encoded strings.
To make Python work nicely with the file system, we really should handle Unicode characters somehow. It is not too uncommon to find the "program files" or the "user" directory have Unicode characters in non-English versions of Win2k.
The way I see it, to fix this we have 2 basic choices when a Unicode object is passed as a filename:
* we call the Unicode versions of the CRTL.
* we auto-encode using the "mbcs" encoding, and still call the non-Unicode versions of the CRTL.
The first option has a problem in that determining what Unicode support Windows 95/98 have may be more trouble than it is worth. Sticking to purely ASCII versions of the functions means that the worst thing that can happen is we get a regular file-system error if an mbcs encoded string is passed on a non-Unicode platform.
Does anyone have any objections to this scheme or see any drawbacks in it? If not, I'll knock up a patch...
Hmm... the problem with MBCS is that it is not one encoding, but can be many things.

I don't know if this is an issue (can there be more than one encoding per process ? is the encoding a user or system setting ? does the CRT know which encoding to use/assume ?), but the Unicode approach sure sounds a lot safer.

Also, what would os.listdir() return ? Unicode strings or 8-bit strings ?

-- Marc-Andre Lemburg
Hmm... the problem with MBCS is that it is not one encoding, but can be many things.
Yeah, but I think specifically with filenames this is OK. We would be translating from Unicode objects using MBCS in the knowledge that somewhere in the Win32 maze they will be converted back to Unicode, using MBCS, to access the Unicode based filesystem.

At the moment, you just get an exception - the dreaded "ASCII encoding error: ordinal not in range(128)" :)

I don't see the harm - we are making no assumptions about the user's data, just about the platform. Note that I never want to assume a string object is in a particular encoding - just assume that the CRTL file functions can handle a particular encoding for their "filename" parameter. I don't want to handle Unicode objects in any "data" params, just the "filename".

Mark.
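A minimal sketch of the behaviour being proposed, assuming a Windows build where the "mbcs" codec is available (the filename below is purely illustrative):

    # What must currently be written by hand on Windows:
    name = u'caf\xe9.txt'            # a Unicode filename
    f = open(name.encode('mbcs'))    # encode via the ANSI code page

    # What the proposal would allow, doing the same encode implicitly:
    # f = open(name)   # today this raises the UnicodeError quoted above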
Sorry, I notice I didn't answer your specific question:
Also, what would os.listdir() return ? Unicode strings or 8-bit strings ?
This would not change. This is what my testing shows:

* I can switch to a German locale, and create a file using the keystrokes "`atest`o". The "`" is the dead-char so I get an umlaut over the first and last characters.
* os.listdir() returns '\xe0test\xf2' for this file.
* That same string can be passed to "open" etc to open the file.
* The only way to get that string to a Unicode object is to use the encodings "Latin1" or "mbcs". Of them, "mbcs" would have to be safer, as at least it has a hope of handling non-latin characters :)

So - assume I am passed a Unicode object that represents this filename. At the moment we simply throw that exception if we pass that Unicode object to open(). I am proposing that "mbcs" be used in this case instead of the default "ascii".

If nothing else, my idea could be considered a "short-term" solution. If ever it is found to be a problem, we can simply move to the unicode APIs, and nothing would break - just possibly more things _would_ work :)

Mark.
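A rough interactive reproduction of the test above, assuming a Windows Python running under the German locale described (the byte values are the ones reported in this message):

    >>> import os
    >>> os.listdir('.')
    ['\xe0test\xf2']
    >>> f = open('\xe0test\xf2')           # the 8-bit name round-trips
    >>> unicode('\xe0test\xf2', 'mbcs')    # decoding with "mbcs" works
    u'\xe0test\xf2'
    >>> open(u'\xe0test\xf2')              # today this fails with:
    UnicodeError: ASCII encoding error: ordinal not in range(128)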
Mark Hammond wrote:
Sorry, I notice I didn't answer your specific question:
Also, what would os.listdir() return ? Unicode strings or 8-bit strings ?
This would not change.
This is what my testing shows:
* I can switch to a German locale, and create a file using the keystrokes "`atest`o". The "`" is the dead-char so I get an umlaut over the first and last characters.
* os.listdir() returns '\xe0test\xf2' for this file.
* That same string can be passed to "open" etc to open the file.
* The only way to get that string to a Unicode object is to use the encodings "Latin1" or "mbcs". Of them, "mbcs" would have to be safer, as at least it has a hope of handling non-latin characters :)
So - assume I am passed a Unicode object that represents this filename. At the moment we simply throw that exception if we pass that Unicode object to open(). I am proposing that "mbcs" be used in this case instead of the default "ascii".
If nothing else, my idea could be considered a "short-term" solution. If ever it is found to be a problem, we can simply move to the unicode APIs, and nothing would break - just possibly more things _would_ work :)
Sounds like a good idea. We'd only have to assure that whatever os.listdir() returns can actually be used to open the file, but that seems to be the case... at least for Latin-1 chars (I wonder how well this behaves with Japanese chars).

-- Marc-Andre Lemburg
Also, what would os.listdir() return ? Unicode strings or 8-bit strings ?
This would not change.
This is what my testing shows:
* I can switch to a German locale, and create a file using the keystrokes "`atest`o". The "`" is the dead-char so I get an umlaut over the first and last characters.
(Actually, grave accents, but I'm sure that to Aussie eyes, as to Americans, they's all Greek. :-)
* os.listdir() returns '\xe0test\xf2' for this file.
I don't understand. This is a Latin-1 string. Can you explain again how the MBCS encoding encodes characters outside the Latin-1 range?
* That same string can be passed to "open" etc to open the file.
* The only way to get that string to a Unicode object is to use the encodings "Latin1" or "mbcs". Of them, "mbcs" would have to be safer, as at least it has a hope of handling non-latin characters :)
So - assume I am passed a Unicode object that represents this filename. At the moment we simply throw that exception if we pass that Unicode object to open(). I am proposing that "mbcs" be used in this case instead of the default "ascii".
If nothing else, my idea could be considered a "short-term" solution. If ever it is found to be a problem, we can simply move to the unicode APIs, and nothing would break - just possibly more things _would_ work :)
I have one more question. The plan looks decent, but I don't know the scope. Which calls do you plan to fix?

--Guido van Rossum (home page: http://www.python.org/~guido/)
[Mark Hammond]
* os.listdir() returns '\xe0test\xf2' for this file.
[Guido]
I don't understand. This is a Latin-1 string. Can you explain again how the MBCS encoding encodes characters outside the Latin-1 range?
I expect this is a coincidence. MBCS is a generic term for a large number of distinct variable-length encoding schemes, one or more specific to each language. Latin-1 is a subset of some MBCS schemes, but not of others; Mark was using a German mblocale, right? Across MS's set of MBCS schemes, there's little consistency: a one-byte encoding in one of them may well be a "lead byte" (== the first byte of a two-byte encoding) in another.

All this stuff is hidden under layers of macros so general that, if you code it right, you can switch between compiling MBCS code on Win95 and Unicode code on NT via setting one compiler #define. Or that's what they advertise. The multi-lingual Windows app developers at my previous employer were all bald despite being no older than 23 <wink>.

ascii-boy-ly y'rs - tim
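To make the "lead byte" point concrete, here is a hypothetical example for a Japanese system, where the ANSI code page (932) is a true multi-byte encoding: a byte that is an ordinary one-byte character in a western code page acts there as the first half of a two-byte character.

    # Illustrative only - requires a Japanese Windows system:
    s = '\x83A'             # two bytes: lead byte 0x83 followed by 'A'
    u = unicode(s, 'mbcs')  # decodes to ONE character, u'\u30a2'
                            # (KATAKANA LETTER A), not two Latin-1 chars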
I have submitted patch #410465 for this.

http://sourceforge.net/tracker/?func=detail&aid=410465&group_id=5470&atid=305470

Comments are in the patch, so I won't repeat them here, but I would appreciate a few reviews on the code. Particularly, my addition of a new format to PyArg_ParseTuple and the resulting extra string copy may raise a few eyebrows.

I've even managed to include the new test file and its output in the patch, so it will hopefully apply cleanly and run a full test if you want to try it.

Thanks,

Mark.
Now that 2.1 is out the door, how do we feel about getting these Unicode changes in?

Mark.
-----Original Message-----
From: Mark Hammond [mailto:MarkH@ActiveState.com]
Sent: Thursday, 22 March 2001 4:16 PM
To: python-dev@python.org
Subject: RE: [Python-Dev] Unicode and the Windows file system.
I have submitted patch #410465 for this.
http://sourceforge.net/tracker/?func=detail&aid=410465&group_id=5470&atid=305470
Comments are in the patch, so I won't repeat them here, but I would appreciate a few reviews on the code. Particularly, my addition of a new format to PyArg_ParseTuple and the resulting extra string copy may raise a few eyebrows.
I've even managed to include the new test file and its output in the patch, so it will hopefully apply cleanly and run a full test if you want to try it.
Thanks,
Mark.
Now that 2.1 is out the door, how do we feel about getting these Unicode changes in?
http://sourceforge.net/tracker/?func=detail&aid=410465&group_id=5470&atid=305470
No problem for me, although the context-sensitive semantics of the MBCS encoding still elude me. (Who cares, it's Windows. :-)

Are you & MAL capable of sorting this out? Do you want me to add a +1 comment to the tracker?

--Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
Now that 2.1 is out the door, how do we feel about getting these Unicode changes in?
http://sourceforge.net/tracker/?func=detail&aid=410465&group_id=5470&atid=305470
No problem for me, although the context-sensitive semantics of the MBCS encoding still elude me. (Who cares, it's Windows. :-)
Are you & MAL capable of sorting this out? Do you want me to add a +1 comment to the tracker?
I'll take care of the parser marker stuff and Mark can do the rest ;-)

-- Marc-Andre Lemburg
Mark Hammond:
To make Python work nicely with the file system, we really should handle Unicode characters somehow. It is not too uncommon to find the "program files" or the "user" directory have Unicode characters in non-English versions of Win2k.
The "program files" and "user" directory should still have names representable in the normal locale used by the user so they are able to access them by using their standard encoding in a Python narrow character string to the open function.
The way I see it, to fix this we have 2 basic choices when a Unicode object is passed as a filename:

* we call the Unicode versions of the CRTL.
This is by far the better approach IMO as it is more general and will work for people who switch locales or who want to access files created by others using other locales. Although you can always use the horrid mangled "*~1" names.
* we auto-encode using the "mbcs" encoding, and still call the non-Unicode versions of the CRTL.
This will improve things but to a lesser extent than the above. May be the best possible on 95.
The first option has a problem in that determining what Unicode support Windows 95/98 have may be more trouble than it is worth.
None of the *W file calls are listed as supported by 95 although Unicode file names can certainly be used on FAT partitions.
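For what it's worth, a rough runtime check for this, assuming Mark's win32 extensions are installed (the constant is VER_PLATFORM_WIN32_NT from winbase.h; 95/98 report platform id 1):

    import win32api

    VER_PLATFORM_WIN32_NT = 2
    # GetVersionEx() returns (major, minor, build, platformId, text)
    is_nt = win32api.GetVersionEx()[3] == VER_PLATFORM_WIN32_NT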
* I can switch to a German locale, and create a file using the keystrokes "`atest`o". The "`" is the dead-char so I get an umlaut over the first and last characters.
It's more fun playing with a non-Roman locale, and one that doesn't fit in the normal Windows code page, for this sort of problem. Russian is reasonably readable for us English speakers.

M.-A. Lemburg:
I don't know if this is an issue (can there be more than one encoding per process ?
There is an input locale and keyboard layout per thread.
is the encoding a user or system setting ?
There are system defaults and a menu through which you can change the locale whenever you want.
Also, what would os.listdir() return ? Unicode strings or 8-bit strings ?
There is the Windows approach of having an os.listdirW() ;).

Neil
Hi Neil!
The "program files" and "user" directory should still have names
"should" or "will"?
representable in the normal locale used by the user, so they are able to access them by using their standard encoding in a Python narrow character string to the open function.
I don't understand what "their standard encoding" is here. My understanding is that "their standard encoding" is whatever WideCharToMultiByte() returns, and this is what mbcs is. My understanding is that their "default encoding" will bear no relationship to encoding names as known by Python. That is, given a user's locale, there is no reasonable way to determine which of the Python encoding names will always correctly work on these strings.
The way I see it, to fix this we have 2 basic choices when a Unicode object is passed as a filename:

* we call the Unicode versions of the CRTL.
This is by far the better approach IMO as it is more general and will work for people who switch locales or who want to access files created by others using other locales. Although you can always use the horrid mangled "*~1" names.
* we auto-encode using the "mbcs" encoding, and still call the non-Unicode versions of the CRTL.
This will improve things but to a lesser extent than the above. May be the best possible on 95.
I understand the above, but want to resist having different NT and 9x versions of Python for obvious reasons. I also wanted to avoid determining at runtime if the platform has Unicode support and magically switching to them.

I concur on the "may be the best possible on 95" and see no real downsides on NT, other than the freak possibility of the default encoding being changed _between_ us encoding a string and the OS decoding it. Recall that my change is only to convert from Unicode to a string so the file system can convert back to Unicode. There is no real opportunity for the current locale to change on this thread during this process.

I guess I see 3 options:

1) Do nothing, thereby forcing the user to manually encode the Unicode object. Only by encoding the string can they access these filenames, which means the exact same issues apply.

2) Move to Unicode APIs where available, which will be a much deeper patch and much harder to get right on non-Unicode Windows platforms.

3) Like 1, but simply automate the encoding task.

My proposal was to do (3). It is not clear from your mail what you propose. Like me, you seem to agree (2) would be perfect in an ideal world, but you also agree we don't live in one. What is your recommendation?

Mark.
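A rough Python-level rendering of what option 3 amounts to - illustrative only, since the actual patch does this in C via the new PyArg_ParseTuple format mentioned in Mark's patch #410465 above; the helper name here is hypothetical:

    import types

    def _filename_to_8bit(path):
        # What the proposed conversion would do with a "filename"
        # argument before it reaches the narrow CRTL functions.
        if type(path) is types.UnicodeType:
            return path.encode('mbcs')  # rather than the 'ascii' default
        return path                     # 8-bit strings pass through as-is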
Morning Mark,
The "program files" and "user" directory should still have names
"should" or "will"?
Should. I originally wrote "will" but then thought of the scenario where I install W2K with Russian as the default locale. The "Program Files" directory (and other standard directories) is created with a localised name (call it "Russian PF" for now) including some characters not representable in Latin 1.

I then start working with a Python program and decide to change the input locale to German. The "Russian PF" string is representable in Unicode but not in the code page used for German, so a WideCharToMultiByte using the current code page will fail. Fail here means not that the function will error, but that a string will be constructed which will not round trip back to Unicode and thus is unlikely to be usable to open the file.
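That failure mode can be sketched in Python terms. This is hypothetical: it assumes a Windows Python whose current ANSI code page is a western one (1252), which contains no Cyrillic characters:

    # A Russian directory name, perfectly representable in Unicode:
    name = u'\u041f\u0440\u043e\u0433\u0440\u0430\u043c\u043c\u044b'
    narrow = name.encode('mbcs')     # WideCharToMultiByte substitutes the
                                     # default char, giving '?????????'
    unicode(narrow, 'mbcs') == name  # -> 0 (false): the round trip is lost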
representable in the normal locale used by the user so they are able to access them by using their standard encoding in a Python narrow character string to the open function.
I don't understand what "their standard encoding" is here. My understanding is that "their standard encoding" is whatever WideCharToMultiByte() returns, and this is what mbcs is.
WideCharToMultiByte has an explicit code page parameter so it's the caller that has to know what they want. The most common thing to do is ask the system for the input locale and use this in the call to WideCharToMultiByte, and there are some CRT functions like wcstombs that wrap this. Passing CP_THREAD_ACP to WideCharToMultiByte is another way. Scintilla uses:

    static int InputCodePage() {
        HKL inputLocale = ::GetKeyboardLayout(0);
        LANGID inputLang = LOWORD(inputLocale);
        char sCodePage[10];
        int res = ::GetLocaleInfo(MAKELCID(inputLang, SORT_DEFAULT),
                                  LOCALE_IDEFAULTANSICODEPAGE,
                                  sCodePage, sizeof(sCodePage));
        if (!res)
            return 0;
        return atoi(sCodePage);
    }

which is the result of reading various articles from MSDN and MSJ. microsoft.public.win32.programmer.international is the news group for this and Michael Kaplan answers a lot of these sorts of questions.
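Note the distinction: the snippet above derives the code page of the per-thread input locale, while Python's "mbcs" codec (and the narrow *A system calls) use the system-wide ANSI code page. Assuming Mark's win32 extensions are installed, the latter can be inspected directly:

    import win32api

    # GetACP() wraps the Win32 API of the same name and returns the
    # system ANSI code page, e.g. 1252 on western-European systems.
    print win32api.GetACP()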
My understanding is that their "default encoding" will bear no relationship to encoding names as known by Python. That is, given a user's locale, there is no reasonable way to determine which of the Python encoding names will always correctly work on these strings.
Uncertain. There should be a way to get the input locale as a Python encoding name or working on these sorts of issues will be difficult.
Recall that my change is only to convert from Unicode to a string so the file system can convert back to Unicode. There is no real opportunity for the current locale to change on this thread during this process.
But the Unicode string may be non-representable using the current locale. So doing the conversion makes the string unusable.

My proposal was to do (3). It is not clear from your mail what you propose.
Like me, you seem to agree (2) would be perfect in an ideal world, but you also agree we don't live in one.
I'd prefer (2). Support Unicode well on the platforms that support it well. Providing some help on 95 is nice but not IMO as important.

Neil
participants (5):

- Guido van Rossum
- M.-A. Lemburg
- Mark Hammond
- Neil Hodgson
- Tim Peters