
Mark Hammond wrote:
I understand the issue of "default Unicode encoding" is a loaded one; however, I believe that with the Windows file system we may be able to use a default.
Windows provides two versions of many functions that accept "strings": one that takes "char *" arguments, and another that takes "wchar *" for Unicode. Interestingly, the "char *" versions of these functions almost always support "mbcs" encoded strings.
To make Python work nicely with the file system, we really should handle Unicode characters somehow. It is not too uncommon to find that the "program files" or the "user" directory has Unicode characters in non-English versions of Win2k.
The way I see it, to fix this we have 2 basic choices when a Unicode object is passed as a filename:
* we call the Unicode versions of the CRTL.
* we auto-encode using the "mbcs" encoding, and still call the non-Unicode versions of the CRTL.
The first option has a problem in that determining what Unicode support Windows 95/98 have may be more trouble than it is worth. Sticking to the purely ASCII versions of the functions (the second option) means that the worst thing that can happen is that we get a regular file-system error if an mbcs encoded string is passed on a non-Unicode platform.
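A minimal sketch of the second option (auto-encoding), written in present-day Python purely for illustration. The helper name filename_to_bytes is hypothetical and not part of any proposed patch. The "mbcs" codec exists only on Windows, so the sketch falls back to sys.getfilesystemencoding() elsewhere to stay runnable.

```python
import sys


def filename_to_bytes(name, encoding=None):
    """Return `name` as bytes suitable for a narrow ("char *") file API.

    Hypothetical helper for illustration only.
    """
    if isinstance(name, bytes):
        # Already an 8-bit string: pass it through unchanged.
        return name
    if encoding is None:
        # Assumption: use "mbcs" on Windows, and the interpreter's
        # file system encoding on every other platform.
        if sys.platform == "win32":
            encoding = "mbcs"
        else:
            encoding = sys.getfilesystemencoding()
    # Characters that the chosen encoding cannot represent may raise
    # UnicodeEncodeError here instead of reaching the file system.
    return name.encode(encoding)


# Plain ASCII names come out the same under any of these encodings.
print(filename_to_bytes("readme.txt"))
```

The resulting bytes would then be handed to the non-Unicode versions of the functions, which is where a regular file-system error would surface if the name cannot be resolved.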
Does anyone have any objections to this scheme or see any drawbacks in it? If not, I'll knock up a patch...
Hmm... the problem with MBCS is that it is not one encoding, but can be many things. I don't know if this is an issue (can there be more than one encoding per process? is the encoding a user or system setting? does the CRT know which encoding to use/assume?), but the Unicode approach sure sounds a lot safer.

Also, what would os.listdir() return? Unicode strings or 8-bit strings?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Pages: http://www.lemburg.com/python/
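
As a rough illustration of the concern that "mbcs" is not one encoding: the sketch below decodes the same byte under a few explicitly named Windows code pages, standing in for whatever code page "mbcs" would resolve to on a given machine. The helper name decode_under and the choice of code pages are assumptions made for the example.

```python
def decode_under(code_pages, raw):
    """Decode the same bytes under several code pages (hypothetical helper)."""
    # errors="replace" keeps the sketch from raising on bytes that are
    # invalid or incomplete in a particular code page.
    return {cp: raw.decode(cp, errors="replace") for cp in code_pages}


# One byte, as a Western European (cp1252) system would encode "é".
raw = "\u00e9".encode("cp1252")

views = decode_under(["cp1252", "cp1251", "cp1253"], raw)
for code_page, text in views.items():
    # Each code page maps the identical byte to a different character.
    print(code_page, repr(text))
```

An 8-bit filename is therefore only meaningful together with the code page it was encoded in, whereas a Unicode string carries no such dependency.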