[Python-Dev] Unicode and the Windows file system.

Neil Hodgson nhodgson@bigpond.net.au
Tue, 20 Mar 2001 09:52:34 +1100


   Morning Mark,


> >    The "program files" and "user" directory should still have names
>
> "should" or "will"?

   Should. I originally wrote "will" but then thought of the scenario where
I install W2K with Russian as the default locale. The "Program Files"
directory (and other standard directories) is created with a localised name
(call it, "Russian PF" for now) including some characters not representable
in Latin-1. I then start working with a Python program and decide to change
the input locale to German. The "Russian PF" string is representable in
Unicode but not in the code page used for German, so a WideCharToMultiByte
call using the current code page will fail. Fail here does not mean that the
function returns an error but that it constructs a string which will not
round trip back to Unicode and thus is unlikely to be usable to open the file.
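
   The failure described above can be reproduced in Python without any
Windows calls. This is only a minimal sketch under some assumptions: cp1251
and cp1252 stand in for the Russian and German ANSI code pages, the "replace"
error handler stands in for the substitution WideCharToMultiByte performs on
unmappable characters, and the helper name round_trips and the sample
directory name are made up for illustration.

```python
def round_trips(name: str, code_page: str) -> bool:
    """Return True if name survives Unicode -> code page -> Unicode."""
    # errors="replace" substitutes "?" for unmappable characters instead of
    # raising; this is an assumed stand-in for WideCharToMultiByte's default
    # behaviour, not a call into the Windows API.
    narrow = name.encode(code_page, errors="replace")
    return narrow.decode(code_page) == name


# Hypothetical localised directory name standing in for "Russian PF".
russian_pf = "Программы"

print(round_trips(russian_pf, "cp1251"))       # Russian code page
print(round_trips(russian_pf, "cp1252"))       # Western European code page
print(round_trips("Program Files", "cp1252"))  # plain ASCII name
```

   Only the second call reports a failed round trip: the narrow string it
builds consists of question marks and no longer names the directory.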

> > representable in the normal locale used by the user so they are able to
> > access them by using their standard encoding in a Python narrow character
> > string to the open function.
>
> I don't understand what "their standard encoding" is here.  My understanding
> is that "their standard encoding" is whatever WideCharToMultiByte() returns,
> and this is what mbcs is.

    WideCharToMultiByte has an explicit code page parameter, so it's the
caller that has to know what they want. The most common thing to do is ask
the system for the input locale and use this in the call to
WideCharToMultiByte. There are also some CRT functions like wcstombs that
wrap this. Passing CP_THREAD_ACP to WideCharToMultiByte is another way.
Scintilla uses:

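// Returns the ANSI code page for the current keyboard input locale,
// or 0 if it cannot be determined.
static int InputCodePage() {
 // The low word of the keyboard layout handle is the input language ID.
 HKL inputLocale = ::GetKeyboardLayout(0);
 LANGID inputLang = LOWORD(inputLocale);
 char sCodePage[10];
 // Ask for that language's default ANSI code page as a decimal string.
 int res = ::GetLocaleInfo(MAKELCID(inputLang, SORT_DEFAULT),
   LOCALE_IDEFAULTANSICODEPAGE, sCodePage, sizeof(sCodePage));
 if (!res)
  return 0;
 return atoi(sCodePage);
}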

   which is the result of reading various articles from MSDN and MSJ.
microsoft.public.win32.programmer.international is the newsgroup for this,
and Michael Kaplan answers a lot of these sorts of questions there.

> My understanding is that their "default encoding" will bear no relationship
> to encoding names as known by Python.  ie, given a user's locale, there is
> no reasonable way to determine which of the Python encoding names will
> always correctly work on these strings.

   Uncertain. There should be a way to get the input locale as a Python
encoding name, or working on these sorts of issues will be difficult.
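
   One possible shape for such a mapping is sketched below. It assumes that
Python's "cpNNNN" codec names line up with Windows code page numbers, which
holds for common pages such as 1251 and 1252 but is not guaranteed for every
page. The function name codepage_to_encoding is hypothetical, and its integer
argument is meant to be a value like the one InputCodePage returns, with 0
meaning failure.

```python
import codecs
from typing import Optional


def codepage_to_encoding(code_page: int) -> Optional[str]:
    """Map a Windows code page number to a Python codec name, if one exists."""
    # 0 is what InputCodePage returns on failure, so treat it as "unknown".
    if code_page <= 0:
        return None
    # Assumption: the matching Python codec, when present, is named "cpNNNN".
    candidate = "cp%d" % code_page
    try:
        return codecs.lookup(candidate).name
    except LookupError:
        # This Python build has no codec registered under that name.
        return None


print(codepage_to_encoding(1252))
print(codepage_to_encoding(0))
```

   A caller that gets None back has no narrow encoding it can trust and
would have to fall back to something else, such as mbcs.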

> Recall that my change is only to convert from Unicode to a string so the
> file system can convert back to Unicode.  There is no real opportunity for
> the current locale to change on this thread during this process.

   But the Unicode string may not be representable in the code page of the
current locale, in which case doing the conversion makes the string unusable.

> My proposal was to do (3).  It is not clear from your mail what you propose.
> Like me, you seem to agree (2) would be perfect in an ideal world, but you
> also agree we don't live in one.

   I'd prefer (2). Support Unicode well on the platforms that support it
well. Providing some help on Windows 95 is nice but, IMO, not as important.

   Neil