[Python-Dev] Unicode and the Windows file system.

Mark Hammond MarkH@ActiveState.com
Tue, 20 Mar 2001 08:53:29 +1100


Hi Neil!

>    The "program files" and "user" directory should still have names

"should" or "will"?

> representable in the normal locale used by the user so they are able to
> access them by using their standard encoding in a Python narrow character
> string to the open function.

I don't understand what "their standard encoding" is here.  My understanding
is that "their standard encoding" is whatever WideCharToMultiByte() returns,
and this is what mbcs is.

My understanding is that their "default encoding" will bear no relationship
to encoding names as known by Python.  I.e., given a user's locale, there is
no reasonable way to determine which of the Python encoding names will
always work correctly on these strings.

> > The way I see it, to fix this we have 2 basic choices when a Unicode
> > object is passed as a filename:
> > * we call the Unicode versions of the CRTL.
>
>    This is by far the better approach IMO as it is more general and will
> work for people who switch locales or who want to access files created by
> others using other locales. Although you can always use the horrid mangled
> "*~1" names.
>
> > * we auto-encode using the "mbcs" encoding, and still call the
> > non-Unicode versions of the CRTL.
>
>    This will improve things but to a lesser extent than the above. May be
> the best possible on 95.

I understand the above, but want to resist having different NT and 9x
versions of Python for obvious reasons.  I also wanted to avoid determining
at runtime whether the platform has Unicode support and magically switching
to the Unicode APIs.

I concur on the "may be the best possible on 95" and see no real downsides
on NT, other than the freak possibility of the default encoding being changed
_between_ us encoding a string and the OS decoding it.

Recall that my change is only to convert from Unicode to a string so the
file system can convert back to Unicode.  There is no real opportunity for
the current locale to change on this thread during this process.
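
The round trip described above can be sketched roughly as follows.  This is
an illustration only: it uses present-day Python 3 string semantics, and the
codecs "cp1252" and "cp1251" are stand-ins chosen for the example, not
anything named in this thread.  The real "mbcs" codec defers to whatever the
active Windows code page is and exists only on Windows.  The helper
round_trip() is hypothetical.

```python
def round_trip(name, encode_page, decode_page):
    """Encode a Unicode name as Python would, then decode it as the OS would.

    encode_page and decode_page are stand-ins for the active code page at
    the moment of encoding and at the moment of decoding.
    """
    return name.encode(encode_page).decode(decode_page)


name = "caf\u00e9.txt"

# Same code page on both sides: the file system sees the original name.
same = round_trip(name, "cp1252", "cp1252")

# The "freak possibility": the code page changes between the two steps,
# so the OS decodes the bytes into a different name.
changed = round_trip(name, "cp1252", "cp1251")

print(same == name)     # True
print(changed == name)  # False
```

As long as the code page is the same on both sides, the name survives
intact; the mismatch case only arises if the code page changes in between.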

I guess I see 3 options:

1) Do nothing, thereby forcing the user to manually encode the Unicode
object.  Only by encoding the string can they access these filenames, which
means the exact same issues apply.

2) Move to Unicode APIs where available, which will be a much deeper patch
and much harder to get right on non-Unicode Windows platforms.

3) Like 1, but simply automate the encoding task.

My proposal was to do (3).  It is not clear from your mail what you propose.
Like me, you seem to agree (2) would be perfect in an ideal world, but you
also agree we don't live in one.
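
A minimal sketch of what option (3) amounts to, under stated assumptions:
the helper names filesystem_codec() and auto_encode() are hypothetical, the
code is written against present-day Python 3 rather than the interpreter
under discussion, and the fallback used when "mbcs" is unavailable exists
only so the sketch runs off Windows.  It is not the actual patch.

```python
import codecs
import sys


def filesystem_codec():
    """Return "mbcs" where that codec exists (Windows), else a fallback."""
    try:
        codecs.lookup("mbcs")
        return "mbcs"
    except LookupError:
        # Assumption for illustration: off Windows, use the interpreter's
        # own file system encoding so the example still runs.
        return sys.getfilesystemencoding() or "utf-8"


def auto_encode(filename, codec=None):
    """Encode a Unicode filename automatically; pass byte strings through."""
    if isinstance(filename, str):
        return filename.encode(codec or filesystem_codec())
    return filename


print(auto_encode("spam.txt"))
print(auto_encode(b"already-encoded.txt"))
```

The only difference from option (1) is who performs the encode step: here
it happens inside the helper instead of in every caller.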

What is your recommendation?

Mark.