[Python-Dev] Unicode and Windows

Mark Hammond mhammond@skippinet.com.au
Mon, 20 Mar 2000 11:39:31 -0800


I would like to discuss Unicode on the Windows platform, and how it relates
to the MBCS encodings that Windows uses.

My main goal here is to ensure that Unicode on Windows can make a round-trip
to and from native Unicode stores.  As an example, let's take the registry -
a Windows user should be able to read a Unicode value from the registry then
write it back.  The value written back should be _identical_ to the value
read.  Ditto for the file system: If the filesystem is Unicode, then I would
expect the following code:
  import os
  for fname in os.listdir("."):
    f = open(fname + ".tmp", "w")  # should keep the exact base name
    f.close()

to create filenames on the filesystem with the exact same base name, even
when the base name contains non-ASCII characters.


However, the Unicode patches do not appear to make this possible.  open()
uses PyArg_ParseTuple(args, "s...");  PyArg_ParseTuple() will automatically
convert a Unicode object to UTF-8, so we end up passing a UTF-8 encoded
string to the C runtime fopen function.

The end result of all this is that we end up with UTF-8 encoded names in the
registry/on the file system.  It does not seem possible to get a true
Unicode string onto either the file system or in the registry.
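
To make the failure concrete, here is a minimal sketch in present-day Python
(not the patched interpreter discussed here).  It assumes a Western European
ANSI code page (cp1252), and the helper name as_seen_by_ansi_api is
hypothetical; it only mimics what a byte-oriented API does when it is handed
UTF-8 bytes but interprets them in the locale's code page:

```python
def as_seen_by_ansi_api(name, ansi_codepage="cp1252"):
    """Hypothetical helper: encode `name` as UTF-8, then reinterpret those
    bytes the way a byte-oriented API using `ansi_codepage` would."""
    utf8_bytes = name.encode("utf-8")
    return utf8_bytes.decode(ansi_codepage)


original = "caf\u00e9"  # "cafe" with an acute accent on the final letter
stored = as_seen_by_ansi_api(original)

# The accented letter becomes two unrelated characters, so the name that
# lands on disk (or in the registry) no longer matches the one we started with.
print(repr(original), "->", repr(stored))
print("round trip preserved:", stored == original)
```

Pure ASCII names survive this unchanged, which is why the problem only shows
up once a name contains non-ASCII characters.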

Unfortunately, I'm not experienced enough to know the full ramifications, but
it _appears_ that on Windows the default "unicode to string" translation
should be done via the WideCharToMultiByte() API.  This will then pass an
MBCS-encoded 8-bit string to Windows, and the "right thing" should magically
happen.  The catch is that MBCS encoding is dependent on the current locale
(i.e., one MBCS sequence will mean completely different things depending on
the locale).  I don't see a portability issue here, as the documentation
could state that "Unicode->ASCII conversions use the most appropriate
conversion for the platform.  If the platform is not Unicode aware, then
UTF-8 will be used."
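
The locale dependence can be sketched in present-day Python as well.  This is
only an illustration: it uses two explicit code pages (cp932 and cp1252) as
stand-ins for "the current locale", and the helper names decode_under and
round_trips are hypothetical:

```python
SJIS_BYTES = b"\x83\x65"  # one double-byte character in code page 932


def decode_under(codepage, data=SJIS_BYTES):
    """Interpret the same MBCS byte sequence under a given code page."""
    return data.decode(codepage)


def round_trips(text, codepage):
    """True if `text` survives Unicode -> MBCS -> Unicode unchanged."""
    try:
        return text.encode(codepage).decode(codepage) == text
    except UnicodeEncodeError:
        # The code page has no representation for some character.
        return False


japanese = decode_under("cp932")   # a single katakana letter (U+30C6)
western = decode_under("cp1252")   # two unrelated Western characters

print(repr(japanese), repr(western))
print(round_trips(japanese, "cp932"), round_trips(japanese, "cp1252"))
```

So a conversion through the locale's code page round-trips only for
characters that code page can represent; anything else is lost or rejected.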

This issue is the final one before I release the win32reg module.  It seems
_critical_ to me that if Python supports Unicode and the platform supports
Unicode, then Python Unicode values must be capable of being passed to the
platform.  For the win32reg module I could quite possibly hack around the
problem, but the more general problem (characterized by the open() example
above) still remains...

Any thoughts?

Mark.