[Python-3000] Windows, sys.argv and unicode
Giovanni Bajo
rasky at develer.com
Sat Feb 16 16:20:28 CET 2008
Hello,
CPython 2.x (and 3.x) under Win32 has an issue with sys.argv. The list is
computed using the ANSI version of the Windows APIs[*]. The problem is
apparent when you have a file/directory whose name can't be represented in
the system encoding (e.g. a Japanese-named file or directory on a Western
Windows), because the Windows ANSI API will encode the filename to the
system encoding using what we call the "replace" policy, and sys.argv[]
will contain an entry like "c:\\foo\\??????????????.dat".
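To make the lossy conversion concrete, here is a minimal sketch that
imitates it in pure Python. It assumes a Western system encoding of cp1252
and uses a made-up Japanese filename; Python's "replace" error handler
stands in for what the ANSI API does, it is not the Windows code path
itself.

```python
# Hypothetical filename, chosen only for illustration.
original = "c:\\foo\\日本語のファイル.dat"

# Assumption: the system (ANSI) encoding is cp1252, as on a Western Windows.
# errors="replace" substitutes "?" for every character cp1252 cannot encode.
ansi_bytes = original.encode("cp1252", errors="replace")

# This is roughly what a script would then find in sys.argv.
round_tripped = ansi_bytes.decode("cp1252")

print(round_tripped)
print(round_tripped == original)  # False: the original name is unrecoverable
```

Once the characters have been replaced, nothing in the resulting string
lets the script find the real file again.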
At the moment, there's simply no way of passing such a file to a Python
script/application as an argument (e.g. if you double-click on that file,
and the file is associated with a Python application). This is a
widespread problem among Python applications; e.g. if you click on a
Japanese .torrent file, ABC (a BitTorrent client written in Python) won't
be able to open it and will complain "cannot access
file ??????????.torrent".
I understand that fixing this properly in the 2.x series might have
backward compatibility issues, but I propose that this be fixed at least
in the Python 3.x series, and I volunteer to write a patch. I would be
glad if someone with ANSI/Unicode/Windows expertise (MvL?) would show me
what he believes to be the correct way of approaching this problem.
My suggestion is that:
* At the Python level, we still expose a single sys.argv[], which will
contain Unicode strings. I think this exactly matches what Py3k does now.
(In the past, there were proposals to add a sys.argvu, but I guess
that no longer makes sense.)
* At the C level, I believe it involves using GetCommandLineW() and
CommandLineToArgvW() in WinMain.c, but should Py_Main/PySys_SetArgv() be
changed to also accept wchar_t** arguments? Or is it better to allow for
NULL to be passed (under Windows at least), so that the Windows code-path
in there can use GetCommandLineW()/CommandLineToArgvW() to get the
current process' arguments?
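As a rough illustration of the two Win32 calls named above, the following
sketch reaches them from Python through ctypes instead of from C. The
helper name get_unicode_argv is hypothetical, and this is only a way to
experiment with the API, not the proposed patch. Outside Windows the
helper returns None.

```python
import ctypes
import sys


def get_unicode_argv():
    """Return the process command line as a list of str (None off Windows)."""
    if sys.platform != "win32":
        return None

    from ctypes import wintypes

    GetCommandLineW = ctypes.windll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = wintypes.LPCWSTR

    CommandLineToArgvW = ctypes.windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [wintypes.LPCWSTR,
                                   ctypes.POINTER(ctypes.c_int)]
    CommandLineToArgvW.restype = ctypes.POINTER(wintypes.LPWSTR)

    LocalFree = ctypes.windll.kernel32.LocalFree
    LocalFree.argtypes = [wintypes.HLOCAL]
    LocalFree.restype = wintypes.HLOCAL

    argc = ctypes.c_int(0)
    argv = CommandLineToArgvW(GetCommandLineW(), ctypes.byref(argc))
    if not argv:
        raise ctypes.WinError()
    try:
        # Copy the strings out before the array is released.
        return [argv[i] for i in range(argc.value)]
    finally:
        # CommandLineToArgvW allocates one block that the caller must free.
        LocalFree(argv)


if __name__ == "__main__":
    print(get_unicode_argv())
```

Note that the returned list is the whole process command line, so it
starts with the interpreter executable and any interpreter options rather
than with the script name as sys.argv does.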
Thanks!
[*] In detail: it actually comes from __argc/__argv (see WinMain.c),
which in turn are computed by the CRT startup code. That code would adapt
to the build's ANSI/Unicode choice, but Python is compiled in ANSI mode.
--
Giovanni Bajo