[New-bugs-announce] [issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

STINNER Victor report at bugs.python.org
Thu Sep 30 00:36:23 CEST 2010

New submission from STINNER Victor <victor.stinner at haypocalc.com>:

On UNIX/BSD systems, Python decodes its command line arguments with the locale encoding, whereas subprocess encodes arguments with the filesystem encoding. If the two encodings are different, the arguments are decoded incorrectly.
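A minimal sketch of the mismatch, simulated in pure Python without spawning a process. The two encodings and the roundtrip() helper are assumptions chosen for illustration, not the encodings of any particular system:

```python
# Hypothetical encodings, chosen only to illustrate the mismatch.
FS_ENCODING = "utf-8"            # used by the parent to encode the argument
LOCALE_ENCODING = "iso-8859-1"   # used by the child to decode the argument


def roundtrip(arg, encode_with, decode_with):
    """Encode an argument as the parent would, decode it as the child would."""
    raw = arg.encode(encode_with, "surrogateescape")
    return raw.decode(decode_with, "surrogateescape")


original = "caf\xe9"
mismatched = roundtrip(original, FS_ENCODING, LOCALE_ENCODING)
matched = roundtrip(original, FS_ENCODING, FS_ENCODING)

# ascii() keeps the output printable whatever the terminal encoding is.
print(ascii(original), ascii(mismatched), ascii(matched))
```

With different encodings the child sees a different string than the one the parent passed; with the same encoding on both sides the argument survives unchanged.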

There was already issue #4388, but it was closed because it was specific to old versions of Mac OS X. With the PYTHONFSENCODING environment variable (added in Python 3.2), it is easy to trigger this issue: run Python with a filesystem encoding different from the locale encoding. The attached script demonstrates the bug.


I see two possible encodings to encode and decode command line arguments (with surrogateescape error handler):

 (a) filesystem encoding
 (b) locale encoding

Decoding Python command line arguments is one of the first operations executed when running Python, in the main() function. We don't have the import machinery or the codec API available at that point, so I don't see how we can use the filesystem encoding here. Read issue #9630 to see how complex it is to use the filesystem encoding when initializing Python.

Using the locale encoding is easier because we already have the _Py_char2wchar() and _Py_wchar2char() functions to decode/encode with the locale encoding and the surrogateescape error handler. These functions use the wchar_t* type, which is less practical than PyUnicodeObject*, but it is an advantage because the wchar_t* type doesn't need Python to be completely initialized (whereas some PyUnicode methods load modules, e.g. encode and decode).
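For readers unfamiliar with the surrogateescape error handler, here is a small Python-level sketch of what it does. It only illustrates the handler's behaviour through str/bytes methods and is not the C implementation of _Py_char2wchar(); the sample bytes are an assumption:

```python
# Bytes that are not valid UTF-8 (0xE9 is "e acute" in ISO-8859-1).
raw = b"caf\xe9"

# Undecodable bytes are mapped to lone surrogates U+DC80..U+DCFF
# instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same codec and handler restores the original bytes.
restored = text.encode("utf-8", "surrogateescape")

print(ascii(text), restored == raw)
```

The decode/encode pair is lossless only when the same encoding is used in both directions, which is the requirement discussed in this report.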

In #8775, I proposed to create a new variable to store the "command line encoding": sys.getcmdlineencoding(). But this issue was closed because there was only one use case: #4388 (which was closed but not fixed).

I don't know, or don't really care, how sys.getcmdlineencoding() should be initialized. The important point is that we have to use the same encoding to decode and encode command line arguments.


I don't really know if using another encoding is the right solution. Maybe the problem is that the filesystem encoding should not be controllable by the user?

And what about environment variables: should we continue to encode and decode them with the filesystem encoding, or should we use the new "command line encoding"?

components: Interpreter Core, Unicode
files: locale_fs_encoding.py
messages: 117669
nosy: haypo
priority: normal
severity: normal
status: open
title: Command line arguments are not correctly decoded if locale and filesystem encodings are different
versions: Python 3.2
Added file: http://bugs.python.org/file19062/locale_fs_encoding.py

Python tracker <report at bugs.python.org>
