[issue9630] Reencode filenames of all module and code objects when setting the filesystem encoding

STINNER Victor report at bugs.python.org
Wed Aug 18 01:47:06 CEST 2010


New submission from STINNER Victor <victor.stinner at haypocalc.com>:

Python 3 has a very important variable: the filesystem encoding, sys.getfilesystemencoding(). It is used to encode and decode filenames to access the filesystem, to encode program arguments in subprocess, etc.
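For illustration only (not part of the patch), a minimal sketch of how the filesystem encoding is used to convert a filename to bytes and back. It assumes a current Python 3 where os.fsencode() and os.fsdecode() are available; the filename "example.txt" is made up.

```python
import os
import sys

# The encoding Python uses to convert filenames between str and bytes.
encoding = sys.getfilesystemencoding()

# Hypothetical filename, used only for this illustration.
name = "example.txt"

# os.fsencode()/os.fsdecode() apply the filesystem encoding
# (with its error handler) in each direction.
raw = os.fsencode(name)
roundtrip = os.fsdecode(raw)

print(encoding, raw, roundtrip)
```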

The encoding is hardcoded to "mbcs" on Windows and "utf-8" on Mac OS X. On other OSes, Python gets the encoding from the locale. The problem is that the code that gets the locale encoding loads Python modules (e.g. locale), and Python uses a default encoding until the locale encoding is known. As a result, the filenames of modules and code objects created before Python sets the locale encoding are encoded with the old encoding.

The default encoding is "utf-8". If the locale encoding is also "utf-8", there is no problem because the filenames are correctly encoded. If the locale encoding is different, we keep filenames encoded in the wrong encoding.

It gets worse when the locale encoding is unable to encode the filenames, e.g. with the ASCII encoding.
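A small sketch of the two failure modes described above, using plain str/bytes conversions rather than real module filenames. The filename "é.py" and the choice of latin-1 as the "other" locale encoding are assumptions for the example.

```python
# Hypothetical non-ASCII filename.
name = "\u00e9.py"  # "é.py"

# Case 1: bytes written with one encoding, read back with another.
# The result is a different (wrong) filename, not an error.
wrong = name.encode("utf-8").decode("latin-1")

# Case 2: the locale encoding cannot represent the filename at all.
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(repr(wrong), ascii_ok)
```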

--

A solution would be to avoid loading any Python module, but I don't think that it is possible. The locale encoding can be something other than ascii, latin-1, utf-8 or mbcs. The locale encoding can also be an alias like 'utf8' (instead of 'utf-8'), 'iso-8859-1' (Python uses 'latin_1') or 'ANSI_x3.4_1968' (for 'ascii'), and encoding aliases are implemented in Lib/encodings/aliases.py which is... a Python module.
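To make the alias point concrete, here is a sketch that resolves the aliases mentioned above through codecs.lookup(), which relies on the encodings package (and its aliases module) to map a name to a codec. The helper same_codec() is a hypothetical name introduced for this example.

```python
import codecs

def same_codec(alias, canonical):
    """Return True if both names resolve to the same codec (hypothetical helper)."""
    return codecs.lookup(alias).name == codecs.lookup(canonical).name

# Alias pairs taken from the message above.
pairs = [
    ("utf8", "utf-8"),
    ("iso-8859-1", "latin_1"),
    ("ANSI_x3.4_1968", "ascii"),
]

results = {alias: same_codec(alias, canonical) for alias, canonical in pairs}
print(results)
```

Resolving any of these names goes through Python-level code, which is why the lookup cannot easily happen before any module is loaded.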

--

I wrote a patch to reencode filenames of all module and code objects in initfsencoding() when the locale encoding is known.

I tested my patch on my import_unicode branch (branch to fix #8611, see also #9425: issue to merge the branch to py3k). I would like one or more reviews of the patch because it is long and complex. Please check for refleaks :-)

--

About the patch.

I don't know how to list *all* code objects, so I created a list that stores weak references to all code objects; the list is filled by the code object constructor. The list is destroyed when initfsencoding() exits (early in Python initialization).
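The idea can be sketched in pure Python as follows. This is not the patch itself, which is written in C: TrackedCode, reencode_all() and the use of the surrogateescape error handler are assumptions made for the illustration.

```python
import weakref

class TrackedCode:
    """Hypothetical stand-in for a code object (not CPython's real type)."""

    # Weak references to every instance created so far.
    # Set to None once the registry has been destroyed.
    _registry = []

    def __init__(self, filename):
        self.filename = filename
        if TrackedCode._registry is not None:
            TrackedCode._registry.append(weakref.ref(self))

def reencode_all(old_encoding, new_encoding):
    """Reencode the filename of every live object, then drop the registry."""
    registry, TrackedCode._registry = TrackedCode._registry, None
    fixed = 0
    for ref in registry:
        obj = ref()
        if obj is None:
            # The object was already freed; a weak reference did not keep it alive.
            continue
        # Recover the original bytes, then decode them with the real encoding.
        raw = obj.filename.encode(old_encoding, "surrogateescape")
        obj.filename = raw.decode(new_encoding, "surrogateescape")
        fixed += 1
    return fixed
```

Because the registry holds weak references only, objects that are freed before the reencoding step are simply skipped instead of being kept alive by the list.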

There is a FIXME: I don't know if sys.path_importer_cache keys should also be reencoded.

I tried to apply all remarks made on the first patch (posted on Rietveld for #9425). The patch now stores weak references instead of strong references to code objects in the code object list.

(r84168 creates PyModule_GetFilenameObject, function needed by this patch)

----------
components: Interpreter Core, Unicode
files: reencode_modules_path.patch
keywords: patch
messages: 114191
nosy: haypo
priority: normal
severity: normal
status: open
title: Reencode filenames of all module and code objects when setting the filesystem encoding
versions: Python 3.2
Added file: http://bugs.python.org/file18560/reencode_modules_path.patch

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue9630>
_______________________________________

