[Python-Dev] My work on Python3 and non-ascii paths is done

Tue Oct 19 03:53:34 CEST 2010

Hi,

Seven months after my first commit related to this issue, the full test suite 
of Python 3.2 passes with ASCII, ISO-8859-1 and UTF-8 locale encodings in a 
non-ASCII source directory. It means that Python 3.2 now correctly processes 
filenames in all modules, build scripts and other utilities, with any locale 
encoding.

General changes:

 * Encode/decode filenames with the locale encoding, instead of utf-8,
   until the filesystem encoding is set
 * mbcs encoding (Windows filesystem encoding) is now strict by default,
   whereas it ignored unencodable characters and replaced undecodable bytes
   in Python 3.1. The old behaviour is still available with the right error
   handler: 'ignore' to encode, 'replace' to decode.
 * tarfile uses utf-8 encoding on Windows (instead of mbcs), and the
   surrogateescape error handler on all OSes
 * sys.getfilesystemencoding() cannot be None anymore
 * Don't accept bytearray as filenames anymore
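
A minimal sketch of how these error handlers behave. It assumes the utf-8
and ascii codecs as stand-ins for mbcs (which only exists on Windows), so
that it runs on any platform; the file name is made up for illustration.

```python
# Bytes that are valid ISO-8859-1 ("café.txt") but not valid UTF-8.
raw = b"caf\xe9.txt"

# A strict decoder rejects the undecodable byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# surrogateescape maps the undecodable byte 0xE9 to the lone surrogate
# U+DCE9, so encoding again restores the original bytes.
name = raw.decode("utf-8", "surrogateescape")
restored = name.encode("utf-8", "surrogateescape")

# 'replace' (to decode) and 'ignore' (to encode) are lossy.
replaced = raw.decode("utf-8", "replace")          # U+FFFD replaces 0xE9
ignored = "caf\xe9.txt".encode("ascii", "ignore")  # the "é" is dropped

# ascii() keeps the output printable even with a lone surrogate.
print(strict_failed, ascii(name), restored == raw)
print(ascii(replaced), ignored)
```

The strict and surrogateescape handlers never lose data silently: the first
fails, the second round-trips. 'ignore' and 'replace' cannot be reversed.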

Changes of the Python API:

 * Add os.environb: bytes version of os.environ, os.getenvb() function
   and os.supports_bytes_environ constant
 * Add os.fsencode() and os.fsdecode() functions
 * Remove sys.setfilesystemencoding() function
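
A short sketch of the new Python functions. It assumes a POSIX system, where
the filesystem encoding uses the surrogateescape error handler; DEMO_PATH
and the file name are hypothetical names chosen for the example.

```python
import os

raw = b"data-\xff.bin"  # 0xFF never appears in valid UTF-8

# Assumption: POSIX. fsdecode() keeps undecodable bytes as lone
# surrogates, and fsencode() turns them back into the same bytes.
name = os.fsdecode(raw)
restored = os.fsencode(name)

if os.supports_bytes_environ:
    # os.environ and os.environb are two views of the same environment.
    os.environ["DEMO_PATH"] = name
    from_environb = os.environb[b"DEMO_PATH"]
    from_getenvb = os.getenvb(b"DEMO_PATH")
else:
    # No bytes environment on this platform (e.g. Windows).
    from_environb = from_getenvb = None

print(restored == raw, from_environb, from_getenvb)
```

Checking os.supports_bytes_environ first matters because os.environb and
os.getenvb() are only available when the constant is true.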

Changes of the C API:

 * Add PyUnicode_EncodeFSDefault() function
 * Add PyUnicode_FSDecoder() ParseTuple converter
 * Add PySys_FormatStdout(), PySys_FormatStderr() and PyErr_WarnFormat()
   functions
 * Add PyUnicode_AsWideCharString() function: it doesn't need a buffer size
 * Add Py_UNICODE_strrchr(), Py_UNICODE_strcat(), PyUnicode_AsUnicodeCopy()
   and Py_UNICODE_strncmp() functions
 * PyUnicode_DecodeFSDefault() and PyUnicode_DecodeFSDefaultAndSize() use the
   surrogateescape error handler
 * File utilities: add _Py_wchar2char() (reverse of Py_char2wchar()),
   _Py_stat() and _Py_fopen() functions; move all file utilities to
   Python/fileutils.c
 * The format string of PyUnicode_FromFormat() and PyErr_Format() must now
   be pure ASCII: a non-ASCII character raises an error
 * PyUnicode_FSConverter() doesn't accept bytearray anymore

Bugfixes:

 * Fix modules: tarfile, pickle, pickletools, ctypes, subprocess, bz2, ssl,
   profile, xmlrpclib, platform, libpython (gdb plugin), sqlite,
   distutils.log, locale, _warnings, zipimport, imp
 * Fix functions: os.exec*(), os.system(), ctypes.dlopen(), os.getenv(),
   os.get_exec_path()
 * Fix tests: test_gdb, test_httpservers, test_cmd_line, test_size,
   test_generic_path, test_subprocess, test_doctest, test_cmd_line_script
 * Fix utf-8 encoder to support error handlers producing a unicode string
   (e.g. 'backslashreplace')
 * Fix conversion from unicode to a wide character string if Py_UNICODE 
   and wchar_t have different sizes: UTF-16 => UTF-32 or UTF-32 => UTF-16
 * Fix Python command line parser if the command line contains surrogates
 * Avoid _PyUnicode_AsString() because it returns NULL if the string contains
   surrogates, or catch the error
 * Fix regrtest.py to support surrogate characters in the current working
   directory and in the tracebacks
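
To illustrate the utf-8 encoder fix, here is a small sketch. It assumes a
made-up path containing a lone surrogate, as produced by surrogateescape
when a filename has an undecodable byte.

```python
# U+DCFF is what surrogateescape produces for the undecodable byte 0xFF.
text = "dir\udcff/file"

# The strict utf-8 encoder refuses lone surrogates.
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'backslashreplace' returns a unicode replacement string, which the
# encoder then has to encode itself: safe for logs and tracebacks.
escaped = text.encode("utf-8", "backslashreplace")

# 'surrogateescape' restores the original byte instead.
original = text.encode("utf-8", "surrogateescape")

print(strict_failed, escaped, original)
```

'backslashreplace' is the handler to use for display, 'surrogateescape' the
one to use when the bytes go back to the operating system.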

I also wrote some tests and documentation.

The most difficult part was to debug Python initialization (Py_InitializeEx 
and calculate_path) and the import machinery (import.c, zipimport.c), because 
gdb sometimes crashes (for various reasons) and because the import 
machinery is fragile and difficult to understand.

A special thanks to Marc-Andre Lemburg, Martin v. Löwis, Antoine Pitrou and 
Amaury Forgeot d'Arc for their help, useful advice and code reviews!

--- Bonus: short story of PYTHONFSENCODING ---

In the middle of August, I created the PYTHONFSENCODING environment variable, 
as suggested by Marc-Andre Lemburg. Because of this variable and because 
Python used utf-8 until the filesystem encoding was known, I had to write ugly 
and fragile "redecode" functions to redecode all filenames of all objects 
(sys.path, sys.meta_path, sys.executable, sys.modules, all code objects, 
etc.).

Then I found 4 issues related to PYTHONFSENCODING, all inconsistencies between 
the filesystem encoding and the locale encoding. It was not easy to decide how 
to fix these issues, but in the end, we chose to drop the PYTHONFSENCODING 
variable, use the locale encoding as the filesystem encoding, and always use 
utf-8 as the filesystem encoding on Mac OS X.
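
A quick way to inspect the result on a given system, as a sketch only. Note
that later Python versions may report a filesystem encoding that differs
from the locale encoding, so the two values are printed, not compared.

```python
import locale
import sys

# Never None since this work; always the name of a codec.
fs_encoding = sys.getfilesystemencoding()

# The encoding of the current locale, without changing the locale.
locale_encoding = locale.getpreferredencoding(False)

print("filesystem encoding:", fs_encoding)
print("locale encoding:    ", locale_encoding)

if sys.platform == "darwin":
    # On Mac OS X the filesystem encoding is always utf-8.
    print("Mac OS X: filesystem encoding is", fs_encoding)
```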

-- 
Victor Stinner
http://www.haypocalc.com/