[Python-Dev] My work on Python3 and non-ascii paths is done
Victor Stinner
victor.stinner at haypocalc.com
Tue Oct 19 03:53:34 CEST 2010
Hi,
Seven months after my first commit related to this issue, the full test suite
of Python 3.2 passes with ASCII, ISO-8859-1 and UTF-8 locale encodings in a
non-ASCII source directory. It means that Python 3.2 now correctly processes
filenames in all modules, build scripts and other utilities, with any locale
encoding.
General changes:
* Encode/decode filenames with the locale encoding, instead of utf-8,
until the filesystem encoding is set
* mbcs encoding (Windows filesystem encoding) is now strict by default,
whereas it ignored unencodable characters and replaced undecodable bytes
in Python 3.1. The old behaviour is still available with the right error
handler: 'ignore' to encode, 'replace' to decode.
* tarfile uses utf-8 encoding on Windows (instead of mbcs), and the
surrogateescape error handler on all OSes
* sys.getfilesystemencoding() cannot be None anymore
* Don't accept bytearray as filenames anymore
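To illustrate the first and third points above, here is a minimal sketch of
how the surrogateescape error handler lets a filename survive a decode/encode
round trip even when a byte cannot be decoded. The helper name roundtrip() and
the sample filename are made up for this example; the three encodings are the
ones the test suite was run with.

```python
import sys


def roundtrip(raw: bytes, encoding: str) -> str:
    """Decode raw with surrogateescape and check it encodes back unchanged."""
    text = raw.decode(encoding, "surrogateescape")
    # Undecodable bytes become lone surrogates (U+DC80..U+DCFF), which the
    # same error handler turns back into the original bytes when encoding.
    assert text.encode(encoding, "surrogateescape") == raw
    return text


# Hypothetical filename: "café.txt" stored as ISO-8859-1 bytes.
raw = b"caf\xe9.txt"

ascii_name = roundtrip(raw, "ascii")        # 0xE9 is not ASCII
latin1_name = roundtrip(raw, "iso-8859-1")  # every byte is decodable
utf8_name = roundtrip(raw, "utf-8")         # 0xE9 alone is invalid UTF-8

print(ascii(ascii_name))
print(ascii(latin1_name))
print(ascii(utf8_name))

# The filesystem encoding is always a string now, never None.
print(sys.getfilesystemencoding())
```

With ASCII and UTF-8 the byte 0xE9 is kept as the lone surrogate U+DCE9,
whereas ISO-8859-1 decodes it to a real character; in all three cases the
original bytes are recovered.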
Changes of the Python API:
* Add os.environb: bytes version of os.environ, os.getenvb() function
and os.supports_bytes_environ constant
* Add os.fsencode() and os.fsdecode() functions
* Remove sys.setfilesystemencoding() function
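A short sketch of how the new functions might be used together. The helper
names roundtrip_filename() and read_env_bytes(), the sample filenames and the
variable name are made up for this example; the non-ASCII round trip is only
attempted on POSIX, where the filesystem encoding uses the surrogateescape
error handler.

```python
import os


def roundtrip_filename(raw: bytes) -> bytes:
    """Decode a bytes filename to str, then encode it back to bytes."""
    text = os.fsdecode(raw)
    return os.fsencode(text)


def read_env_bytes(key: bytes, default: bytes = b"") -> bytes:
    """Read an environment variable as bytes, whatever the platform."""
    if os.supports_bytes_environ:
        # os.getenvb() and os.environb only exist when this constant is true.
        return os.getenvb(key, default)
    # Fallback (e.g. Windows): go through the str API and encode the result.
    value = os.getenv(os.fsdecode(key))
    return default if value is None else os.fsencode(value)


print(roundtrip_filename(b"plain.txt"))
if os.name == "posix":
    # An undecodable byte survives thanks to surrogateescape.
    print(roundtrip_filename(b"caf\xe9.txt"))

# Hypothetical variable name, assumed to be unset.
print(read_env_bytes(b"HYPOTHETICAL_UNSET_VARIABLE_12345", b"fallback"))
```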
Changes of the C API:
* Add PyUnicode_EncodeFSDefault() function
* Add PyUnicode_FSDecoder() ParseTuple converter
* Add PySys_FormatStdout(), PySys_FormatStderr() and PyErr_WarnFormat()
functions
* Add PyUnicode_AsWideCharString() function: the caller doesn't need to
provide a buffer size.
* Add Py_UNICODE_strrchr(), Py_UNICODE_strcat(), PyUnicode_AsUnicodeCopy()
and Py_UNICODE_strncmp() functions
* PyUnicode_DecodeFSDefault() and PyUnicode_DecodeFSDefaultAndSize() use the
surrogateescape error handler
* File utilities: add _Py_wchar2char() (reverse of Py_char2wchar()),
_Py_stat() and _Py_fopen() functions; move all file utilities to
Python/fileutils.c
* The format string of PyUnicode_FromFormat() and PyErr_Format() must now be
pure ASCII: an error is raised on any non-ASCII character
* PyUnicode_FSConverter() doesn't accept bytearray anymore
Bugfixes:
* Fix modules: tarfile, pickle, pickletools, ctypes, subprocess, bz2, ssl,
profile, xmlrpclib, platform, libpython (gdb plugin), sqlite,
distutils.log, locale, _warnings, zipimport, imp
* Fix functions: os.exec*(), os.system(), ctypes.dlopen(), os.getenv(),
os.get_exec_path()
* Fix tests: test_gdb, test_httpservers, test_cmd_line, test_size,
test_generic_path, test_subprocess, test_doctest, test_cmd_line_script
* Fix utf-8 encoder to support error handlers producing a unicode string
(e.g. 'backslashreplace')
* Fix conversion from unicode to a wide character string if Py_UNICODE
and wchar_t have different sizes: UTF-16 => UTF-32 or UTF-32 => UTF-16
* Fix Python command line parser if the command line contains surrogates
* Avoid _PyUnicode_AsString() because it returns NULL if the string contains
surrogates, or catch the error
* Fix regrtest.py to support surrogate characters in the current working
directory and in the tracebacks
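As an illustration of the utf-8 encoder fix and of why surrogates need special
care in tracebacks, here is a minimal sketch at the Python level. The sample
filename is made up; it is decoded from ASCII with surrogateescape so that it
contains a lone surrogate.

```python
# Hypothetical filename containing one undecodable byte (0xE9).
name = b"caf\xe9.txt".decode("ascii", "surrogateescape")

# The strict utf-8 encoder refuses lone surrogates.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# An error handler that returns a str (here 'backslashreplace') gives a
# printable form instead, which is handy for messages and tracebacks.
printable = name.encode("utf-8", "backslashreplace")

print(strict_failed)
print(printable)
```

The backslashreplace form is meant for display only: it is not the original
filename, so it should not be passed back to the filesystem.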
I also wrote some tests and documentation.
The most difficult part was to debug Python initialization (Py_InitializeEx
and calculate_path) and the import machinery (import.c, zipimport.c), because
gdb sometimes crashes (for various reasons) and because the import
machinery is fragile and difficult to understand.
A special thanks to Marc-Andre Lemburg, Martin v. Löwis, Antoine Pitrou and
Amaury Forgeot d'Arc for their help, useful advice and code reviews!
--- Bonus: short story of PYTHONFSENCODING ---
In the middle of August, I created the PYTHONFSENCODING environment variable,
as suggested by Marc-Andre Lemburg. Because of this variable and because
Python used utf-8 until the filesystem encoding was known, I had to write ugly
and fragile "redecode" functions to redecode all filenames of all objects
(sys.path, sys.meta_path, sys.executable, sys.modules, all code objects,
etc.).
Then I found 4 issues related to PYTHONFSENCODING, inconsistencies between the
filesystem encoding and the locale encoding. It was not easy to decide how to
fix these issues, but in the end, we chose to drop the PYTHONFSENCODING
variable, use the locale encoding as the filesystem encoding, and always use
utf-8 as the filesystem encoding on Mac OS X.
--
Victor Stinner
http://www.haypocalc.com/