[issue8242] Support surrogates in import ; install Python in a non-ASCII directory
STINNER Victor
report at bugs.python.org
Sat Mar 27 02:12:37 CET 2010
New submission from STINNER Victor <victor.stinner at haypocalc.com>:
If the fullpath to the python3 binary contains a non-ASCII character and the file system encoding is ASCII, Python fails with:
---
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Fatal Python error: Py_Initialize: can't initialize sys standard streams
ImportError: No module named encodings.utf_8
Abandon
---
The file system encoding is set to ASCII if there is no locale (eg. LANG=C).
The problem is that the command line argument, especially argv[0], is stored to a wchar_t* string using surrogates to store undecodable bytes.
Attached patch fixes calculate_path() and import functions to support surrogates. Details:
* Initialize Py_FileSystemDefaultEncoding earlier in Py_InitializeEx(), because its value is required to encode unicode using surrogates to bytes
* Rename char2wchar() to _Py_char2wchar(), the function is not more static ; and create function _Py_wchar2char()
* Escape surrogates (reimplement surrogateescape decoder) in calculate_path() subfunctions (_wstat, _wgetcwd, _Py_wreadlink)
* Use surrogateescape error handler in find_module(), NullImporter_init() and zipimporter_init()
* Write a "fastpath" (I don't know the right term: is it an hack?) for utf-8 encoding with surrogateescape error handler in PyUnicode_AsEncodedObject() and PyUnicode_AsEncodedString(): required because these functions are called by codecs module is initialized
The patch is a work in progress: there are some FIXME (I don't know if the string should be encoded/decoded using surrogates or not).
I only tested ASCII and UTF-8 file system encodings. I don't know if we can support more encodings. Python has few builtin encodings. Other encodings are implemented in Python: we have to import them, but we need the codec to import a module, so...
I don't think that Windows is affected by this issue because it has a better API for unicode filenames and command line arguments, and most patched functions are surrounded by #ifndef WINDOWS ... #endif
----------
components: Unicode
files: surrogates_bootstrap-4.patch
keywords: patch
messages: 101815
nosy: haypo
severity: normal
status: open
title: Support surrogates in import ; install Python in a non-ASCII directory
versions: Python 3.1, Python 3.2
Added file: http://bugs.python.org/file16671/surrogates_bootstrap-4.patch
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue8242>
_______________________________________
More information about the Python-bugs-list
mailing list