Issue #8610: Set default file system encoding to ascii on error?
Python 3.0 introduced the PyUnicode_DecodeFSDefault() and PyUnicode_DecodeFSDefaultAndSize() functions. These functions fall back to UTF-8 if getting the file system encoding failed or the encoding is unknown (there is no nl_langinfo(CODESET) function).

UTF-8 is not a good choice for the fallback because it's incompatible with other encodings like Latin1. I would like to fall back to ASCII on error, which is compatible with all encodings (thanks to surrogateescape).

I would like to ensure that the sys.getfilesystemencoding() result cannot be None, because Python 3.2 uses it on Unix to decode os.environb and to encode filenames in subprocess. I can implement a fallback for os.environb and subprocess (and other functions calling sys.getfilesystemencoding()), but I prefer to have a reliable sys.getfilesystemencoding() function.

This change doesn't concern Windows and Mac OS X because the encoding is hardcoded there (mbcs and utf-8 respectively). On Unix, I don't know in which cases nl_langinfo() can fail. An empty environment is not an error: nl_langinfo(CODESET) returns "ascii". I think that few users (or none) would notice this change.

--
Victor Stinner
http://www.haypocalc.com/
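To make the proposal above concrete, here is a minimal Python-level sketch of the fallback idea. It is not the C-level change discussed in the issue: pick_fs_encoding() and decode_fs_bytes() are hypothetical helper names, and the None value is passed in by hand to simulate a failed encoding lookup.

```python
import sys


def pick_fs_encoding(reported):
    """Return `reported`, or "ascii" when no encoding could be determined."""
    return reported if reported else "ascii"


def decode_fs_bytes(raw, reported):
    """Decode OS-provided bytes; undecodable bytes become lone surrogates."""
    return raw.decode(pick_fs_encoding(reported), "surrogateescape")


# Normal case: use whatever the interpreter reports.
current = pick_fs_encoding(sys.getfilesystemencoding())

# Simulated failure (assumption): pretend the lookup returned None.
name = decode_fs_bytes(b"caf\xe9.txt", None)

# ascii() is used because printing lone surrogates directly may fail.
print(current, ascii(name))
```

With the ASCII fallback the Latin-1 byte 0xE9 is kept as an escaped surrogate, so encoding the name back with the same codec and error handler gives the original bytes.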
On Fri, 07 May 2010 13:05:52 +0200, Victor Stinner wrote:
UTF-8 is not a good choice for the fallback because it's incompatible with other encodings like Latin1. I would like to fallback to ASCII on error which is compatible with all encodings (thanks to surrogateescape).
What do you mean with "compatible with all encodings thanks to surrogateescape"?

>>> "àéè".encode("ascii", "surrogateescape")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
I would like to ensure that sys.getfilesystemencoding() result cannot be None, because Python 3.2 uses it on Unix to decode os.environb and to encode filenames in subprocess. I can implement a fallback for os.environb and subprocess (and other functions calling sys.getfilesystemencoding()), but I prefer to have a reliable sys.getfilesystemencoding() function.
Having a reliable sys.getfilesystemencoding() would be a good thing indeed.
This change doesn't concern Windows and Mac OS X because the encoding is hardcoded (mbcs, utf-8). On Unix, I don't know in which case nl_langinfo() can fail. Empty environment is not an error: nl_langinfo(CODESET) returns "ascii". I think that few (or no) user would notice this change.
Ok, it sounds like a good compromise.

Regards

Antoine.
On Friday 07 May 2010 13:24:18, Antoine Pitrou wrote:
UTF-8 is not a good choice for the fallback because it's incompatible with other encodings like Latin1. I would like to fallback to ASCII on error which is compatible with all encodings (thanks to surrogateescape).
What do you mean with "compatible with all encodings thanks to surrogateescape"?

>>> "àéè".encode("ascii", "surrogateescape")
...
UnicodeEncodeError: 'ascii' codec can't encode characters ...
ascii+surrogateescape can *decode* anything:

>>> b"a\xc3\xff".decode('ascii', 'surrogateescape')
'a\udcc3\udcff'
Encoding with ascii+surrogateescape raises a UnicodeEncodeError for non-ASCII characters (except for surrogates). I think it's better to raise an error than to create utf8 filenames on a latin1 file system.

--

I forgot to mention Marc Lemburg's proposal to create a PYTHONFSENCODING environment variable: #8622.

--
Victor Stinner
http://www.haypocalc.com/
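The asymmetry described in this reply can be checked with a short sketch. roundtrip() and can_encode() are hypothetical helper names used only for illustration; the codec name and the error handler are the ones from the message.

```python
def roundtrip(raw):
    """Decode with ascii+surrogateescape, then encode back the same way."""
    text = raw.decode("ascii", "surrogateescape")
    return text, text.encode("ascii", "surrogateescape")


def can_encode(name):
    """Report whether `name` can be encoded with ascii+surrogateescape."""
    try:
        name.encode("ascii", "surrogateescape")
    except UnicodeEncodeError:
        return False
    return True


text, restored = roundtrip(b"a\xc3\xff")
assert text == "a\udcc3\udcff"      # same result as in the message
assert restored == b"a\xc3\xff"     # the original bytes are recovered

assert can_encode("abc")            # pure ASCII is fine
assert can_encode("a\udcc3")        # an escaped byte is fine
assert not can_encode("\u00e0\u00e9\u00e8")  # "àéè" is rejected
```

Decoding never fails and is reversible, while encoding genuinely non-ASCII text raises an error instead of silently picking a byte representation.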
Victor Stinner wrote:
Python 3.0 introduced PyUnicode_DecodeFSDefault() and PyUnicode_DecodeFSDefaultAndSize() functions. These functions fallback to UTF-8 if getting the file system encoding failed or the encoding is unknown (there is not nl_langinfo(CODESET) function).
UTF-8 is not a good choice for the fallback because it's incompatible with other encodings like Latin1. I would like to fallback to ASCII on error which is compatible with all encodings (thanks to surrogateescape).
I would like to ensure that sys.getfilesystemencoding() result cannot be None, because Python 3.2 uses it on Unix to decode os.environb and to encode filenames in subprocess. I can implement a fallback for os.environb and subprocess (and other functions calling sys.getfilesystemencoding()), but I prefer to have a reliable sys.getfilesystemencoding() function.
This change doesn't concern Windows and Mac OS X because the encoding is hardcoded (mbcs, utf-8). On Unix, I don't know in which case nl_langinfo() can fail. Empty environment is not an error: nl_langinfo(CODESET) returns "ascii". I think that few (or no) user would notice this change.
+1 on that change.

The UTF-8 fallback has these major problems:

 * it hides errors by always having the Unicode-to-bytes conversion succeed

 * it can cause applications to write files and create directories with wrongly encoded names (e.g. using UTF-8 on a Latin-1 file system)

Together with

 [issue8622] Add PYTHONFSENCODING environment variable:
 http://bugs.python.org/issue8622

which reduces the Python3 reliance on encoding guesswork, the change would make Python3 more user friendly and reduce the number of bummers users run into when waking up in the all-new Unicode world of Python3.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, May 07 2010)
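The two problems listed in this reply can be illustrated with a small sketch. misread_as_latin1() is a hypothetical helper that models a Latin-1 system displaying a name whose bytes were written as UTF-8; it does not touch a real file system.

```python
def misread_as_latin1(name):
    """Encode as UTF-8, then interpret the bytes the way a Latin-1 system would."""
    return name.encode("utf-8").decode("latin-1")


original = "caf\u00e9.txt"            # "café.txt"

# Problem 1: the UTF-8 conversion always succeeds, so nothing signals a mismatch.
utf8_bytes = original.encode("utf-8")
assert utf8_bytes == b"caf\xc3\xa9.txt"

# Problem 2: on a Latin-1 file system the name shows up wrongly encoded.
garbled = misread_as_latin1(original)
assert garbled == "caf\u00c3\u00a9.txt"   # displays as "cafÃ©.txt"

# With an ASCII fallback the same name is refused instead of being garbled.
try:
    original.encode("ascii", "surrogateescape")
    refused = False
except UnicodeEncodeError:
    refused = True
assert refused
```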
participants (3)
- Antoine Pitrou
- M.-A. Lemburg
- Victor Stinner