RFC: PEP 540 version 3 (Add a new UTF-8 mode)

Hi,

I also implemented my PEP 540, you can now test it! Use the latest patch
attached to:

http://bugs.python.org/issue29240

I made multiple changes since the first version of my PEP:

* The UTF-8 Strict mode now only uses strict for inputs and outputs: it
  keeps surrogateescape for operating system data. Read the "Use the
  strict error handler for operating system data" alternative for the
  rationale.

* The POSIX locale now enables the UTF-8 mode. See the "Don't modify the
  encoding of the POSIX locale" alternative for the rationale.

* Specify the priority between -X utf8, PYTHONUTF8, PYTHONIOENCODING, etc.

The PEP version 3 has a longer rationale with more examples. IMHO the
"List a directory into stdout" use case is the most representative
example of the "UNIX should just work" expectation and of encoding issues:

https://www.python.org/dev/peps/pep-0540/#list-a-directory-into-stdout

It reads data from the operating system (directory content) and writes
it into an output (stdout). It combines two things which are similar but
different in subtle ways. I added examples with commands and their
output to this use case, to have a more "real world" example instead of
a long list of theoretical things :-)

Read the PEP 540 online (HTML):

https://www.python.org/dev/peps/pep-0540/

Full text below.

Victor


PEP: 540
Title: Add a new UTF-8 mode
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner <victor.stinner@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 5-January-2016
Python-Version: 3.7


Abstract
========

Add a new UTF-8 mode, disabled by default, to ignore the locale and
force the usage of the UTF-8 encoding.

Basically, the UTF-8 mode behaves as Python 2: it "just works" and
doesn't bother users with encodings, but it can produce mojibake. The
UTF-8 mode can be configured as strict to prevent mojibake.

A new ``-X utf8`` command line option and a new ``PYTHONUTF8``
environment variable are added to control the UTF-8 mode.
The POSIX locale enables the UTF-8 mode.


Rationale
=========

"It's not a bug, you must fix your locale" is not an acceptable answer
----------------------------------------------------------------------

Since Python 3.0 was released in 2008, the usual answer to users getting
Unicode errors is to ask developers to fix their code to handle Unicode
properly. Most applications and Python modules were fixed, but users
keep reporting Unicode errors regularly: see the long list of issues in
the `Links`_ section below.

In fact, a second class of bugs comes from a locale which is not
properly configured. The usual answer to such bug reports is: "it is not
a bug, you must fix your locale".

Technically, the answer is correct, but from a practical point of view,
the answer is not acceptable. In many cases, "fixing the issue" is a
hard task. Moreover, sometimes, the usage of the POSIX locale is
deliberate.

A good example of a concrete issue is build systems which create a fresh
environment for each build using a chroot, a container, a virtual
machine or something else to get reproducible builds. Such setups
usually use the POSIX locale. To get 100% reproducible builds, the POSIX
locale is a good choice: see the `Locales section of
reproducible-builds.org
<https://reproducible-builds.org/docs/locales/>`_.

UNIX users don't expect Unicode errors, since the common command line
tools like ``cat``, ``grep`` or ``sed`` never fail with Unicode errors.
These users expect that Python 3 "just works" with any locale and
doesn't bother them with encodings. From their point of view, the bug is
not their locale but is obviously Python 3.

Since Python 2 handles data as bytes, it's rarer in Python 2 compared to
Python 3 to get Unicode errors. It also explains why users perceive
Python 3 as the root cause of their Unicode errors.

Some users expect that Python 3 just works with any locale and so don't
bother with mojibake, whereas some developers are working hard to
prevent mojibake and so expect that Python 3 fails early before creating
mojibake.

Since different groups of users have different expectations, there is no
silver bullet which solves all issues at once. Last but not least,
backward compatibility should be preserved whenever possible.

Locale and operating system data
--------------------------------

.. _operating system data:

Python uses an encoding called the "filesystem encoding" to decide how
to encode and decode data from/to the operating system:

* file content
* command line arguments: ``sys.argv``
* standard streams: ``sys.stdin``, ``sys.stdout``, ``sys.stderr``
* environment variables: ``os.environ``
* filenames: ``os.listdir(str)`` for example
* pipes: ``subprocess.Popen`` using ``subprocess.PIPE`` for example
* error messages: ``os.strerror(code)`` for example
* user and terminal names: ``os``, ``grp`` and ``pwd`` modules
* host name, UNIX socket path: see the ``socket`` module
* etc.

At startup, Python calls ``setlocale(LC_CTYPE, "")`` to use the user
``LC_CTYPE`` locale and then stores the locale encoding as the
"filesystem encoding". It's possible to get this encoding using
``sys.getfilesystemencoding()``. In the whole lifetime of a Python
process, the same encoding and error handler are used to encode and
decode data from/to the operating system.

The ``os.fsdecode()`` and ``os.fsencode()`` functions can be used to
decode and encode operating system data. These functions use the
filesystem error handler: ``sys.getfilesystemencodeerrors()``.

.. note::
   In some corner cases, the *current* ``LC_CTYPE`` locale must be used
   instead of ``sys.getfilesystemencoding()``. For example, the ``time``
   module uses the *current* ``LC_CTYPE`` locale to decode timezone
   names.
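
As a rough illustration of the functions named in this section, the
following sketch prints the filesystem encoding and error handler of the
running interpreter and round-trips a byte string through
``os.fsdecode()`` and ``os.fsencode()``. The helper name ``roundtrip``
and the sample bytes are only illustrative, and the values printed
depend on the platform and the locale:

```python
import os
import sys


def roundtrip(raw: bytes) -> bytes:
    """Decode operating system data to str, then encode it back to bytes."""
    text = os.fsdecode(raw)   # uses the filesystem encoding and error handler
    return os.fsencode(text)  # same encoding and error handler, other direction


# Both values are fixed at startup and stay the same for the whole process.
print("filesystem encoding:     ", sys.getfilesystemencoding())
print("filesystem error handler:", sys.getfilesystemencodeerrors())

# Pure ASCII data survives the round trip with any ASCII-compatible encoding.
print(roundtrip(b"abc") == b"abc")
```

When the error handler is ``surrogateescape`` (the usual case outside
Windows), the same round trip is also expected to preserve bytes that
are not decodable from the filesystem encoding.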

The POSIX locale and its encoding
---------------------------------

The following environment variables are used to configure the locale, in
this preference order:

* ``LC_ALL``, most important variable
* ``LC_CTYPE``
* ``LANG``

The POSIX locale, also known as "the C locale", is used:

* if the first set variable is set to ``"C"``
* if all these variables are unset, for example when a program is
  started in an empty environment.

The encoding of the POSIX locale must be ASCII or a superset of ASCII.

On Linux, the POSIX locale uses the ASCII encoding.

On FreeBSD and Solaris, ``nl_langinfo(CODESET)`` announces an alias of
the ASCII encoding, whereas ``mbstowcs()`` and ``wcstombs()`` functions
use the ISO 8859-1 encoding (Latin1) in practice. The problem is that
``os.fsencode()`` and ``os.fsdecode()`` use the
``locale.getpreferredencoding()`` codec. For example, if command line
arguments are decoded by ``mbstowcs()`` and encoded back by
``os.fsencode()``, a ``UnicodeEncodeError`` exception is raised instead
of retrieving the original byte string.

To fix this issue, since Python 3.4, Python checks whether
``mbstowcs()`` really uses the ASCII encoding when ``LC_CTYPE`` uses the
POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an
alias to ASCII). If not (the effective encoding is not ASCII), Python
uses its own ASCII codec instead of using ``mbstowcs()`` and
``wcstombs()`` functions for `operating system data`_.

See the `POSIX locale (2016 Edition)
<http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html>`_.

POSIX locale used by mistake
----------------------------

In many cases, the POSIX locale is not really expected by users who get
it by mistake. Examples:

* program started in an empty environment
* User forcing LANG=C to get messages in English
* LANG=C used for bad reasons, without being aware of the ASCII encoding
* SSH shell
* Linux installed with no configured locale
* chroot environment, Docker image, container, ... with no locale
  configured
* User locale set to a non-existing locale, typo in the locale name for
  example

C.UTF-8 and C.utf8 locales
--------------------------

Some UNIX operating systems provide a variant of the POSIX locale using
the UTF-8 encoding:

* Fedora 25: ``"C.utf8"`` or ``"C.UTF-8"``
* Debian (eglibc 2.13-1, 2011), Ubuntu: ``"C.UTF-8"``
* HP-UX: ``"C.utf8"``

It was proposed to add a ``C.UTF-8`` locale to the glibc: `glibc C.UTF-8
proposal <https://sourceware.org/glibc/wiki/Proposals/C.UTF-8>`_.

It is not planned to add such a locale to BSD systems.

Popularity of the UTF-8 encoding
--------------------------------

Python 3 uses UTF-8 by default for Python source files.

On Mac OS X, Windows and Android, Python always uses UTF-8 for operating
system data. For Windows, see the `PEP 529`_: "Change Windows filesystem
encoding to UTF-8".

On Linux, UTF-8 became the de facto standard encoding, replacing legacy
encodings like ISO 8859-1 or ShiftJIS. For example, using different
encodings for filenames and standard streams is likely to create
mojibake, so UTF-8 is now used *everywhere*.

The UTF-8 encoding is the default encoding of the XML and JSON file
formats. In January 2017, UTF-8 was used in `more than 88% of web pages
<https://w3techs.com/technologies/details/en-utf8/all/all>`_ (HTML,
Javascript, CSS, etc.).

See `utf8everywhere.org <http://utf8everywhere.org/>`_ for more general
information on the UTF-8 codec.

.. note::
   Some applications and operating systems (especially Windows) use Byte
   Order Markers (BOM) to indicate the used Unicode encoding: UTF-7,
   UTF-8, UTF-16-LE, etc. BOM are not well supported and rarely used in
   Python.

Old data stored in different encodings and surrogateescape
----------------------------------------------------------

Even if UTF-8 became the de facto standard, there are still systems in
the wild which don't use UTF-8. And there is a lot of data stored in
different encodings.
For example, an old USB key using the ext3 filesystem with filenames
encoded to ISO 8859-1.

The Linux kernel and the libc don't decode filenames: a filename is used
as a raw array of bytes. The common solution to support any filename is
to store filenames as bytes and not try to decode them. When displayed
to stdout, mojibake is displayed if the filename and the terminal don't
use the same encoding.

Python 3 promotes Unicode everywhere including filenames. A solution to
support filenames not decodable from the locale encoding was found: the
``surrogateescape`` error handler (`PEP 383`_), which stores undecodable
bytes as surrogate characters. This error handler is used by default for
`operating system data`_, by ``os.fsdecode()`` and ``os.fsencode()`` for
example (except on Windows which uses the ``strict`` error handler).

Standard streams
----------------

Python uses the locale encoding for standard streams: stdin, stdout and
stderr. The ``strict`` error handler is used by stdin and stdout to
prevent mojibake.

The ``backslashreplace`` error handler is used by stderr to avoid
Unicode encode errors when displaying non-ASCII text. It is especially
useful when the POSIX locale is used, because this locale usually uses
the ASCII encoding.

The problem is that `operating system data`_ like filenames are decoded
using the ``surrogateescape`` error handler (`PEP 383`_). Displaying a
filename to stdout raises a Unicode encode error if the filename
contains an undecoded byte stored as a surrogate character.

Python 3.6 now uses ``surrogateescape`` for stdin and stdout if the
POSIX locale is used: `issue #19977
<http://bugs.python.org/issue19977>`_. The idea is to pass through
`operating system data`_ even if it means mojibake, because most UNIX
applications work like that. Most UNIX applications store filenames as
bytes, usually simply because bytes are first-class citizens in the used
programming language, whereas Unicode is badly supported.

.. note::
   The encoding and/or the error handler of standard streams can be
   overridden with the ``PYTHONIOENCODING`` environment variable.


Proposal
========

Changes
-------

Add a new UTF-8 mode, disabled by default, to ignore the locale and
force the usage of the UTF-8 encoding with the ``surrogateescape`` error
handler, instead of using the locale encoding (with the ``strict`` or
``surrogateescape`` error handler depending on the case).

Basically, the UTF-8 mode behaves as Python 2: it "just works" and
doesn't bother users with encodings, but it can produce mojibake. It can
be configured as strict to prevent mojibake: the UTF-8 encoding is used
with the ``strict`` error handler for inputs and outputs, but the
``surrogateescape`` error handler is still used for `operating system
data`_.

A new ``-X utf8`` command line option and a new ``PYTHONUTF8``
environment variable are added to control the UTF-8 mode. The UTF-8 mode
is enabled by ``-X utf8`` or ``PYTHONUTF8=1``. The UTF-8 mode is
configured as strict by ``-X utf8=strict`` or ``PYTHONUTF8=strict``.
Other option values fail with an error.

The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.

The ``-X utf8`` option has priority over the ``PYTHONUTF8`` environment
variable. For example, ``PYTHONUTF8=0 python3 -X utf8`` enables the
UTF-8 mode.
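
The priority rules above can be summarized by a small sketch. It is not
taken from the reference implementation: the function name
``resolve_utf8_mode``, its parameters and the returned strings are
hypothetical, and accepting ``-X utf8=1`` as a synonym of the bare
``-X utf8`` is an assumption:

```python
def resolve_utf8_mode(xoption=None, env_value=None, posix_locale=False):
    """Return "off", "on" or "strict" following the priority rules above.

    xoption      -- True for a bare ``-X utf8``, the text after ``=`` for
                    ``-X utf8=VALUE``, None if the option is not given
    env_value    -- value of PYTHONUTF8, None if the variable is unset
    posix_locale -- True if the LC_CTYPE locale is the POSIX locale
    """
    def parse(value, source):
        if value is True or value == "1":  # "1" for -X is an assumption
            return "on"
        if value == "strict":
            return "strict"
        if value == "0":
            return "off"
        # Other option values fail with an error.
        raise ValueError("invalid %s value: %r" % (source, value))

    if xoption is not None:      # -X utf8 has priority over PYTHONUTF8
        return parse(xoption, "-X utf8")
    if env_value is not None:
        return parse(env_value, "PYTHONUTF8")
    # No explicit setting: only the POSIX locale enables the mode.
    return "on" if posix_locale else "off"


# PYTHONUTF8=0 python3 -X utf8
print(resolve_utf8_mode(xoption=True, env_value="0"))
```

With these rules, ``resolve_utf8_mode(env_value="0", posix_locale=True)``
gives ``"off"``, which corresponds to explicitly disabling the mode under
the POSIX locale.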

Encoding and error handler
--------------------------

The UTF-8 mode changes the default encoding and error handler used by
open(), os.fsdecode(), os.fsencode(), sys.stdin, sys.stdout and
sys.stderr:

============================ ======================= ========================== ==========================
Function                     Default                 UTF-8 or POSIX locale      UTF-8 Strict
============================ ======================= ========================== ==========================
open()                       locale/strict           **UTF-8/surrogateescape**  **UTF-8**/strict
os.fsdecode(), os.fsencode() locale/surrogateescape  **UTF-8**/surrogateescape  **UTF-8**/surrogateescape
sys.stdin, sys.stdout        locale/strict           **UTF-8/surrogateescape**  **UTF-8**/strict
sys.stderr                   locale/backslashreplace **UTF-8**/backslashreplace **UTF-8**/backslashreplace
============================ ======================= ========================== ==========================

By comparison, Python 3.6 uses:

============================ ======================= ==========================
Function                     Default                 POSIX locale
============================ ======================= ==========================
open()                       locale/strict           locale/strict
os.fsdecode(), os.fsencode() locale/surrogateescape  locale/surrogateescape
sys.stdin, sys.stdout        locale/strict           locale/**surrogateescape**
sys.stderr                   locale/backslashreplace locale/backslashreplace
============================ ======================= ==========================

The UTF-8 mode uses the ``surrogateescape`` error handler instead of the
strict mode for convenience: the idea is that data not encoded to UTF-8
are passed through "Python" without being modified, as raw bytes.

The ``PYTHONIOENCODING`` environment variable has priority over the
UTF-8 mode for standard streams. For example, ``PYTHONIOENCODING=latin1
python3 -X utf8`` uses the Latin1 encoding for stdin, stdout and stderr.
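
The "pass through" idea can be tried on current Python versions without
the UTF-8 mode, by wrapping in-memory byte streams with the encoding and
error handler listed in the table. This is only a sketch: the
``passthrough`` helper is hypothetical and the ``io.BytesIO`` objects
stand in for stdin and stdout:

```python
import io


def passthrough(data: bytes, errors: str = "surrogateescape") -> bytes:
    """Read text from a byte stream, edit it, and write it back as bytes."""
    # newline="" disables newline translation so bytes are left untouched.
    reader = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8",
                              errors=errors, newline="")
    sink = io.BytesIO()
    writer = io.TextIOWrapper(sink, encoding="utf-8",
                              errors=errors, newline="")
    text = reader.read()  # undecodable bytes become lone surrogates
    writer.write(text.replace("apple", "orange"))
    writer.flush()        # surrogates are encoded back to the original bytes
    return sink.getvalue()


# 0xFF is not valid UTF-8, but it crosses "Python" unmodified.
print(passthrough(b"apple \xff\n"))
```

Calling ``passthrough(b"apple \xff\n", errors="strict")`` raises
``UnicodeDecodeError`` instead, which is the behaviour the UTF-8 Strict
mode keeps for inputs and outputs.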

Encodings used by ``open()``, highest priority first:

* *encoding* and *errors* parameters (if set)
* UTF-8 mode
* os.device_encoding(fd)
* locale.getpreferredencoding(False)

Rationale
---------

The UTF-8 mode is disabled by default to keep hard Unicode errors when
encoding or decoding `operating system data`_ fails, and to keep
backward compatibility. The user is responsible for explicitly enabling
the UTF-8 mode, and so is better prepared for mojibake than if the UTF-8
mode were enabled *by default*.

The UTF-8 mode should be used on systems known to be configured with
UTF-8 where most applications speak UTF-8. It prevents Unicode errors if
the user overrides a locale *by mistake* or if a Python program is
started with no locale configured (and so with the POSIX locale).

Most UNIX applications handle `operating system data`_ as bytes, so the
``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a
limited impact on how these data are handled by the application.

The Python UTF-8 mode should help to make Python more interoperable with
the other UNIX applications in the system assuming that *UTF-8* is used
everywhere and that users *expect* UTF-8.

Ignoring the ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables
in Python is more convenient, since they are more commonly misconfigured
*by mistake* (configured to use an encoding different from UTF-8,
whereas the system uses UTF-8), rather than being misconfigured by
intent.

Expected mojibake and surrogate character issues
------------------------------------------------

The UTF-8 mode only affects code running directly in Python, especially
code written in pure Python. The other code, called "external code"
here, is not aware of this mode. Examples:

* C libraries called by Python modules like OpenSSL
* The application code when Python is embedded in an application

In the UTF-8 mode, Python uses the ``surrogateescape`` error handler
which stores bytes not decodable from UTF-8 as surrogate characters.

If the external code uses the locale and the locale encoding is UTF-8,
it should work fine.

External code using bytes
^^^^^^^^^^^^^^^^^^^^^^^^^

If the external code processes data as bytes, surrogate characters are
not an issue since they are only used inside Python. Python encodes back
surrogate characters to bytes at the edges, before calling external
code.

The UTF-8 mode can produce mojibake since neither Python nor the
external code rejects invalid bytes, but it's a deliberate choice. The
UTF-8 mode can be configured as strict to prevent mojibake and fail
early when data is not decodable from UTF-8 or not encodable to UTF-8.

External code using text
^^^^^^^^^^^^^^^^^^^^^^^^

If the external code uses a text API, for example using the ``wchar_t*``
C type, mojibake should not occur, but the external code can fail on
surrogate characters.


Use Cases
=========

The following use cases were written to help to understand the impact of
chosen encodings and error handlers on concrete examples.

The "Always work" results were written to prove the benefit of having a
UTF-8 mode which works with any data and any locale, compared to the
existing old Python versions.

The "Mojibake" column shows that ignoring the locale causes a practical
issue: the UTF-8 mode produces mojibake if the terminal doesn't use the
UTF-8 encoding.

List a directory into stdout
----------------------------

Script listing the content of the current directory into stdout::

    import os
    for name in os.listdir(os.curdir):
        print(name)

Result:

======================== ============ =========
Python                   Always work? Mojibake?
======================== ============ =========
Python 2                 **Yes**      **Yes**
Python 3                 No           No
Python 3.5, POSIX locale **Yes**      **Yes**
UTF-8 mode               **Yes**      **Yes**
UTF-8 Strict mode        No           No
======================== ============ =========

"No" means that the script can fail on decoding or encoding a filename
depending on the locale or the filename.

To be able to always work, the program must be able to produce mojibake.
Mojibake is more user friendly than an error with a truncated or empty
output.

Example with a directory which contains the file called ``b'xxx\xff'``
(the byte ``0xFF`` is invalid in UTF-8).

Default and UTF-8 Strict mode fail on ``print()`` with an encode error::

    $ python3.7 ../ls.py
    Traceback (most recent call last):
      File "../ls.py", line 5, in <module>
        print(name)
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ...

    $ python3.7 -X utf8=strict ../ls.py
    Traceback (most recent call last):
      File "../ls.py", line 5, in <module>
        print(name)
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ...

The UTF-8 mode, POSIX locale, Python 2 and the UNIX ``ls`` command work
but display mojibake::

    $ python3.7 -X utf8 ../ls.py
    xxx�

    $ LC_ALL=C /python3.6 ../ls.py
    xxx�

    $ python2 ../ls.py
    xxx�

    $ ls
    'xxx'$'\377'

List a directory into a text file
---------------------------------

Similar to the previous example, except that the listing is written into
a text file::

    import os
    names = os.listdir(os.curdir)
    with open("/tmp/content.txt", "w") as fp:
        for name in names:
            fp.write("%s\n" % name)

Result:

======================== ============ =========
Python                   Always work? Mojibake?
======================== ============ =========
Python 2                 **Yes**      **Yes**
Python 3                 No           No
Python 3.5, POSIX locale No           No
UTF-8 mode               **Yes**      **Yes**
UTF-8 Strict mode        No           No
======================== ============ =========

"Yes" implies that mojibake can be produced. "No" means that the script
can fail on decoding or encoding a filename depending on the locale or
the filename.

Typical error::

    $ LC_ALL=C python3 test.py
    Traceback (most recent call last):
      File "test.py", line 5, in <module>
        fp.write("%s\n" % name)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

Display Unicode characters into stdout
--------------------------------------

Very basic example used to illustrate a common issue: displaying the
euro sign (U+20AC: €)::

    print("euro: \u20ac")

Result:

======================== ============ =========
Python                   Always work? Mojibake?
======================== ============ =========
Python 2                 No           No
Python 3                 No           No
Python 3.5, POSIX locale No           No
UTF-8 mode               **Yes**      **Yes**
UTF-8 Strict mode        **Yes**      **Yes**
======================== ============ =========

The UTF-8 and UTF-8 Strict modes will always encode the euro sign as
UTF-8. If the terminal uses a different encoding, we get mojibake.

Replace a word in a text
------------------------

The following script replaces the word "apple" with "orange". It reads
input from stdin and writes the output into stdout::

    import sys
    text = sys.stdin.read()
    sys.stdout.write(text.replace("apple", "orange"))

Result:

======================== ============ =========
Python                   Always work? Mojibake?
======================== ============ =========
Python 2                 **Yes**      **Yes**
Python 3                 No           No
Python 3.5, POSIX locale **Yes**      **Yes**
UTF-8 mode               **Yes**      **Yes**
UTF-8 Strict mode        No           No
======================== ============ =========

Producer-consumer model using pipes
-----------------------------------

Let's say that we have a "producer" program which writes data into its
stdout and a "consumer" program which reads data from its stdin.

On a shell, such programs are run with the command::

    producer | consumer

The question is whether these programs will work with any data and any
locale. UNIX users don't expect Unicode errors, and so expect that such
programs "just work".

If the producer only produces ASCII output, no error should occur.
Let's say that the producer writes at least one non-ASCII character (at
least one byte in the range ``0x80..0xff``).

To simplify the problem, let's say that the consumer has no output
(doesn't write the result into a file or stdout).

A "Bytes producer" is an application which cannot fail with a Unicode
error and produces bytes into stdout.

Let's say that a "Bytes consumer" does not decode stdin but stores data
as bytes: such a consumer always works. Common UNIX command line tools
like ``cat``, ``grep`` or ``sed`` are in this category. Many Python 2
applications are also in this category.

"Python producer" and "Python consumer" are producer and consumer
implemented in Python.

Bytes producer, Bytes consumer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It always works, but it is out of the scope of this PEP since it doesn't
involve Python.

Python producer, Bytes consumer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Python producer::

    print("euro: \u20ac")

Result:

======================== ============ =========
Python                   Always work? Mojibake?
======================== ============ =========
Python 2                 No           No
Python 3                 No           No
Python 3.5, POSIX locale No           No
UTF-8 mode               **Yes**      **Yes**
UTF-8 Strict mode        No           No
======================== ============ =========

The question here is not if the consumer is able to decode the input,
but if Python is able to produce its output. So it's similar to the
`Display Unicode characters into stdout`_ case.

UTF-8 modes work with any locale since the consumer doesn't try to
decode its stdin.

Bytes producer, Python consumer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Python consumer::

    import sys
    text = sys.stdin.read()
    result = text.replace("apple", "orange")
    # ignore the result

Result:

======================== ============ =========
Python                   Always work? Mojibake?
======================== ============ =========
Python 2                 **Yes**      **Yes**
Python 3                 No           No
Python 3.5, POSIX locale **Yes**      **Yes**
UTF-8 mode               **Yes**      **Yes**
UTF-8 Strict mode        No           No
======================== ============ =========

Python 3 fails on decoding stdin depending on the input and the locale.

Python producer, Python consumer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Python producer::

    print("euro: \u20ac")

Python consumer::

    import sys
    text = sys.stdin.read()
    result = text.replace("apple", "orange")
    # ignore the result

Result, same Python version used for the producer and the consumer:

======================== ============ =========
Python                   Always work? Mojibake?
======================== ============ =========
Python 2                 No           No
Python 3                 No           No
Python 3.5, POSIX locale No           No
UTF-8 mode               **Yes**      **Yes**
UTF-8 Strict mode        No           No
======================== ============ =========

This case combines a Python producer with a Python consumer, so it only
works where both sides work: the result is the intersection of `Python
producer, Bytes consumer`_ and `Bytes producer, Python consumer`_.


Backward Compatibility
======================

The main backward incompatible change is that the UTF-8 encoding is now
used by default if the locale is POSIX. Since the UTF-8 encoding is used
with the ``surrogateescape`` error handler, encoding errors should not
occur and so the change should not break applications.

The more likely source of trouble comes from external libraries. Python
can successfully decode data from UTF-8, but a library using the locale
encoding can fail to encode the decoded text back to bytes. Hopefully,
encoding text in a library is a rare operation. Very few libraries
expect text, most libraries expect bytes and even manipulate bytes
internally.

The PEP only changes the default behaviour if the locale is POSIX. For
other locales, the *default* behaviour is unchanged.


Alternatives
============

Don't modify the encoding of the POSIX locale
---------------------------------------------

A first version of the PEP did not change the encoding and error handler
used for the POSIX locale.

The problem is that adding the ``-X utf8`` command line option or
setting the ``PYTHONUTF8`` environment variable is not possible in some
cases, or at least not convenient.

Moreover, many users simply expect that Python 3 behaves as Python 2: it
doesn't bother them with encodings and "just works" in all cases. These
users don't worry about mojibake, or even expect mojibake because of
complex documents using multiple incompatible encodings.

Always use UTF-8
----------------

Python already always uses the UTF-8 encoding on Mac OS X, Android and
Windows. Since UTF-8 became the de facto encoding, it makes sense to
always use it on all platforms with any locale.

The risk is to introduce mojibake if the locale uses a different
encoding, especially for locales other than the POSIX locale.

Force UTF-8 for the POSIX locale
--------------------------------

An alternative to always using UTF-8 in any case is to only use UTF-8
when the ``LC_CTYPE`` locale is the POSIX locale.

The `PEP 538`_ "Coercing the legacy C locale to C.UTF-8" of Nick Coghlan
proposes to implement that using the ``C.UTF-8`` locale.

Use the strict error handler for operating system data
------------------------------------------------------

Using the ``surrogateescape`` error handler for `operating system data`_
creates surprising surrogate characters. No Python codec (except
``utf-7``) accepts surrogates, and so encoding text coming from the
operating system is likely to raise an error. The problem is that the
error comes late, very far from where the data was read.

The ``strict`` error handler can be used instead to decode
(``os.fsdecode()``) and encode (``os.fsencode()``) operating system
data, to raise encoding errors as soon as possible. It helps to find
bugs more quickly.
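
The "error comes late" problem can be reproduced with plain codec calls.
In this sketch, ``decode_os_data`` and ``can_encode`` are hypothetical
helpers, and UTF-8 is hard-coded instead of the real filesystem
encoding:

```python
def decode_os_data(raw: bytes, errors: str = "surrogateescape") -> str:
    """Decode bytes the way operating system data is decoded, assuming UTF-8."""
    return raw.decode("utf-8", errors)


def can_encode(text: str, encoding: str) -> bool:
    """Tell whether text can be encoded with the strict error handler."""
    try:
        text.encode(encoding)  # "strict" is the default error handler
    except UnicodeEncodeError:
        return False
    return True


# Decoding succeeds: the byte 0xFF is stored as the surrogate U+DCFF.
name = decode_os_data(b"xxx\xff")

# The failure only shows up later, when some other code encodes the text.
print(can_encode(name, "utf-8"))    # False
print(can_encode(name, "latin-1"))  # False

# The strict alternative fails immediately, where the data is read.
try:
    decode_os_data(b"xxx\xff", errors="strict")
except UnicodeDecodeError as exc:
    print("early error:", exc.reason)
```

Encoding ``name`` back with the same ``surrogateescape`` handler restores
the original bytes, which is why the handler is kept for operating
system data.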

The main drawback of this strategy is that it doesn't work in practice.
Python 3 is designed on top of Unicode strings. Most functions expect
Unicode and produce Unicode. Even if many operating system functions
have two flavors, bytes and Unicode, the Unicode flavor is used in most
cases. There are good reasons for that: Unicode is more convenient in
Python 3 and using Unicode helps to support the full Unicode Character
Set (UCS) on Windows (even if Python now uses UTF-8 since Python 3.6,
see the `PEP 528`_ and the `PEP 529`_).

For example, if ``os.fsdecode()`` uses ``utf8/strict``,
``os.listdir(str)`` fails to list filenames of a directory if a single
filename is not decodable from UTF-8. As a consequence,
``shutil.rmtree(str)`` fails to remove a directory. Undecodable
filenames, environment variables, etc. are simply too common to make
this alternative viable.


Links
=====

PEPs:

* `PEP 538 <https://www.python.org/dev/peps/pep-0538/>`_:
  "Coercing the legacy C locale to C.UTF-8"
* `PEP 529 <https://www.python.org/dev/peps/pep-0529/>`_:
  "Change Windows filesystem encoding to UTF-8"
* `PEP 528 <https://www.python.org/dev/peps/pep-0528/>`_:
  "Change Windows console encoding to UTF-8"
* `PEP 383 <https://www.python.org/dev/peps/pep-0383/>`_:
  "Non-decodable Bytes in System Character Interfaces"

Main Python issues:

* `Issue #29240: Implementation of the PEP 540: Add a new UTF-8 mode
  <http://bugs.python.org/issue29240>`_
* `Issue #28180: sys.getfilesystemencoding() should default to utf-8
  <http://bugs.python.org/issue28180>`_
* `Issue #19977: Use "surrogateescape" error handler for sys.stdin and
  sys.stdout on UNIX for the C locale
  <http://bugs.python.org/issue19977>`_
* `Issue #19847: Setting the default filesystem-encoding
  <http://bugs.python.org/issue19847>`_
* `Issue #8622: Add PYTHONFSENCODING environment variable
  <https://bugs.python.org/issue8622>`_: added but reverted because of
  many issues, read the `Inconsistencies if locale and filesystem
  encodings are different
  <https://mail.python.org/pipermail/python-dev/2010-October/104509.html>`_
  thread on the python-dev mailing list

Incomplete list of Python issues related to Unicode errors, especially
with the POSIX locale:

* 2016-12-22: `LANG=C python3 -c "import os; os.path.exists('\xff')"
  <http://bugs.python.org/issue29042#msg283821>`_
* 2014-07-20: `issue #22016: Add a new 'surrogatereplace' output only
  error handler <http://bugs.python.org/issue22016>`_
* 2014-04-27: `Issue #21368: Check for systemd locale on startup if
  current locale is set to POSIX <http://bugs.python.org/issue21368>`_
  -- read manually /etc/locale.conf when the locale is POSIX
* 2014-01-21: `Issue #20329: zipfile.extractall fails in Posix shell
  with utf-8 filename <http://bugs.python.org/issue20329>`_
* 2013-11-30: `Issue #19846: Python 3 raises Unicode errors with the C
  locale <http://bugs.python.org/issue19846>`_
* 2010-05-04: `Issue #8610: Python3/POSIX: errors if file system
  encoding is None <http://bugs.python.org/issue8610>`_
* 2013-08-12: `Issue #18713: Clearly document the use of
  PYTHONIOENCODING to set surrogateescape
  <http://bugs.python.org/issue18713>`_
* 2013-09-27: `Issue #19100: Use backslashreplace in pprint
  <http://bugs.python.org/issue19100>`_
* 2012-01-05: `Issue #13717: os.walk() + print fails with
  UnicodeEncodeError <http://bugs.python.org/issue13717>`_
* 2011-12-20: `Issue #13643: 'ascii' is a bad filesystem default
  encoding <http://bugs.python.org/issue13643>`_
* 2011-03-16: `issue #11574: TextIOWrapper should use UTF-8 by default
  for the POSIX locale <http://bugs.python.org/issue11574>`_, thread on
  python-dev: `Low-Level Encoding Behavior on Python 3
  <https://mail.python.org/pipermail/python-dev/2011-March/109361.html>`_
* 2010-04-26: `Issue #8533: regrtest: use backslashreplace error handler
  for stdout <http://bugs.python.org/issue8533>`_, regrtest fails with
  Unicode encode error if the locale is POSIX

Some issues are real bugs in applications which must explicitly set the
encoding.
Well, it just works in the common case (locale configured correctly), so
what? But the program "suddenly" fails when the POSIX locale is used
(probably for bad reasons). Such bugs are not well understood by users.
Examples of such issues:

* 2013-11-21: `pip: open() uses the locale encoding to parse Python
  script, instead of the encoding cookie
  <http://bugs.python.org/issue19685>`_ -- pip must use the encoding
  cookie to read a Python source code file
* 2011-01-21: `IDLE 3.x can crash decoding recent file list
  <http://bugs.python.org/issue10974>`_


Prior Art
=========

Perl has a ``-C`` command line option and a ``PERLUNICODE`` environment
variable to force UTF-8: see `perlrun
<http://perldoc.perl.org/perlrun.html>`_. It is possible to configure
UTF-8 per standard stream, on input and output streams, etc.


Copyright
=========

This document has been placed in the public domain.

On 12 January 2017 at 08:15, Victor Stinner <victor.stinner@gmail.com> wrote:
Thanks Victor, I really like this version, and the next time I update PEP 538 I'm going to replace the en_US.UTF-8 fallback in the current proposal with a dependency on this PEP. My one comment would be that in the summary tables, "Always works" isn't the right phrase to describe potentially corrupting text data instead of throwing an exception :) Instead, I think it would make sense to retitle that column as "Exception?" such that: * the ideal state is "No exception, no mojibake", which is what we'll now get when assuming (or forcing) UTF-8 is the correct thing to do, and will also continue to get when the locale is set appropriately (e.g. when handling GB18030 on Chinese systems) * the problematic behaviour of earlier Python 3.x versions was "Yes exception, no mojibake" when it assumed ASCII instead of UTF-8 * the problematic behaviour of Python 2.x in the specific examples given is "No exception, yes mojibake", and potentially even "Yes exception, yes mojibake" in cases where the implicit ASCII-based decoding could be encountered PEP 538 would then be a follow-on PEP that attempts to resolve the ASCII locale encoding problem not only for CPython itself, but also for any other C/C++ components sharing the same process, or launched in subprocesses that inherit the current environment. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 12 January 2017 at 18:45, INADA Naoki <songofacandy@gmail.com> wrote:
Yep, "LC_CTYPE=en_US.UTF-8" is what the current version of PEP 538 has as a fallback if neither C.UTF-8 nor C.utf8 is available. However, that's also the part of PEP 538 that I think can just be dropped entirely and replaced with PEP 540's "assume UTF-8" behaviour instead. I was already considering that as an option when I last updated the PEP (see https://www.python.org/dev/peps/pep-0538/#relationship-with-other-peps ), and now that Victor actually has a working reference implementation for the PEP 540 approach, I think it's the right way to go. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

2017-01-12 9:45 GMT+01:00 INADA Naoki <songofacandy@gmail.com>:
Does it work to use a locale with encoding A for LC_CTYPE and a locale with encoding B for LC_MESSAGES (and others)? Is there a risk of mojibake? Or do we expect that the POSIX locale speaks ASCII, so that it should work to use UTF-8 for LC_CTYPE, since UTF-8 is able to decode messages encoded in ASCII? Victor

On Thu, Jan 12, 2017 at 04:25:56PM +0100, Victor Stinner <victor.stinner@gmail.com> wrote:
It does when B is a subset of A (ascii and koi8; ascii and utf8, e.g.)
That works for me:

$ echo $LC_CTYPE
ru_RU.UTF-8
$ echo $LC_COLLATE
ru_RU.UTF-8
$ echo $LANG
C
$ date
Thu Jan 12 19:06:13 MSK 2017
Victor
Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
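
The "B is a subset of A" case can be checked directly in Python 3. This is only a small sketch: the sample bytes are the ASCII output of `date` quoted above, and the list of encodings is an illustrative choice, not something taken from either PEP.

```python
# Illustrative sample: the ASCII-only `date` output quoted in the message above.
sample = b"Thu Jan 12 19:06:13 MSK 2017"

# Pure ASCII bytes decode to the same text under any ASCII-compatible encoding,
# which is why mixing an ASCII locale with UTF-8 or KOI8-R is harmless.
decoded = {enc: sample.decode(enc) for enc in ("ascii", "utf-8", "koi8-r", "latin-1")}

all_equal = len(set(decoded.values())) == 1
print(all_equal)
```

The check only says something about bytes below 0x80; as soon as one locale category produces non-ASCII bytes, the encodings diverge.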

2017-01-12 17:10 GMT+01:00 Oleg Broytman <phd@phdru.name>:
My question is more when A and B encodings are not compatible.

Ah yes, date, thank you for the example. Here is my example using LC_TIME locale to format a date and LC_CTYPE to decode a byte string:

date.py:
---
import locale, time
locale.setlocale(locale.LC_ALL, "")
b = time.strftime("%a")
encoding = locale.getpreferredencoding()
try:
    u = b.decode(encoding)
except UnicodeError:
    u = '<failed to decode>'
else:
    u = repr(u)
print("bytes: %r, text: %s, encoding: %r" % (b, u, encoding))
---

When all locales are the same, it works fine: 목 (U+baa9) is the expected result

$ LC_TIME=ko_KR.euckr LANG=ko_KR.euckr python2 date.py
bytes: '\xb8\xf1', text: u'\ubaa9', encoding: 'EUC-KR'

You get mojibake if LC_CTYPE uses the Latin1 encoding whereas LC_TIME uses the EUC-KR encoding: you get "¸ñ" (U+00b8, U+00f1).

$ LC_TIME=ko_KR.euckr LANG=fr_FR python2 date.py
bytes: '\xb8\xf1', text: u'\xb8\xf1', encoding: 'ISO-8859-1'

The program can also fail with UnicodeDecodeError:

$ LC_TIME=ko_KR.euckr LANG=fr_FR.UTF-8 python2 date.py
bytes: '\xb8\xf1', text: <failed to decode>, encoding: 'UTF-8'

Well, since we are talking about the POSIX locale which usually uses ASCII, it shouldn't be an issue in practice for the PEP 538. I was just curious :-)

Victor
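
The three outcomes above can be reproduced in Python 3 without installing any locale, by decoding the same two bytes by hand. This is a sketch, not part of the PEP: `try_decode` is a hypothetical helper, and the codec names are the Python spellings of the encodings reported by date.py.

```python
# The two bytes printed by date.py above when LC_TIME=ko_KR.euckr.
raw = b"\xb8\xf1"

def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text, or None on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# Same bytes, three different LC_CTYPE encodings.
results = {enc: try_decode(raw, enc) for enc in ("euc_kr", "latin-1", "utf-8")}

for enc, text in results.items():
    print(enc, ascii(text))
```

EUC-KR gives the expected character, Latin-1 silently gives two unrelated characters (mojibake), and UTF-8 refuses to decode at all.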

On Thu, Jan 12, 2017 at 06:10:35PM +0100, Victor Stinner <victor.stinner@gmail.com> wrote:
Of course you get mojibake. You can get mojibake even with compatible encodings:

$ echo $LC_CTYPE
ru_RU.KOI8-R
$ LC_TIME=ru_RU.UTF-8 date
п╖я┌ я▐п╫п╡ 12 20:14:08 MSK 2017
^^^^^^^^^^^^^^^^^ mojibake!

$ echo $LC_CTYPE
ru_RU.UTF-8
$ LC_TIME=ru_RU.KOI8-R date
?? ??? 12 20:15:20 MSK 2017
^^^^^^ mojibake!
Victor
Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
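
Both directions of this mix-up can be simulated in Python 3. This is a sketch under one assumption: the original date prefix is taken to be "Чт янв" (Thu Jan), which is reconstructed from the mojibake above and not stated in the message. The terminal in the second case showed "?" where Python's "replace" handler produces U+FFFD, so the second result is only analogous.

```python
# Assumed original text ("Thu Jan" in Russian), reconstructed from the mojibake above.
text = "Чт янв"

# LC_TIME=UTF-8 output read through a KOI8-R LC_CTYPE: every byte maps to
# some character, so the result is silent mojibake.
utf8_seen_as_koi8 = text.encode("utf-8").decode("koi8-r")

# LC_TIME=KOI8-R output read as UTF-8: the bytes are invalid UTF-8, so each
# one becomes a replacement character.
koi8_seen_as_utf8 = text.encode("koi8-r").decode("utf-8", errors="replace")

print(utf8_seen_as_koi8)
print(ascii(koi8_seen_as_utf8))
```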

On Thu, Jan 12, 2017, at 12:10, Victor Stinner wrote:
Time and messages seem to behave differently - everything I tested (including python 2 os.strerror) seems to ignore the LC_MESSAGES encoding and use the LC_CTYPE encoding, including resulting in a bunch of question marks when it's "C".

On Thu, Jan 12, 2017 at 01:04:43PM -0500, Random832 <random832@fastmail.com> wrote:
Works for me as expected:

$ echo $LC_CTYPE
ru_RU.KOI8-R
$ LC_MESSAGES=ru_RU.KOI8-R mc

mc speaks to me in Russian...

$ LC_MESSAGES=C mc

...English.

Oleg.
--
Oleg Broytman http://phdru.name/ phd@phdru.name
Programmers don't die, they just GOSUB without RETURN.

On Thu, Jan 12, 2017 at 06:14:58PM -0500, Random832 <random832@fastmail.com> wrote:
$ LC_CTYPE=C LC_MESSAGES=ru_RU.KOI8-R mc

Brouhaha! mc tries to talk in Russian but converts Russian texts to ascii. Everything is "?????" :-D

$ echo $LC_CTYPE
ru_RU.UTF-8
$ LC_MESSAGES=ru_RU.UTF-8 mc

Russian in utf-8, no problem.

What did you expect? I think I did it not the way you wanted.

Oleg.
--
Oleg Broytman http://phdru.name/ phd@phdru.name
Programmers don't die, they just GOSUB without RETURN.

On Fri, Jan 13, 2017 at 11:10:44AM +0900, INADA Naoki <songofacandy@gmail.com> wrote:
Using UTF-8 everywhere, all the time, is the best way to avoid mojibake.
When you're alone in the Universe -- yes, it helps. But if other people, protocols and data formats use different encodings it doesn't matter which encoding you use -- you'll have mojibake anyway. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

INADA Naoki writes:
Please stop repeating this; it is invalid as an argument. Everybody using Python 3 (which is the only topic for this list) already knows that use of a common universal encoding -- in practice, UTF-8 -- is the way forward (and Windows users also know that the Windows API is a major exception, which proves the rule by being a different *Unicode* transformation format). It is not part of this discussion. The problem is that not everybody does this yet, even today (in fact, that's the source of the problem on containers, people are using the C locale, not C.utf-8!), and some of us have to use or interoperate with systems that don't, even if our own systems do. If your position really is "Screw them, they're stupid -- let them fix their broken systems, it's not our problem", I can understand that but we'll have to agree to disagree. My position is that we need to (1) determine if this change actually can cause problems for Python users on such systems or interoperating with such systems (2) determine how serious the problems are with the "force UTF-8 in certain situations" approach vs. the status quo (3) compare the damage both ways, (4) if there is a conflict, consider whether a modified proposal would work as well or better in more circumstances. I think that is consistent with past Python practice on encoding issues.

On Fri, Jan 13, 2017 at 11:40 AM, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Sorry, I meant "If LC_CTYPE is C or C.UTF-8, all other LC_* should be ASCII or UTF-8. Otherwise, mojibake is not avoidable regardless of PEP 538 or 540." I didn't mean forcing C.UTF-8 for LC_TIME too. As you can read, no one proposes "Drop non-UTF-8 locale support completely".
No. The C locale doesn't forbid using UTF-8. It doesn't determine the terminal encoding either. It's just Python's behavior, and it's unlike many other commonly used languages. That's why many people are bitten by this problem. For example, vim uses latin-1 by default for the C locale. Since the C locale says nothing about terminal/file/stdio encoding, using the most common byte-transparent encoding seems a reasonable choice. Of course, vim can be configured to use UTF-8, regardless of LC_CTYPE.
I never said such a thing.
Sure. That's what I tested. "If people are using a non-UTF-8 LC_TIME and LC_CTYPE=C, there is mojibake already. This change doesn't break anything."
Sure. And if there is an unavoidable conflict, the default behavior should favor the more common usage.
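
As an aside on "byte transparent": the property is that arbitrary bytes survive a decode/encode round trip. A minimal Python 3 sketch follows; the comparison with UTF-8 plus the surrogateescape error handler is added here because that is the handler PEP 540 keeps for operating system data, not because it was part of this message.

```python
all_bytes = bytes(range(256))

# latin-1 maps each byte to exactly one code point, so nothing is lost.
latin1_round_trip = all_bytes.decode("latin-1").encode("latin-1")

# UTF-8 alone rejects many of these bytes; with surrogateescape the
# undecodable ones are smuggled through as lone surrogates and restored.
utf8_round_trip = (
    all_bytes.decode("utf-8", "surrogateescape")
             .encode("utf-8", "surrogateescape")
)

# ASCII with the default strict handler is not byte transparent.
try:
    all_bytes.decode("ascii")
    ascii_strict_ok = True
except UnicodeDecodeError:
    ascii_strict_ok = False

print(latin1_round_trip == all_bytes, utf8_round_trip == all_bytes, ascii_strict_ok)
```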

2017-01-12 9:45 GMT+01:00 INADA Naoki <songofacandy@gmail.com>:
As I described in another thread, LC_COLLATE may cause unintentional performance regressions and behavior changes.
Since Python 3 uses mostly text, not bytes, LC_COLLATE should not really impact Python applications.

Locales set by setlocale() are not inherited by child processes. At least, I understand that setting the locale in Python doesn't impact the performance of child processes:

--
import locale, subprocess
locale.setlocale(locale.LC_COLLATE, "fr_FR.UTF-8")
subprocess.call("sort long_text.txt > /dev/null", shell=True)
--

But the LC_COLLATE locale can be used by C libraries called from Python through Python extensions implemented in C.

Victor
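
The point about child processes can be checked with a short script. This is a sketch with some assumptions: it uses the "C" locale because it is always available (the fr_FR.UTF-8 locale above may not be installed), `child_lc_collate` is a hypothetical helper, and the child is a second Python interpreter that only reports the LC_COLLATE environment variable it received.

```python
import locale
import os
import subprocess
import sys

# The child only prints the LC_COLLATE environment variable it inherited.
CHILD = "import os; print(os.environ.get('LC_COLLATE', '<unset>'))"

def child_lc_collate(env=None):
    """Hypothetical helper: run a child interpreter and return what it saw."""
    out = subprocess.check_output([sys.executable, "-c", CHILD], env=env)
    return out.decode("utf-8", "replace").strip()

before = child_lc_collate()
locale.setlocale(locale.LC_COLLATE, "C")  # changes this process only
after = child_lc_collate()

# To influence the child, the environment has to be passed explicitly.
explicit = child_lc_collate(env=dict(os.environ, LC_COLLATE="C"))

print(before == after, explicit)
```

setlocale() changes the C library state of the calling process and leaves os.environ untouched, so a child sees a different collation only when the environment variable itself is changed.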

participants (7): INADA Naoki, Nick Coghlan, Oleg Broytman, Random832, Stephen J. Turnbull, Terry Reedy, Victor Stinner