<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto"><br><br><blockquote type="cite"><div><span></span><br><span>I made the following two changes to PEP 540:</span><br><span></span><br><span>* open() error handler remains "strict"</span><br><span>* remove the "Strict UTF8 mode" which doesn't make much sense anymore</span><br></div></blockquote><div><br></div><div>+1. Ignore my previous note.</div><div><br></div><div>-CHB</div><br><blockquote type="cite"><div><span></span><br><span>I wrote the Strict UTF-8 mode when open() used the surrogateescape error</span><br><span>handler in the UTF-8 mode. I don't think that a Strict UTF-8 mode is</span><br><span>required just to change the error handler of stdin and stdout. Well,</span><br><span>read the "Passthrough undecodable bytes: surrogateescape" section of</span><br><span>the PEP rationale :-)</span><br><span></span><br><span></span><br><span><a href="https://www.python.org/dev/peps/pep-0540/">https://www.python.org/dev/peps/pep-0540/</a></span><br><span></span><br><span>Victor</span><br><span></span><br><span></span><br><span>PEP: 540</span><br><span>Title: Add a new UTF-8 mode</span><br><span>Version: $Revision$</span><br><span>Last-Modified: $Date$</span><br><span>Author: Victor Stinner <<a href="mailto:victor.stinner@gmail.com">victor.stinner@gmail.com</a>></span><br><span>BDFL-Delegate: INADA Naoki</span><br><span>Status: Draft</span><br><span>Type: Standards Track</span><br><span>Content-Type: text/x-rst</span><br><span>Created: 5-January-2016</span><br><span>Python-Version: 3.7</span><br><span></span><br><span></span><br><span>Abstract</span><br><span>========</span><br><span></span><br><span>Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and</span><br><span>change ``stdin`` and ``stdout`` error handlers to ``surrogateescape``.</span><br><span>This mode is enabled by default in the POSIX locale, but otherwise</span><br><span>disabled by
default.</span><br><span></span><br><span>The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment</span><br><span>variable are added to control the UTF-8 mode.</span><br><span></span><br><span></span><br><span>Rationale</span><br><span>=========</span><br><span></span><br><span>Locale encoding and UTF-8</span><br><span>-------------------------</span><br><span></span><br><span>Python 3.6 uses the locale encoding for filenames, environment</span><br><span>variables, standard streams, etc. The locale encoding is inherited from</span><br><span>the locale; the encoding and the locale are tightly coupled.</span><br><span></span><br><span>Many users inherit the ASCII encoding from the POSIX locale, aka the "C"</span><br><span>locale, but are unable to change the locale for various reasons. This</span><br><span>encoding is very limited in terms of Unicode support: any non-ASCII</span><br><span>character is likely to cause trouble.</span><br><span></span><br><span>It is not easy to get the expected locale. Locales don't have exactly the</span><br><span>same name on all Linux distributions, FreeBSD, macOS, etc. Some</span><br><span>locales, like the recent ``C.UTF-8`` locale, are only supported by a few</span><br><span>platforms. For example, an SSH connection can use a different encoding</span><br><span>than the filesystem or terminal encoding of the local host.</span><br><span></span><br><span>On the other hand, Python 3.6 already uses UTF-8 by default on</span><br><span>macOS, Android and Windows (PEP 529) for most functions, except for</span><br><span>``open()``. UTF-8 is also the default encoding of Python scripts, XML</span><br><span>and JSON file formats.
The Go programming language uses UTF-8 for</span><br><span>strings.</span><br><span></span><br><span>When all data are stored as UTF-8 but the locale is often misconfigured,</span><br><span>an obvious solution is to ignore the locale and use UTF-8.</span><br><span></span><br><span>PEP 538 attempts to mitigate this problem by coercing the C locale</span><br><span>to a UTF-8 based locale when one is available, but that isn't a</span><br><span>universal solution. For example, CentOS 7's container images default</span><br><span>to the POSIX locale, and don't include the C.UTF-8 locale, so PEP 538's</span><br><span>locale coercion is ineffective.</span><br><span></span><br><span></span><br><span>Passthrough undecodable bytes: surrogateescape</span><br><span>----------------------------------------------</span><br><span></span><br><span>When decoding bytes from UTF-8 using the ``strict`` error handler, which</span><br><span>is the default, Python 3 raises a ``UnicodeDecodeError`` on the first</span><br><span>undecodable byte.</span><br><span></span><br><span>Unix command line tools like ``cat`` or ``grep`` and most Python 2</span><br><span>applications simply do not have this class of bugs: they don't decode</span><br><span>data, but process data as a raw byte sequence.</span><br><span></span><br><span>Python 3 already has a solution to behave like Unix tools and Python 2:</span><br><span>the ``surrogateescape`` error handler (:pep:`383`). It allows processing</span><br><span>data "as bytes" while using Unicode in practice (undecodable bytes are</span><br><span>stored as surrogate characters).</span><br><span></span><br><span>The UTF-8 mode uses the ``surrogateescape`` error handler for ``stdin``</span><br><span>and ``stdout`` since these streams are commonly associated with Unix</span><br><span>command line tools.</span><br><span></span><br><span>However, users have different expectations for files. Files are expected</span><br><span>to be properly encoded.
Python is expected to fail early when ``open()``</span><br><span>is called with the wrong options, like opening a JPEG picture in text</span><br><span>mode. The ``open()`` default error handler remains ``strict`` for these</span><br><span>reasons.</span><br><span></span><br><span></span><br><span>No change by default for best backward compatibility</span><br><span>----------------------------------------------------</span><br><span></span><br><span>While UTF-8 is perfect in most cases, sometimes the locale encoding is</span><br><span>actually the best encoding.</span><br><span></span><br><span>This PEP changes the behaviour for the POSIX locale since this locale</span><br><span>usually gives the ASCII encoding, whereas UTF-8 is a much better choice.</span><br><span>It does not change the behaviour for other locales to prevent any risk</span><br><span>of regression.</span><br><span></span><br><span>As users are responsible for explicitly enabling the new UTF-8 mode, they</span><br><span>are responsible for any potential mojibake issues caused by this mode.</span><br><span></span><br><span></span><br><span>Proposal</span><br><span>========</span><br><span></span><br><span>Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and</span><br><span>change ``stdin`` and ``stdout`` error handlers to ``surrogateescape``.</span><br><span>This mode is enabled by default in the POSIX locale, but otherwise</span><br><span>disabled by default.</span><br><span></span><br><span>The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment</span><br><span>variable are added. The UTF-8 mode is enabled by ``-X utf8`` or</span><br><span>``PYTHONUTF8=1``.</span><br><span></span><br><span>The POSIX locale enables the UTF-8 mode.
In this case, the UTF-8 mode</span><br><span>can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.</span><br><span></span><br><span>For standard streams, the ``PYTHONIOENCODING`` environment variable has</span><br><span>priority over the UTF-8 mode.</span><br><span></span><br><span>On Windows, the ``PYTHONLEGACYWINDOWSFSENCODING`` environment variable</span><br><span>(:pep:`529`) has priority over the UTF-8 mode.</span><br><span></span><br><span></span><br><span>Backward Compatibility</span><br><span>======================</span><br><span></span><br><span>The only backward incompatible change is that the UTF-8 encoding is now</span><br><span>used for the POSIX locale.</span><br><span></span><br><span></span><br><span>Annex: Encodings And Error Handlers</span><br><span>===================================</span><br><span></span><br><span>The UTF-8 mode changes the default encoding and error handler used by</span><br><span>``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``,</span><br><span>``sys.stdout`` and ``sys.stderr``.</span><br><span></span><br><span>Encoding and error handler</span><br><span>--------------------------</span><br><span></span><br><span>============================ =======================</span><br><span>==========================</span><br><span>Function Default UTF-8 mode or</span><br><span>POSIX locale</span><br><span>============================ =======================</span><br><span>==========================</span><br><span>open() locale/strict **UTF-8**/strict</span><br><span>os.fsdecode(), os.fsencode() locale/surrogateescape **UTF-8**/surrogateescape</span><br><span>sys.stdin, sys.stdout locale/strict **UTF-8/surrogateescape**</span><br><span>sys.stderr locale/backslashreplace</span><br><span>**UTF-8**/backslashreplace</span><br><span>============================ =======================</span><br><span>==========================</span><br><span></span><br><span>By comparison, Python 3.6
uses:</span><br><span></span><br><span>============================ =======================</span><br><span>==========================</span><br><span>Function Default POSIX locale</span><br><span>============================ =======================</span><br><span>==========================</span><br><span>open() locale/strict locale/strict</span><br><span>os.fsdecode(), os.fsencode() locale/surrogateescape locale/surrogateescape</span><br><span>sys.stdin, sys.stdout locale/strict</span><br><span>locale/**surrogateescape**</span><br><span>sys.stderr locale/backslashreplace locale/backslashreplace</span><br><span>============================ =======================</span><br><span>==========================</span><br><span></span><br><span>Encoding and error handler on Windows</span><br><span>-------------------------------------</span><br><span></span><br><span>On Windows, the encodings and error handlers are different:</span><br><span></span><br><span>============================ =======================</span><br><span>========================== ==========================</span><br><span>Function Default Legacy Windows</span><br><span>FS encoding UTF-8 mode</span><br><span>============================ =======================</span><br><span>========================== ==========================</span><br><span>open() mbcs/strict mbcs/strict</span><br><span> **UTF-8**/strict</span><br><span>os.fsdecode(), os.fsencode() UTF-8/surrogatepass</span><br><span>**mbcs/replace** UTF-8/surrogatepass</span><br><span>sys.stdin, sys.stdout UTF-8/surrogateescape</span><br><span>UTF-8/surrogateescape UTF-8/surrogateescape</span><br><span>sys.stderr UTF-8/backslashreplace</span><br><span>UTF-8/backslashreplace UTF-8/backslashreplace</span><br><span>============================ =======================</span><br><span>========================== ==========================</span><br><span></span><br><span>By comparison, Python 3.6 
uses:</span><br><span></span><br><span>============================ =======================</span><br><span>==========================</span><br><span>Function Default Legacy Windows</span><br><span>FS encoding</span><br><span>============================ =======================</span><br><span>==========================</span><br><span>open() mbcs/strict mbcs/strict</span><br><span>os.fsdecode(), os.fsencode() UTF-8/surrogatepass **mbcs/replace**</span><br><span>sys.stdin, sys.stdout UTF-8/surrogateescape UTF-8/surrogateescape</span><br><span>sys.stderr UTF-8/backslashreplace UTF-8/backslashreplace</span><br><span>============================ =======================</span><br><span>==========================</span><br><span></span><br><span>The "Legacy Windows FS encoding" is enabled by the</span><br><span>``PYTHONLEGACYWINDOWSFSENCODING`` environment variable.</span><br><span></span><br><span>If stdin and/or stdout is redirected to a pipe, ``sys.stdin`` and/or</span><br><span>``sys.stdout`` use ``mbcs`` encoding by default rather than UTF-8. But</span><br><span>in the UTF-8 mode, ``sys.stdin`` and ``sys.stdout`` always use the UTF-8</span><br><span>encoding.</span><br><span></span><br><span>.. note::</span><br><span> There is no POSIX locale on Windows. The ANSI code page is used as the</span><br><span> locale encoding, and this code page never uses the ASCII encoding.</span><br><span></span><br><span></span><br><span>Annex: Differences between PEP 538 and PEP 540</span><br><span>==============================================</span><br><span></span><br><span>PEP 538's locale coercion is only effective if a suitable UTF-8</span><br><span>based locale is available as a coercion target. PEP 540's</span><br><span>UTF-8 mode can be enabled even for operating systems that don't</span><br><span>provide a suitable platform locale (such as CentOS 7).</span><br><span></span><br><span>PEP 538 only changes the interpreter's behaviour for the C locale.
While the</span><br><span>new UTF-8 mode of this PEP is only enabled by default in the C locale, it can</span><br><span>also be enabled manually for any other locale.</span><br><span></span><br><span>PEP 538 is implemented with ``setlocale(LC_CTYPE, "<coercion target>")`` and</span><br><span>``setenv("LC_CTYPE", "<coercion target>")``, so any non-Python code running</span><br><span>in the process and any subprocesses that inherit the environment are impacted</span><br><span>by the change. PEP 540 is implemented in Python internals and ignores the</span><br><span>locale: non-Python code running in the same process is not aware of the</span><br><span>"Python UTF-8 mode". The benefit of the PEP 538 approach is that it helps</span><br><span>ensure that encoding handling in binary extension modules and subprocesses</span><br><span>is consistent with CPython's encoding handling. The upside of the PEP 540</span><br><span>approach is that it allows an embedding application to change the</span><br><span>interpreter's behaviour without having to change the process global</span><br><span>locale settings.</span><br><span></span><br><span></span><br><span>Links</span><br><span>=====</span><br><span></span><br><span>* `bpo-29240: Implementation of the PEP 540: Add a new UTF-8 mode</span><br><span> <<a href="http://bugs.python.org/issue29240">http://bugs.python.org/issue29240</a>>`_</span><br><span>* `PEP 538 <<a href="https://www.python.org/dev/peps/pep-0538/">https://www.python.org/dev/peps/pep-0538/</a>>`_:</span><br><span> "Coercing the legacy C locale to C.UTF-8"</span><br><span>* `PEP 529 <<a href="https://www.python.org/dev/peps/pep-0529/">https://www.python.org/dev/peps/pep-0529/</a>>`_:</span><br><span> "Change Windows filesystem encoding to UTF-8"</span><br><span>* `PEP 528 <<a href="https://www.python.org/dev/peps/pep-0528/">https://www.python.org/dev/peps/pep-0528/</a>>`_:</span><br><span> "Change Windows console encoding to UTF-8"</span><br><span>* `PEP 383 <<a 
href="https://www.python.org/dev/peps/pep-0383/">https://www.python.org/dev/peps/pep-0383/</a>>`_:</span><br><span> "Non-decodable Bytes in System Character Interfaces"</span><br><span></span><br><span></span><br><span>Post History</span><br><span>============</span><br><span></span><br><span>* 2017-12: `[Python-Dev] PEP 540: Add a new UTF-8 mode</span><br><span> <<a href="https://mail.python.org/pipermail/python-dev/2017-December/151054.html">https://mail.python.org/pipermail/python-dev/2017-December/151054.html</a>>`_</span><br><span>* 2017-04: `[Python-Dev] Proposed BDFL Delegate update for PEPs 538 &</span><br><span> 540 (assuming UTF-8 for *nix system boundaries)</span><br><span> <<a href="https://mail.python.org/pipermail/python-dev/2017-April/147795.html">https://mail.python.org/pipermail/python-dev/2017-April/147795.html</a>>`_</span><br><span>* 2017-01: `[Python-ideas] PEP 540: Add a new UTF-8 mode</span><br><span> <<a href="https://mail.python.org/pipermail/python-ideas/2017-January/044089.html">https://mail.python.org/pipermail/python-ideas/2017-January/044089.html</a>>`_</span><br><span>* 2017-01: `bpo-28180: Implementation of the PEP 538: coerce C locale to</span><br><span> C.utf-8 (msg284764) <<a href="https://bugs.python.org/issue28180#msg284764">https://bugs.python.org/issue28180#msg284764</a>>`_</span><br><span>* 2016-08-17: `bpo-27781: Change sys.getfilesystemencoding() on Windows</span><br><span> to UTF-8 (msg272916) <<a href="https://bugs.python.org/issue27781#msg272916">https://bugs.python.org/issue27781#msg272916</a>>`_</span><br><span> -- Victor proposed ``-X utf8`` for the :pep:`529` (Change Windows</span><br><span> filesystem encoding to UTF-8)</span><br><span></span><br><span></span><br><span>Copyright</span><br><span>=========</span><br><span></span><br><span>This document has been placed in the public domain.</span><br><span></span><br></div></blockquote></body></html>