PEP 597: Add optional EncodingWarning
Hi, all. I updated the PEP 597 yesterday. Please review it to move it forward. PEP: https://www.python.org/dev/peps/pep-0597/ Previous thread: https://discuss.python.org/t/3880 Main difference from the previous version: * Added new warning category; EncodingWarning * Added dedicated option to enable the warning instead of using dev mode. Abstract ======== Add a new warning category ``EncodingWarning``. It is emitted when ``encoding`` option is omitted and the default encoding is a locale encoding. The warning is disabled by default. New ``-X warn_encoding`` command-line option and ``PYTHONWARNENCODING`` environment variable are used to enable the warnings. Motivation ========== Using the default encoding is a common mistake ---------------------------------------------- Developers using macOS or Linux may forget that the default encoding is not always UTF-8. For example, ``long_description = open("README.md").read()`` in ``setup.py`` is a common mistake. Many Windows users can not install the package if there is at least one non-ASCII character (e.g. emoji) in the ``README.md`` file which is encoded in UTF-8. For example, 489 packages of the 4000 most downloaded packages from PyPI used non-ASCII characters in README. And 82 packages of them can not be installed from source package when locale encoding is ASCII. [1_] They used the default encoding to read README or TOML file. Another example is ``logging.basicConfig(filename="log.txt")``. Some users expect UTF-8 is used by default, but locale encoding is used actually. [2_] Even Python experts assume that default encoding is UTF-8. It creates bugs that happen only on Windows. See [3_] and [4_]. Emitting a warning when the ``encoding`` option is omitted will help to find such mistakes. Prepare to change the default encoding to UTF-8 ----------------------------------------------- We had chosen to use locale encoding for the default text encoding in Python 3.0. But UTF-8 has been adopted very widely since then. We might change the default text encoding to UTF-8 in the future. But this change will affect many applications and libraries. Many ``DeprecationWarning`` will be emitted if we start emitting the warning by default. It will be too noisy. Although this PEP doesn't propose to change the default encoding, this PEP will help to reduce the warning in the future if we decide to change the default encoding. Specification ============= ``EncodingWarning`` -------------------- Add new ``EncodingWarning`` warning class which is a subclass of ``Warning``. It is used to warn when the ``encoding`` option is omitted and the default encoding is locale-specific. Options to enable the warning ------------------------------ ``-X warn_encoding`` option and the ``PYTHONWARNENCODING`` environment variable are added. They are used to enable the ``EncodingWarning``. ``sys.flags.encoding_warning`` is also added. The flag represents ``EncodingWarning`` is enabled. When the option is enabled, ``io.TextIOWrapper()``, ``open()``, and other modules using them will emit ``EncodingWarning`` when ``encoding`` is omitted. ``encoding="locale"`` option ---------------------------- ``io.TextIOWrapper`` accepts ``encoding="locale"`` option. It means same to current ``encoding=None``. But ``io.TextIOWrapper`` doesn't emit ``EncodingWarning`` when ``encoding="locale"`` is specified. Add ``io.LOCALE_ENCODING = "locale"`` constant too. This constant can be used to avoid confusing ``LookupError: unknown encoding: locale`` error when the code is run in old Python accidentally. The constant can be used to test that ``encoding="locale"`` option is supported too. For example, .. code-block:: # Want to suppress an EncodingWarning but still need support # old Python versions. locale_encoding = getattr(io, "LOCALE_ENCODING", None) with open(filename, encoding=locale_encoding) as f: ... ``io.text_encoding()`` ----------------------- ``io.text_encoding()`` is a helper function for functions having ``encoding=None`` option and passing it to ``io.TextIOWrapper()`` or ``open()``. Pure Python implementation will be like this:: def text_encoding(encoding, stacklevel=1): """Helper function to choose the text encoding. When *encoding* is not None, just return it. Otherwise, return the default text encoding (i.e., "locale"). This function emits EncodingWarning if *encoding* is None and sys.flags.encoding_warning is true. This function can be used in APIs having encoding=None option and pass it to TextIOWrapper or open. But please consider using encoding="utf-8" for new APIs. """ if encoding is None: if sys.flags.encoding_warning: import warnings warnings.warn("'encoding' option is omitted", EncodingWarning, stacklevel + 2) encoding = LOCALE_ENCODING return encoding For example, ``pathlib.Path.read_text()`` can use the function like: .. code-block:: def read_text(self, encoding=None, errors=None): encoding = io.text_encoding(encoding) with self.open(mode='r', encoding=encoding, errors=errors) as f: return f.read() By using ``io.text_encoding()``, ``EncodingWarning`` is emitted for the caller of ``read_text()`` instead of ``read_text()``. Affected stdlibs ------------------- Many stdlibs will be affected by this change. Most APIs accepting ``encoding=None`` will use ``io.text_encoding()`` as written in the previous section. Where using locale encoding as the default encoding is reasonable, ``encoding=io.LOCALE_ENCODING`` will be used instead. For example, ``subprocess`` module will use locale encoding for the default encoding of the pipes. Many tests use ``open()`` without ``encoding`` specified to read ASCII text files. They should be rewritten with ``encoding="ascii"``. Rationale ========= Opt-in warning --------------- Although ``DeprecationWarning`` is suppressed by default, emitting ``DeprecationWarning`` always when ``encoding`` option is omitted would be too noisy. Noisy warnings may lead developers to dismiss the ``DeprecationWarning``. "locale" is not a codec alias ----------------------------- We don't add the "locale" to the codec alias because locale can be changed in runtime. Additionally, ``TextIOWrapper`` checks ``os.device_encoding()`` when ``encoding=None``. This behavior can not be implemented in the codec. Reference Implementation ======================== https://github.com/python/cpython/pull/19481 References ========== .. [1] "Packages can't be installed when encoding is not UTF-8" (https://github.com/methane/pep597-pypi-ascii) .. [2] "Logging - Inconsistent behaviour when handling unicode" (https://bugs.python.org/issue37111) .. [3] Packaging tutorial in packaging.python.org didn't specify encoding to read a ``README.md`` (https://github.com/pypa/packaging.python.org/pull/682) .. [4] ``json.tool`` had used locale encoding to read JSON files. (https://bugs.python.org/issue33684) Copyright ========= This document has been placed in the public domain.
participants (6)
-
Inada Naoki
-
Jim J. Jewett
-
Jonathan Goble
-
Paul Moore
-
Terry Reedy
-
Victor Stinner