
Hi Inada-san,

I am currently too busy with EuroPython to participate in longer discussions. FWIW: I intend to continue after EuroPython.

In any case, thanks for writing up the PEP. Could you please add my points about:

- the fact that the encode APIs encode from a Unicode buffer to a bytes object; this is an important fact, since the removal removes access to this codec functionality for extensions

- PyUnicode_AsEncodedString() is not a proper alternative, since it requires creating a temporary PyUnicode object, which is inefficient and wastes memory

- the maintenance effect mentioned in the PEP does not really materialize, since the underlying functionality still exists in the codecs - only access to the functionality is removed

- keeping just the generic PyUnicode_Encode() API would be a compromise

- if we remove the codec-specific PyUnicode_Encode*() APIs, why are we still keeping the special PyUnicode_Decode*() APIs?

- the deprecations were just done because the Py_UNICODE data type was replaced by a hybrid type. Using this as an argument for removing functionality is not really good practice when there are ways to continue exposing the functionality using other data types.

I am still strongly -1 on removing all encoding APIs without at least some upgrade path for existing code to use and keeping the API symmetric.

Cheers,
--
Marc-Andre Lemburg
eGenix.com
On 07.07.2020 17:17, Inada Naoki wrote:
Hi, folks.
Since the previous discussion was suspended without consensus, I wrote a new PEP for it. (Thank you Victor for reviewing it!)
This PEP looks very similar to PEP 623 "Remove wstr from Unicode", but for encoder APIs, not for Unicode object APIs.
URL (not available yet): https://www.python.org/dev/peps/pep-0624/
---
PEP: 624
Title: Remove Py_UNICODE encoder APIs
Author: Inada Naoki <songofacandy@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 06-Jul-2020
Python-Version: 3.11

Abstract
========

This PEP proposes to remove deprecated ``Py_UNICODE`` encoder APIs in Python 3.11:

* ``PyUnicode_Encode()``
* ``PyUnicode_EncodeASCII()``
* ``PyUnicode_EncodeLatin1()``
* ``PyUnicode_EncodeUTF7()``
* ``PyUnicode_EncodeUTF8()``
* ``PyUnicode_EncodeUTF16()``
* ``PyUnicode_EncodeUTF32()``
* ``PyUnicode_EncodeUnicodeEscape()``
* ``PyUnicode_EncodeRawUnicodeEscape()``
* ``PyUnicode_EncodeCharmap()``
* ``PyUnicode_TranslateCharmap()``
* ``PyUnicode_EncodeDecimal()``
* ``PyUnicode_TransformDecimalToASCII()``

.. note::

   `PEP 623 <https://www.python.org/dev/peps/pep-0623/>`_ proposes to remove
   Unicode object APIs relating to ``Py_UNICODE``. This PEP, on the other hand,
   does not relate to the Unicode object. The two PEPs are split because they
   have different motivations and need separate discussions.

Motivation
==========

In general, removing APIs that have been deprecated for a long time and have few users is a good idea: it not only improves the maintainability of CPython, but also helps API users and other Python implementations.

Rationale
=========

Deprecated since Python 3.3
---------------------------

``Py_UNICODE`` and the APIs using it have been deprecated since Python 3.3.

Inefficient
-----------

All of these APIs are implemented using ``PyUnicode_FromWideChar``, so they are inefficient when the user wants to encode a Unicode object.

Not used widely
---------------

A search of the top 4000 PyPI packages [1]_ found that only pyodbc uses these APIs:

* ``PyUnicode_EncodeUTF8()``
* ``PyUnicode_EncodeUTF16()``

pyodbc uses these APIs to encode a Unicode object into a bytes object, so it is easy to fix. [2]_

Alternative APIs
================

There are alternative APIs that accept ``PyObject *unicode`` instead of ``Py_UNICODE *``. Users can migrate to them.

=========================================  ==========================================
Deprecated API                             Alternative APIs
=========================================  ==========================================
``PyUnicode_Encode()``                     ``PyUnicode_AsEncodedString()``
``PyUnicode_EncodeASCII()``                ``PyUnicode_AsASCIIString()`` \(1)
``PyUnicode_EncodeLatin1()``               ``PyUnicode_AsLatin1String()`` \(1)
``PyUnicode_EncodeUTF7()``                 \(2)
``PyUnicode_EncodeUTF8()``                 ``PyUnicode_AsUTF8String()`` \(1)
``PyUnicode_EncodeUTF16()``                ``PyUnicode_AsUTF16String()`` \(3)
``PyUnicode_EncodeUTF32()``                ``PyUnicode_AsUTF32String()`` \(3)
``PyUnicode_EncodeUnicodeEscape()``        ``PyUnicode_AsUnicodeEscapeString()``
``PyUnicode_EncodeRawUnicodeEscape()``     ``PyUnicode_AsRawUnicodeEscapeString()``
``PyUnicode_EncodeCharmap()``              ``PyUnicode_AsCharmapString()`` \(1)
``PyUnicode_TranslateCharmap()``           ``PyUnicode_Translate()``
``PyUnicode_EncodeDecimal()``              \(4)
``PyUnicode_TransformDecimalToASCII()``    \(4)
=========================================  ==========================================

Notes:

(1) The ``const char *errors`` parameter is missing.

(2) There is no public alternative API, but users can use the generic ``PyUnicode_AsEncodedString()`` instead.

(3) The ``const char *errors, int byteorder`` parameters are missing.

(4) There is no direct replacement, but ``Py_UNICODE_TODECIMAL`` can be used instead. CPython itself uses ``_PyUnicode_TransformDecimalAndSpaceToASCII`` for converting from Unicode to numbers.

Plan
====

Python 3.9
----------

Add ``Py_DEPRECATED(3.3)`` to the following APIs. This change is committed already [3]_. All other APIs have been marked ``Py_DEPRECATED(3.3)`` already.

* ``PyUnicode_EncodeDecimal()``
* ``PyUnicode_TransformDecimalToASCII()``

Document all APIs as "will be removed in version 3.11".

Python 3.11
-----------

These APIs are removed.

* ``PyUnicode_Encode()``
* ``PyUnicode_EncodeASCII()``
* ``PyUnicode_EncodeLatin1()``
* ``PyUnicode_EncodeUTF7()``
* ``PyUnicode_EncodeUTF8()``
* ``PyUnicode_EncodeUTF16()``
* ``PyUnicode_EncodeUTF32()``
* ``PyUnicode_EncodeUnicodeEscape()``
* ``PyUnicode_EncodeRawUnicodeEscape()``
* ``PyUnicode_EncodeCharmap()``
* ``PyUnicode_TranslateCharmap()``
* ``PyUnicode_EncodeDecimal()``
* ``PyUnicode_TransformDecimalToASCII()``

Alternative ideas
=================

Instead of just removing deprecated APIs, we may be able to reuse their names with different signatures.

Make some private APIs public
------------------------------

``PyUnicode_EncodeUTF7()`` has no public alternative API.

Some APIs have alternative public APIs, but those are missing the ``const char *errors`` or ``int byteorder`` parameters.

We can rename some private APIs and make them public to cover the missing APIs and parameters.

=============================  ================================
Rename to                      Rename from
=============================  ================================
``PyUnicode_EncodeASCII()``    ``_PyUnicode_AsASCIIString()``
``PyUnicode_EncodeLatin1()``   ``_PyUnicode_AsLatin1String()``
``PyUnicode_EncodeUTF7()``     ``_PyUnicode_EncodeUTF7()``
``PyUnicode_EncodeUTF8()``     ``_PyUnicode_AsUTF8String()``
``PyUnicode_EncodeUTF16()``    ``_PyUnicode_EncodeUTF16()``
``PyUnicode_EncodeUTF32()``    ``_PyUnicode_EncodeUTF32()``
=============================  ================================

Pros:

* We have more consistent API set.

Cons:

* We have more public APIs to maintain.
* Existing public APIs are enough for most use cases, and ``PyUnicode_AsEncodedString()`` can be used in other cases.

Replace ``Py_UNICODE*`` with ``Py_UCS4*``
-----------------------------------------

We can replace ``Py_UNICODE`` (typedef of ``wchar_t``) with ``Py_UCS4``. Since builtin codecs support UCS-4, we don't need to convert a ``Py_UCS4*`` string to a Unicode object.

Pros:

* We have a more consistent API set.
* Users can encode a UCS-4 string in C without creating a Unicode object.

Cons:

* We have more public APIs to maintain.
* Applications which use UTF-8 or UTF-16 cannot use these APIs anyway.
* Other Python implementations may not have a builtin codec for UCS-4.
* If we change the Unicode internal representation to UTF-8, we need to keep UCS-4 support only for these APIs.

Replace ``Py_UNICODE*`` with ``wchar_t*``
-----------------------------------------

We can replace ``Py_UNICODE`` with ``wchar_t``.

Pros:

* We have a more consistent API set.
* Backward compatible.

Cons:

* We have more public APIs to maintain.
* They are inefficient on platforms where ``wchar_t*`` is UTF-16, because built-in codecs support only UCS-1, UCS-2, and UCS-4 input.

Rejected ideas
==============

Using runtime warning
---------------------

These APIs don't release the GIL for now, and callers may rely on that. Emitting a warning from such APIs is therefore not safe. See this example.

.. code-block::

   PyObject *u = PyList_GET_ITEM(list, i);  // u is borrowed reference.
   PyObject *b = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(u),
           PyUnicode_GET_SIZE(u), NULL);
   // Assumes u is still living reference.
   PyObject *t = PyTuple_Pack(2, u, b);
   Py_DECREF(b);
   return t;

If we emit a Python warning from ``PyUnicode_EncodeUTF8()``, warning filters and other threads may change the ``list``, and ``u`` can be a dangling reference after ``PyUnicode_EncodeUTF8()`` has returned.

Additionally, since we are not changing behavior but removing C APIs, a runtime ``DeprecationWarning`` might not be helpful for Python developers. We should warn extension developers instead.

Discussions
===========

* `Plan to remove Py_UNICODE APis except PEP 623 <https://mail.python.org/archives/list/python-dev@python.org/thread/S7KW2U6IGXZFBMGS6WSJB26NZIBW4OLE/#S7KW2U6IGXZFBMGS6WSJB26NZIBW4OLE>`_
* `bpo-41123: Remove Py_UNICODE APIs except PEP 623: <https://bugs.python.org/issue41123>`_

References
==========

.. [1] Source package list chosen from top 4000 PyPI packages. (https://github.com/methane/notes/blob/master/2020/wchar-cache/package_list.t...)
.. [2] pyodbc -- Don't use PyUnicode_Encode API #792 (https://github.com/mkleehammer/pyodbc/pull/792)
.. [3] Uncomment Py_DEPRECATED for Py_UNICODE APIs (GH-21318) (https://github.com/python/cpython/commit/9c3840870814493fed62e140cfa43c2883e...)

Copyright
=========

This document has been placed in the public domain.