Draft PEP: Remove wstr from Unicode
PEP: 9999
Title: Remove wstr from Unicode
Author: Inada Naoki <songofacandy@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 18-Jun-2020
Python-Version: TBD

Abstract
========

PEP 393 deprecated some unicode APIs, and introduced ``wchar_t *wstr``,
and ``Py_ssize_t wstr_length`` in unicode implementation for backward
compatibility of these deprecated APIs. [1]_

This PEP is planning removal of ``wstr``, and ``wstr_length`` with
deprecated APIs using these members.

Motivation
==========

Memory usage
------------

``str`` is one of the most used types in Python. Even most simple ASCII
strings have a ``wstr`` member. It consumes 8 bytes on 64bit systems.

Runtime overhead
----------------

To support legacy Unicode object created by
``PyUnicode_FromUnicode(NULL, length)``, many Unicode APIs has
``PyUnicode_READY()`` check.

When we drop support of legacy unicode object, We can reduce this
overhead too.

Simplicity
----------

Support of legacy Unicode object makes Unicode implementation complex.
Until we drop legacy Unicode object, it is very hard to try other
Unicode implementation like UTF-8 based implementation in PyPy.

Specification
=============

Affected APIs
--------------

From the Unicode implementation, ``wstr`` and ``wstr_length`` members
are removed.

Macros and functions to be removed:

* PyUnicode_GET_SIZE
* PyUnicode_GET_DATA_SIZE
* Py_UNICODE_WSTR_LENGTH
* PyUnicode_AS_UNICODE
* PyUnicode_AS_DATA
* PyUnicode_AsUnicode
* PyUnicode_AsUnicodeAndSize

Behaviors to be removed:

* PyUnicode_FromUnicode -- ``PyUnicode_FromUnicode(NULL, size)`` where
  ``size > 0`` cause RuntimeError instead of creating legacy Unicode
  object. While this API is deprecated by PEP 393, this API will be
  kept when ``wstr`` is removed. This API will be removed later.

* PyUnicode_FromStringAndSize -- Like PyUnicode_FromUnicode,
  ``PyUnicode_FromStringAndSize(NULL, size)`` cause RuntimeError
  instead of creating legacy unicode object.

* PyArg_ParseTuple, PyArg_ParseTupleAndKeywords -- 'u', 'u#', 'Z', and
  'Z#' format will be removed.

Deprecation
-----------

All APIs to be removed should have compiler deprecation warning
(e.g. `Py_DEPRECATED(3.3)`) from Python 3.9. [2]_

All APIs to be changed should raise DeprecationWarning for behavior to
be removed. Note that ``PyUnicode_FromUnicode`` has both of compiler
deprecation warning and runtime DeprecationWarning. [3]_, [4]_.

Plan
-----

All deprecations will be implemented in Python 3.10.
Some deprecations will be backported in Python 3.9.

Actual removal will happen in Python 3.12.

Alternative Ideas
=================

Advanced Schedule
-----------------

Backport warnings in 3.9, and do the removal in early development phase
in Python 3.11. If many third packages are broken by this change, we
will revert the change and back to the regular schedule.

Pros: There is a chance to remove ``wstr`` in Python 3.11. Even if we
need to revert it, third party maintainers can have more time to
prepare the removal and we can get feedback from the community early.

Cons: Adding warnings in beta period will make some confusion. Note
that we need to avoid the warning from CPython core and stdlib.

Use hashtable to store wstr
---------------------------

Store the ``wstr`` in a hashtable, instead of Unicode structure.

Pros: We can save memory usage even from Python 3.10. We can have more
longer timeline to remove the ``wstr``.

Cons: This implementation will increase the complexity of Unicode
implementation.

References
==========

A collection of URLs used as references through the PEP.

.. [1] PEP 393 -- Flexible String Representation
   (https://www.python.org/dev/peps/pep-0393/)

.. [2] GH-20878 -- Add Py_DEPRECATED to deprecated unicode APIs
   (https://github.com/python/cpython/pull/20878)

.. [3] GH-20933 -- Raise DeprecationWarning when creating legacy Unicode
   (https://github.com/python/cpython/pull/20933)

.. [4] GH-20927 -- Raise DeprecationWarning for getargs with 'u', 'Z' #20927
   (https://github.com/python/cpython/pull/20927)

Copyright
=========

This document has been placed in the public domain.

--
Inada Naoki <songofacandy@gmail.com>
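The "Memory usage" claim in the draft can be checked with a rough model. The following is only a sketch: it approximates the PEP 393 ``PyASCIIObject`` header with ``ctypes``, assuming a non-debug build and treating the ``state`` bitfield as one ``unsigned int``. The class names ``WithWstr`` and ``WithoutWstr`` are made up for this illustration and are not CPython names.

```python
import ctypes

# Simplified model of the PEP 393 PyASCIIObject header (assumption: non-debug
# build, "state" bitfield approximated as a single unsigned int).
COMMON_FIELDS = [
    ("ob_refcnt", ctypes.c_ssize_t),
    ("ob_type", ctypes.c_void_p),
    ("length", ctypes.c_ssize_t),
    ("hash", ctypes.c_ssize_t),
    ("state", ctypes.c_uint),
]


class WithWstr(ctypes.Structure):
    # Header layout that still carries the wstr pointer.
    _fields_ = COMMON_FIELDS + [("wstr", ctypes.c_void_p)]


class WithoutWstr(ctypes.Structure):
    # Hypothetical layout after the wstr member is dropped.
    _fields_ = COMMON_FIELDS


# The difference is one pointer: 8 bytes on a typical 64-bit platform.
saved = ctypes.sizeof(WithWstr) - ctypes.sizeof(WithoutWstr)
print("header with wstr:   ", ctypes.sizeof(WithWstr), "bytes")
print("header without wstr:", ctypes.sizeof(WithoutWstr), "bytes")
print("saved per str:      ", saved, "bytes")
```

Because of struct padding the saving is exactly one pointer in this model; real numbers for a given interpreter can be compared with ``sys.getsizeof("")``.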
Hi INADA-san,

First of all, thanks for writing down a PEP!

On Thu, 18 Jun 2020 at 11:42, Inada Naoki <songofacandy@gmail.com> wrote:
To support legacy Unicode object created by ``PyUnicode_FromUnicode(NULL, length)``, many Unicode APIs has ``PyUnicode_READY()`` check.
I don't see PyUnicode_READY() removal in the specification section. When can we remove these calls and the function itself?
Support of legacy Unicode object makes Unicode implementation complex. Until we drop legacy Unicode object, it is very hard to try other Unicode implementation like UTF-8 based implementation in PyPy.
I'm not sure if it should be in the scope of the PEP or not, but there are also other C API functions which are too close to the PEP 393 concrete implementation. For example, I'm not sure that PyUnicode_MAX_CHAR_VALUE(str) would be relevant/efficient if Python str is reimplemented to use UTF-8 internally. Should we deprecate it as well? Do you think that it should be addressed in a separate PEP?

In fact, a large part of the Unicode C API is based on the current implementation of the Python str type. For example, I'm not sure that PyUnicode_New(size, max_char) would still make sense if we change the code to store strings as UTF-8 internally.

In an ideal world, I would prefer to have a "string builder" API, like the current _PyUnicodeWriter C API, to create a string, and never allow modifying a string in-place. CPython's "almost" immutable str ("if reference count is equal to 1") has corner cases and can be misused. But again, I don't think that it should be part of this PEP :-)

Sorry for being off-topic ;-)
Specification =============
Affected APIs --------------
From the Unicode implementation, ``wstr`` and ``wstr_length`` members are removed.
Macros and functions to be removed:
* PyUnicode_GET_SIZE * PyUnicode_GET_DATA_SIZE * Py_UNICODE_WSTR_LENGTH * PyUnicode_AS_UNICODE * PyUnicode_AS_DATA * PyUnicode_AsUnicode * PyUnicode_AsUnicodeAndSize
Which ones are already deprecated?
Behaviors to be removed:
* PyUnicode_FromUnicode -- ``PyUnicode_FromUnicode(NULL, size)`` where ``size > 0`` cause RuntimeError instead of creating legacy Unicode object. While this API is deprecated by PEP 393, this API will be kept when ``wstr`` is removed. This API will be removed later.
I'm not sure that it's relevant to keep PyUnicode_FromUnicode() when PyUnicode_FromWideChar() has a clean API (it uses wchar_t*, not Py_UNICODE*). I also suggest disallowing PyUnicode_FromUnicode(NULL, 0) as well.

By the way, when can we finally remove the Py_UNICODE type? I would prefer to remove Py_UNICODE and PyUnicode_FromUnicode().
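To make the wchar_t-based alternative concrete, here is a minimal sketch that calls PyUnicode_FromWideChar() and PyUnicode_AsWideCharString() through ``ctypes.pythonapi``. Both are existing C API functions that do not depend on wstr. The helper names ``from_wide`` and ``wide_round_trip`` are invented for this example, and C extension code would of course call the functions directly rather than via ctypes.

```python
import ctypes

api = ctypes.pythonapi

# PyObject *PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
api.PyUnicode_FromWideChar.argtypes = [ctypes.c_wchar_p, ctypes.c_ssize_t]
api.PyUnicode_FromWideChar.restype = ctypes.py_object

# wchar_t *PyUnicode_AsWideCharString(PyObject *unicode, Py_ssize_t *size)
api.PyUnicode_AsWideCharString.argtypes = [
    ctypes.py_object,
    ctypes.POINTER(ctypes.c_ssize_t),
]
api.PyUnicode_AsWideCharString.restype = ctypes.c_void_p

# void PyMem_Free(void *p)
api.PyMem_Free.argtypes = [ctypes.c_void_p]
api.PyMem_Free.restype = None


def from_wide(text):
    # size=-1 asks the C function to compute the length with wcslen().
    return api.PyUnicode_FromWideChar(text, -1)


def wide_round_trip(text):
    # Export to a freshly allocated wchar_t buffer, copy it, then free it.
    size = ctypes.c_ssize_t()
    ptr = api.PyUnicode_AsWideCharString(text, ctypes.byref(size))
    try:
        return ctypes.wstring_at(ptr, size.value)
    finally:
        api.PyMem_Free(ptr)


print(from_wide("abc"))
print(wide_round_trip("h\u00e9llo"))
```

Unlike PyUnicode_AsUnicode(), the buffer returned by PyUnicode_AsWideCharString() is owned by the caller, which is why the sketch frees it with PyMem_Free().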
* PyUnicode_FromStringAndSize -- Like PyUnicode_FromUnicode, ``PyUnicode_FromStringAndSize(NULL, size)`` cause RuntimeError instead of creating legacy unicode object.
All APIs to be changed should raise DeprecationWarning for behavior to be removed. Note that ``PyUnicode_FromUnicode`` has both of compiler deprecation warning and runtime DeprecationWarning. [3]_, [4]_.
Every function scheduled for removal? Even PyUnicode_GET_SIZE()? I'm not sure that C extensions are prepared for PyUnicode_GET_SIZE() raising an exception when using -Werror.
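The -Werror concern can be illustrated in pure Python. This is a sketch under stated assumptions: ``legacy_get_size`` is a hypothetical stand-in for a deprecated API that warns at runtime, and ``call_with_filter`` is a helper invented for the example. The "error" filter has the same effect on DeprecationWarning as running with ``python -W error``.

```python
import warnings


def legacy_get_size(text):
    """Hypothetical stand-in for a deprecated API that warns at runtime."""
    warnings.warn(
        "legacy_get_size() is deprecated", DeprecationWarning, stacklevel=2
    )
    return len(text)


def call_with_filter(action):
    # Apply the given warning filter only for the duration of this call.
    with warnings.catch_warnings():
        warnings.simplefilter(action, DeprecationWarning)
        try:
            return legacy_get_size("abc")
        except DeprecationWarning as exc:
            return f"raised: {exc}"


print(call_with_filter("ignore"))  # the call succeeds and returns the length
print(call_with_filter("error"))   # the warning is turned into an exception
```

In Python the exception simply propagates. A C caller of a size macro such as PyUnicode_GET_SIZE() typically does not check for an error, which is the situation the question above is about.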
All deprecations will be implemented in Python 3.10. Some deprecations will be backported in Python 3.9.
Actual removal will happen in Python 3.12.
Many functions have already been declared with Py_DEPRECATED() for a long time. Would it make sense to remove these functions earlier?

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
On Tue, Jun 23, 2020 at 6:58 AM Victor Stinner <vstinner@python.org> wrote:
Hi INADA-san,
First of all, thanks for writing down a PEP!
On Thu, 18 Jun 2020 at 11:42, Inada Naoki <songofacandy@gmail.com> wrote:
To support legacy Unicode object created by ``PyUnicode_FromUnicode(NULL, length)``, many Unicode APIs has ``PyUnicode_READY()`` check.
I don't see PyUnicode_READY() removal in the specification section. When can we remove these calls and the function itself?
The legacy Unicode representation uses wstr, so legacy Unicode support is removed together with wstr. PyUnicode_READY() will become a no-op once wstr is removed, and calls to PyUnicode_READY() can be removed from that point on.

I think we can deprecate PyUnicode_READY() when wstr is removed.
Support of legacy Unicode object makes Unicode implementation complex. Until we drop legacy Unicode object, it is very hard to try other Unicode implementation like UTF-8 based implementation in PyPy.
I'm not sure if it should be in the scope of the PEP or not, but there are also other C API functions which are too close to the PEP 393 concrete implementation. For example, I'm not sure that PyUnicode_MAX_CHAR_VALUE(str) would be relevant/efficient if Python str is reimplemented to use UTF-8 internally. Should we deprecate it as well? Do you think that it should be addressed in a separate PEP?
I don't like optimizations which rely heavily on the CPython implementation. But I think it is too early to deprecate it. We should just recommend a UTF-8 based approach.
In fact, a large part of the Unicode C API is based on the current implementation of the Python str type. For example, I'm not sure that PyUnicode_New(size, max_char) would still make sense if we change the code to store strings as UTF-8 internally.
In an ideal world, I would prefer to have a "string builder" API, like the current _PyUnicodeWriter C API, to create a string, and never allow modifying a string in-place.
I completely agree with you. But the current _PyUnicodeWriter is tightly coupled with PEP 393 and it is not UTF-8 based. I am not sure that we should make it public and stable from Python 3.10.

I think we should recommend `PyUnicode_FromStringAndSize(utf8, utf8_len)` for now, to avoid coupling too tightly with PEP 393.

Regards,
--
Inada Naoki <songofacandy@gmail.com>
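The recommended UTF-8 based construction could look roughly like the following sketch. It collects UTF-8 bytes in an ordinary buffer and creates the str in one call to PyUnicode_FromStringAndSize(), reached here through ``ctypes.pythonapi`` only so the example is runnable. The helper name ``str_from_utf8`` and the sample data are assumptions made for this illustration.

```python
import ctypes

api = ctypes.pythonapi

# PyObject *PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
api.PyUnicode_FromStringAndSize.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t]
api.PyUnicode_FromStringAndSize.restype = ctypes.py_object


def str_from_utf8(data):
    # The buffer is decoded as UTF-8; no PEP 393 "kind" or max_char is needed.
    return api.PyUnicode_FromStringAndSize(data, len(data))


# Build the UTF-8 bytes first, then create the str object in a single step.
chunks = ["caf\u00e9", " / ", "wstr"]
buf = b"".join(chunk.encode("utf-8") for chunk in chunks)
text = str_from_utf8(buf)
print(text)
```

The caller never touches the internal representation, so code written this way would not need to change if str were stored differently in the future.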
On Tue, 23 Jun 2020 at 04:02, Inada Naoki <songofacandy@gmail.com> wrote:
The legacy Unicode representation uses wstr, so legacy Unicode support is removed together with wstr. PyUnicode_READY() will become a no-op once wstr is removed, and calls to PyUnicode_READY() can be removed from that point on.
I think we can deprecate PyUnicode_READY() when wstr is removed.
Would it be possible to rewrite the plan differently (merge Specification sections) to list changes per Python version? Something like:

== Python 3.9 ==

* Deprecate xxx in the documentation and add Py_DEPRECATED()
* Remove xxx

== Python 3.10 ==

* Deprecate xxx in the documentation and add Py_DEPRECATED()
* Add DeprecationWarning at runtime in xxx
* Remove xxx

== Python 3.11 ==

* Remove wstr member
* Remove xxx functions
* PyUnicode_READY() is kept for backward compatibility, but it is deprecated and becomes a no-op
* ...

== Python 3.12 ==

* Remove PyUnicode_READY()
* ...

Also, some functions are already deprecated. Would you mind listing them in the PEP? I fail to track the status of each function.

Victor
On Tue, Jun 23, 2020 at 6:31 PM Victor Stinner <vstinner@python.org> wrote:
On Tue, 23 Jun 2020 at 04:02, Inada Naoki <songofacandy@gmail.com> wrote:
The legacy Unicode representation uses wstr, so legacy Unicode support is removed together with wstr. PyUnicode_READY() will become a no-op once wstr is removed, and calls to PyUnicode_READY() can be removed from that point on.
I think we can deprecate PyUnicode_READY() when wstr is removed.
Would it be possible to rewrite the plan differently (merge Specification sections) to list changes per Python version? Something like:
OK, I rewrite the PEP. https://github.com/python/peps/pull/1462
Also, some functions are already deprecated. Would you mind listing them in the PEP? I fail to track the status of each function.
Do you mean APIs relating to Py_UNICODE, but not relating to wstr or legacy Unicode? (e.g. PyLong_FromUnicode, PyUnicode_Encode, etc...)

We can remove them on a one-by-one basis.

* Most APIs can be removed in 3.10.
* Some APIs can be undeprecated by changing Py_UNICODE to wchar_t.
* Some APIs need more discussion (e.g. PyUnicodeEncodeError_Create, PyUnicodeTranslateError_Create).

Since they are independent from wstr and legacy Unicode object, I don't want to handle them in this PEP.

Regards,
--
Inada Naoki <songofacandy@gmail.com>
I commented on https://github.com/python/peps/pull/1462 as a review.

On Wed, 24 Jun 2020 at 10:21, Inada Naoki <songofacandy@gmail.com> wrote:
Do you mean APIs relating to Py_UNICODE, but not relating to wstr nor legacy Unicode? (e.g. PyLong_FromUnicode, PyUnicode_Encode, etc...)
We can remove them on a one-by-one basis.
Oh, I forgot about these. I suggest deprecating all of them in Python 3.10 and removing them in Python 3.12. Or is there any good reason to keep them?
Since they are independent from wstr and legacy Unicode object, I don't want to handle them in this PEP.
Ok.

Victor