[Python-Dev] Re: Draft PEP: Remove wstr from Unicode

22 Jun 2020

Hi INADA-san,

First of all, thanks for writing down a PEP!

On Thu, 18 Jun 2020 at 11:42, Inada Naoki <songofacandy@gmail.com> wrote:
...
> To support legacy Unicode object created by
> ``PyUnicode_FromUnicode(NULL, length)``, many Unicode APIs has
> ``PyUnicode_READY()`` check.

I don't see PyUnicode_READY() removal in the specification section.
When can we remove these calls and the function itself?
...
> Support of legacy Unicode object makes Unicode implementation complex.
> Until we drop legacy Unicode object, it is very hard to try other Unicode
> implementation like UTF-8 based implementation in PyPy.

I'm not sure if it should be in the scope of the PEP or not, but there
are also other C API functions which are too close to the PEP 393
concrete implementation. For example, I'm not sure that
PyUnicode_MAX_CHAR_VALUE(str) would be relevant/efficient if Python
str is reimplemented to use UTF-8 internally. Should we deprecate it
as well? Do you think that it should be addressed in a separate PEP?

In fact, a large part of the Unicode C API is based on the current
implementation of the Python str type. For example, I'm not sure that
PyUnicode_New(size, max_char) would still make sense if we change the
code to store strings as UTF-8 internally.
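
As a rough illustration of why these two APIs are tied to PEP 393, here is
a small Python model. The helper names are hypothetical and this is not the
CPython implementation: it only mimics the upper bounds that
PyUnicode_MAX_CHAR_VALUE() reports (0x7f, 0xff, 0xffff or 0x10ffff) and the
fixed per-character width that follows from them, next to the
variable-length UTF-8 size of the same text.

```python
def max_char_value(text: str) -> int:
    """Model of the bound PyUnicode_MAX_CHAR_VALUE() reports (assumption:
    one of 0x7f, 0xff, 0xffff, 0x10ffff, as under PEP 393)."""
    highest = max(map(ord, text), default=0)
    for bound in (0x7F, 0xFF, 0xFFFF):
        if highest <= bound:
            return bound
    return 0x10FFFF


def bytes_per_char(text: str) -> int:
    """Fixed storage width per character under PEP 393: 1, 2 or 4 bytes."""
    return {0x7F: 1, 0xFF: 1, 0xFFFF: 2, 0x10FFFF: 4}[max_char_value(text)]


# Compare the PEP 393 fixed width with the UTF-8 encoded length.
for sample in ("abc", "caf\u00e9", "\u20ac10", "a\U0001F600"):
    print(
        repr(sample),
        hex(max_char_value(sample)),
        bytes_per_char(sample),
        len(sample.encode("utf-8")),
    )
```

With UTF-8 storage the width varies per character, so a single "max char"
hint no longer determines the buffer layout in the same way.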

In an ideal world, I would prefer to have a "string builder" API, like
the current _PyUnicodeWriter C API, to create a string, and never
allow modifying a string in place.
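
The builder idea can be sketched at the Python level. This is only an
analogy, not the _PyUnicodeWriter API: the StringBuilder class and its
method names are hypothetical. Pieces are collected in a private buffer,
and the immutable str is produced exactly once, after which the builder
refuses further writes.

```python
import io


class StringBuilder:
    """Hypothetical builder: collect pieces, then emit one immutable str."""

    def __init__(self) -> None:
        self._buffer = io.StringIO()
        self._finished = False

    def write(self, piece: str) -> None:
        # Writes are only allowed while the string is still being built.
        if self._finished:
            raise RuntimeError("builder already finished")
        self._buffer.write(piece)

    def finish(self) -> str:
        # After this call the result is never modified again.
        self._finished = True
        return self._buffer.getvalue()


builder = StringBuilder()
builder.write("wstr ")
builder.write("removed")
result = builder.finish()
print(result)
```

The point of the design is that no caller ever holds a str object that is
still being filled in.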

CPython's "almost" immutable str (mutable only "if the reference count
is equal to 1") has corner cases and can be misused. But again, I don't
think that it should be part of this PEP :-) Sorry for being off-topic ;-)
...
> Specification
> =============
> Affected APIs
> --------------
> From the Unicode implementation, ``wstr`` and ``wstr_length`` members are
> removed.
> Macros and functions to be removed:
> * PyUnicode_GET_SIZE
> * PyUnicode_GET_DATA_SIZE
> * Py_UNICODE_WSTR_LENGTH
> * PyUnicode_AS_UNICODE
> * PyUnicode_AS_DATA
> * PyUnicode_AsUnicode
> * PyUnicode_AsUnicodeAndSize

Which ones are already deprecated?
...
> Behaviors to be removed:
> * PyUnicode_FromUnicode -- ``PyUnicode_FromUnicode(NULL, size)`` where
>   ``size > 0`` cause RuntimeError instead of creating legacy Unicode
>   object. While this API is deprecated by PEP 393, this API will be kept
>   when ``wstr`` is removed. This API will be removed later.

I'm not sure that it's relevant to keep PyUnicode_FromUnicode()
whereas PyUnicode_FromWideChar() has a clean API (use wchar_t*, not
Py_UNICODE*). I also suggest disallowing PyUnicode_FromUnicode(NULL,
0) as well.
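
For reference, a minimal sketch of calling PyUnicode_FromWideChar() from
Python through ctypes. It assumes CPython, where ctypes.pythonapi exposes
the C API; the variable names are only illustrative. The function takes a
wchar_t* buffer and a length, with no Py_UNICODE involved.

```python
import ctypes

# Assumption: running on CPython, where ctypes.pythonapi exposes the C API.
from_wide_char = ctypes.pythonapi.PyUnicode_FromWideChar
from_wide_char.argtypes = [ctypes.c_wchar_p, ctypes.c_ssize_t]
# The function returns a new reference; py_object hands it over to Python.
from_wide_char.restype = ctypes.py_object

# BMP-only text, so the wchar_t count equals len(text) on every platform.
text = "h\u00e9llo \u20ac"
result = from_wide_char(text, len(text))
print(result == text)
```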

By the way, when can we finally remove the Py_UNICODE type?

I would prefer to remove Py_UNICODE and PyUnicode_FromUnicode().
...
> * PyUnicode_FromStringAndSize -- Like PyUnicode_FromUnicode,
>   ``PyUnicode_FromStringAndSize(NULL, size)`` cause RuntimeError
>   instead of creating legacy unicode object.
...
> All APIs to be changed should raise DeprecationWarning for behavior to be
> removed. Note that ``PyUnicode_FromUnicode`` has both of compiler deprecation
> warning and runtime DeprecationWarning. [3]_, [4]_.

Every function scheduled for removal? Even PyUnicode_GET_SIZE()? I'm
not sure that C extensions are prepared for PyUnicode_GET_SIZE()
raising an exception when using -Werror.
...
> All deprecations will be implemented in Python 3.10.
> Some deprecations will be backported in Python 3.9.
> Actual removal will happen in Python 3.12.

Many functions have been declared with Py_DEPRECATED() for a long
time. Would it make sense to remove these functions earlier?

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.

Victor Stinner