[Python-Dev] Re: Draft PEP: Remove wstr from Unicode

June 22, 2020

      On Tue, Jun 23, 2020 at 6:58 AM Victor Stinner <vstinner@python.org> wrote:
...
Hi INADA-san,
First of all, thanks for writing down a PEP!
On Thu, Jun 18, 2020 at 11:42, Inada Naoki <songofacandy@gmail.com> wrote:
...
To support legacy Unicode objects created by
``PyUnicode_FromUnicode(NULL, length)``, many Unicode APIs have a
``PyUnicode_READY()`` check.
I don't see PyUnicode_READY() removal in the specification section.
When can we remove these calls and the function itself?
The legacy Unicode representation uses wstr, so legacy Unicode support
is removed together with wstr.
PyUnicode_READY() will become a no-op once wstr is removed.  From that
point on, we can remove the calls to PyUnicode_READY().

I think we can deprecate PyUnicode_READY() when wstr is removed.
...
...
Supporting legacy Unicode objects makes the Unicode implementation complex.
Until we drop legacy Unicode objects, it is very hard to try other Unicode
implementations, such as the UTF-8 based implementation in PyPy.
I'm not sure if it should be in the scope of the PEP or not, but there
are also other C API functions which are too close to the PEP 393
concrete implementation. For example, I'm not sure that
PyUnicode_MAX_CHAR_VALUE(str) would be relevant/efficient if Python
str is reimplemented to use UTF-8 internally. Should we deprecate it
as well? Do you think that it should be addressed in a separate PEP?
I don't like optimizations that rely heavily on the CPython
implementation. But I think it is too early to deprecate it.
We should just recommend a UTF-8 based approach.
...
In fact, a large part of the Unicode C API is based on the current
implementation of the Python str type. For example, I'm not sure that
PyUnicode_New(size, max_char) would still make sense if we change the
code to store strings as UTF-8 internally.
In an ideal world, I would prefer to have a "string builder" API, like
the current _PyUnicodeWriter C API, to create a string, and never
allow modifying a string in place.
I completely agree with you.  But the current _PyUnicodeWriter is tightly
coupled to PEP 393 and is not UTF-8 based.  I am not sure that
we should make it public and stable from Python 3.10.

I think we should recommend `PyUnicode_FromStringAndSize(utf8, utf8_len)`
for now, to avoid coupling code too tightly to PEP 393.
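
As a rough illustration of that recommendation, here is a minimal sketch
that calls `PyUnicode_FromStringAndSize` from Python through
`ctypes.pythonapi`, so it can be tried without building an extension.
The helper name `str_from_utf8` is hypothetical and exists only for this
example; in a real C extension the function would be called directly.
The point is that the caller passes only a UTF-8 buffer and its length,
and never touches wstr, `PyUnicode_New(size, max_char)`, or any other
PEP 393 detail.

```python
import ctypes

# Bind the C API function exported by the running CPython interpreter.
_from_utf8 = ctypes.pythonapi.PyUnicode_FromStringAndSize
_from_utf8.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t]
_from_utf8.restype = ctypes.py_object  # result is returned as a Python object


def str_from_utf8(data: bytes) -> str:
    """Hypothetical helper: build a str from a UTF-8 encoded buffer.

    Only the buffer and its byte length are passed in; the internal
    string representation is left entirely to the interpreter.
    """
    return _from_utf8(data, len(data))


if __name__ == "__main__":
    utf8 = "héllo wörld".encode("utf-8")
    print(str_from_utf8(utf8))
```

Because the input is described purely as UTF-8 bytes, code written this
way should not need to change if the internal representation of str
changes later.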

Regards,

-- 
Inada Naoki  <songofacandy@gmail.com>