On Tue, Jun 23, 2020 at 6:58 AM Victor Stinner <vstinner@python.org> wrote:
Hi INADA-san,
First of all, thanks for writing down a PEP!
On Thu, Jun 18, 2020 at 11:42, Inada Naoki <songofacandy@gmail.com> wrote:
To support legacy Unicode objects created by ``PyUnicode_FromUnicode(NULL, length)``, many Unicode APIs have a ``PyUnicode_READY()`` check.
I don't see PyUnicode_READY() removal in the specification section. When can we remove these calls and the function itself?
The legacy Unicode representation uses wstr, so legacy Unicode support is removed together with wstr. PyUnicode_READY() will be a no-op once wstr is removed, and from then on calls to PyUnicode_READY() can be removed. I think we can deprecate PyUnicode_READY() when wstr is removed.
Support for legacy Unicode objects makes the Unicode implementation complex. Until we drop legacy Unicode objects, it is very hard to try other Unicode implementations, such as the UTF-8 based implementation in PyPy.
I'm not sure if it should be in the scope of the PEP or not, but there are also other C API functions which are too close to the PEP 393 concrete implementation. For example, I'm not sure that PyUnicode_MAX_CHAR_VALUE(str) would be relevant/efficient if Python str is reimplemented to use UTF-8 internally. Should we deprecate it as well? Do you think that it should be addressed in a separate PEP?
I don't like optimizations that rely heavily on the CPython implementation. But I think it is too early to deprecate it. We should just recommend a UTF-8 based approach.
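
To make the coupling concrete: under PEP 393, PyUnicode_MAX_CHAR_VALUE() reports the upper bound of the string's storage class (0x7f, 0xff, 0xffff or 0x10ffff), which CPython can read from the object header. The following is only a pure-Python emulation of that result, with a hypothetical function name; it scans every code point, which is roughly what a UTF-8 based implementation would have to do unless it cached the value.

```python
def max_char_value(s: str) -> int:
    """Emulate the value PyUnicode_MAX_CHAR_VALUE() reports for s.

    Hypothetical helper for illustration only: CPython reads this from
    the PEP 393 header, while this sketch scans the whole string.
    """
    top = max(map(ord, s), default=0)  # an empty string counts as ASCII
    if top < 0x80:
        return 0x7F       # ASCII
    if top < 0x100:
        return 0xFF       # Latin-1, 1 byte per character
    if top < 0x10000:
        return 0xFFFF     # BMP, 2 bytes per character
    return 0x10FFFF       # full Unicode range, 4 bytes per character


ascii_bound = max_char_value("abc")
bmp_bound = max_char_value("\u4e16\u754c")
```

Code that sizes a buffer with PyUnicode_New(size, max_char) depends on exactly this classification, which is why it does not translate directly to a UTF-8 representation.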
In fact, a large part of the Unicode C API is based on the current implementation of the Python str type. For example, I'm not sure that PyUnicode_New(size, max_char) would still make sense if we change the code to store strings as UTF-8 internally.
In an ideal world, I would prefer to have a "string builder" API, like the current _PyUnicodeWriter C API, to create a string, and never allow modifying a string in place.
I completely agree with you. But the current _PyUnicodeWriter is tightly coupled with PEP 393 and is not UTF-8 based. I am not sure that we should make it public and stable from Python 3.10. I think we should recommend `PyUnicode_FromStringAndSize(utf8, utf8_len)` for now, to avoid coupling too tightly with PEP 393.

Regards,
--
Inada Naoki <songofacandy@gmail.com>
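
A minimal sketch of the UTF-8 based approach discussed in this thread: collect UTF-8 bytes in a plain buffer and create the str once with PyUnicode_FromStringAndSize(). It is written in Python and reaches the C API through ctypes.pythonapi so that it runs without compiling an extension; C extension code would call the same function directly. The Utf8Builder class and its method names are hypothetical and are not part of any CPython API.

```python
import ctypes

# ctypes.pythonapi exposes the C API of the running CPython interpreter.
_from_utf8 = ctypes.pythonapi.PyUnicode_FromStringAndSize
_from_utf8.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t]
_from_utf8.restype = ctypes.py_object  # the call returns a new str object


class Utf8Builder:
    """Hypothetical helper: accumulate UTF-8 bytes, build the str once."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def write_utf8(self, chunk: bytes) -> None:
        # Appending bytes never touches a str object, so nothing here
        # depends on how CPython stores strings internally.
        self._buf += chunk

    def finish(self) -> str:
        data = bytes(self._buf)
        # Single C API call; the string is never modified afterwards.
        return _from_utf8(data, len(data))


builder = Utf8Builder()
builder.write_utf8(b"PEP ")
builder.write_utf8("393 \u2192 UTF-8".encode("utf-8"))
result = builder.finish()
```

The only contract between the caller and the interpreter in this sketch is "valid UTF-8 plus a byte length", so it would keep working whether str is stored as in PEP 393 or as UTF-8.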