[capi-sig] PyUnicode C API

7 Sep 2018 · *current*

On 07.09.2018 10:22, Victor Stinner wrote:
...
...
I'm in discussion with PyPy developers, and they reported several
APIs which cause them trouble:
(...)

Almost all PyUnicode API functions have to go, according to them.
PyPy3 uses UTF-8 internally, CPython uses "compact string" (array of
Py_UCS1, Py_UCS2 or Py_UCS4 depending on the string content).
https://pythoncapi.readthedocs.io/bad_api.html#pypy-requests
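
To make the "compact string" idea above concrete, here is a minimal sketch in plain C. It is not CPython's code: the names `str_kind`, `narrowest_kind` and `read_code_point` are hypothetical, and the sketch only assumes what the message states, namely that the storage width (1, 2 or 4 bytes per code point) depends on the string content.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names: this mirrors the idea of a "compact string",
   not CPython's actual structs or macros. */
enum str_kind { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 4 };

/* Pick the narrowest code unit that can hold every code point. */
enum str_kind narrowest_kind(const uint32_t *cps, size_t len)
{
    assert(cps != NULL || len == 0);
    uint32_t max = 0;
    for (size_t i = 0; i < len; i++) {
        if (cps[i] > max)
            max = cps[i];
    }
    if (max < 0x100)
        return KIND_1BYTE;   /* fits in one byte (Latin-1 range) */
    if (max < 0x10000)
        return KIND_2BYTE;   /* fits in two bytes (BMP) */
    return KIND_4BYTE;       /* needs the full four bytes */
}

/* Read code point i from a buffer whose element width is `kind` bytes.
   The caller must know the kind, which is what ties such an accessor
   to one particular internal representation. */
uint32_t read_code_point(enum str_kind kind, const void *data, size_t i)
{
    assert(data != NULL);
    switch (kind) {
    case KIND_1BYTE: return ((const uint8_t *)data)[i];
    case KIND_2BYTE: return ((const uint16_t *)data)[i];
    default:         return ((const uint32_t *)data)[i];
    }
}
```

An implementation that stores UTF-8 has no such "kind" to report, so an accessor shaped like `read_code_point` has nothing natural to map onto there.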

On Fri, 7 Sep 2018 at 10:33, M.-A. Lemburg <mal@egenix.com> wrote:
...
I'm -1 on removing the PyUnicode APIs. We deliberately created a
useful and very complete C API for Unicode.
The fact that PyPy chose to use a different internal representation
is not a good reason to remove APIs and have CPython extensions take
the hit as a result. It would be better for PyPy to rethink the
internal representation or create a shim API which translates
between the two worlds.
Note that UTF-8 is not a good internal representation for Unicode
if you want fast indexing and slicing. This is why we are using
fixed code units to represent the Unicode strings.
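
The indexing argument above can be sketched in plain C. This is an illustration under stated assumptions, not code from CPython or PyPy: `ucs4_index` and `utf8_offset_of` are hypothetical helpers, the UTF-8 input is assumed to be valid and NUL-terminated, and no index cache or other optimization is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed-width storage: code point i is a single array access, O(1). */
uint32_t ucs4_index(const uint32_t *s, size_t i)
{
    assert(s != NULL);
    return s[i];
}

/* UTF-8 storage: find the byte offset of code point i by walking from
   the start and skipping continuation bytes (bit pattern 10xxxxxx).
   This is O(i). Assumes valid, NUL-terminated UTF-8 and that i does
   not exceed the number of code points in the string. */
size_t utf8_offset_of(const unsigned char *s, size_t i)
{
    assert(s != NULL);
    size_t off = 0;
    while (i > 0) {
        off++;                              /* step past the lead byte */
        while ((s[off] & 0xC0) == 0x80)     /* skip continuation bytes */
            off++;
        i--;
    }
    return off;
}
```

For the three code points "a", U+20AC and "b", the UTF-8 offsets are 0, 1 and 4, so the offset of a code point cannot be computed from its index without looking at the preceding bytes.
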
The PyUnicode C API is not only an issue for PyPy, it's also an issue
for CPython. When PEP 393 was implemented, most of the PyUnicode API
was suddenly deprecated outright: all functions using the now legacy
Py_UNICODE* type...
Python 3.7 still has to support both the legacy Py_UNICODE* API and
the new "compact string" API. It makes the CPython code base way more
complex than it should be: any function accepting a string is supposed
to call PyUnicode_Ready() and handle errors properly. I would prefer to
be able to remove the legacy PyUnicodeObject type, to only use compact
strings everywhere.
Let me elaborate on which PyUnicode functions are good and which are bad.
Examples of bad APIs:

PyUnicode_IS_COMPACT(): this API really relies on the *current* implementation
PyUnicode_2BYTE_DATA(): should only be used internally, there is no
need to export it
PyUnicode_READ()
Py_UNICODE_strcmp(): uses Py_UNICODE, which is an implementation detail

Good API:

PyUnicode_Concat(): C API for str + str
PyUnicode_Split()
PyUnicode_FindChar()

Borderline:

PyUnicode_IS_ASCII(op): it's an O(1) operation on CPython, but it can
be O(n) on other implementations (like PyPy which uses UTF-8). But we
also added str.isascii() in Python 3.7....
PyUnicode_READ_CHAR()
PyUnicode_CompareWithASCIIString(): the function name announces
ASCII but decodes the byte string from Latin1 :-)

Victor
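
Two of the borderline cases above can be modelled in plain C. This is only a sketch of the behaviour described in the message, not the CPython or PyPy implementation: `utf8_is_ascii` and `compare_with_latin1` are hypothetical names, and the second function assumes the string is available as an array of code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASCII check on a UTF-8 buffer when no flag is cached: every byte
   has to be inspected, so the cost is O(n). */
int utf8_is_ascii(const unsigned char *s, size_t nbytes)
{
    assert(s != NULL || nbytes == 0);
    for (size_t i = 0; i < nbytes; i++) {
        if (s[i] >= 0x80)
            return 0;
    }
    return 1;
}

/* Compare code points with a NUL-terminated byte string, reading each
   byte as a Latin-1 character (byte value == code point). This models
   the "announces ASCII but decodes from Latin1" remark.
   Returns -1, 0 or 1. */
int compare_with_latin1(const uint32_t *cps, size_t len, const char *bytes)
{
    assert(bytes != NULL);
    assert(cps != NULL || len == 0);
    const unsigned char *b = (const unsigned char *)bytes;
    size_t i = 0;
    for (; i < len && b[i] != 0; i++) {
        if (cps[i] != b[i])
            return cps[i] < b[i] ? -1 : 1;
    }
    if (i < len)
        return 1;    /* byte string ended first */
    if (b[i] != 0)
        return -1;   /* code point array ended first */
    return 0;
}
```

With this model, the code points of "café" compare equal to the bytes "caf\xE9", which is not an ASCII string at all.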