On 07.09.2018 10:22, Victor Stinner wrote:
I'm in discussion with PyPy developers, and they reported several APIs that cause them trouble: (...)
- according to them, almost all PyUnicode API functions have to go. PyPy3 uses UTF-8 internally, CPython uses "compact strings" (an array of Py_UCS1, Py_UCS2 or Py_UCS4, depending on the string content). https://pythoncapi.readthedocs.io/bad_api.html#pypy-requests
On Fri, Sep 7, 2018 at 10:33, M.-A. Lemburg <mal@egenix.com> wrote:
I'm -1 on removing the PyUnicode APIs. We deliberately created a useful and very complete C API for Unicode.
The fact that PyPy chose to use a different internal representation is not a good reason to remove APIs and have CPython extensions take the hit as a result. It would be better for PyPy to rethink the internal representation or create a shim API which translates between the two worlds.
Note that UTF-8 is not a good internal representation for Unicode if you want fast indexing and slicing. This is why we are using fixed code units to represent the Unicode strings.
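To make the indexing argument concrete, here is a minimal sketch in plain C. Both helpers (utf8_offset_of and ucs4_char_at) are hypothetical names made up for illustration; they are not part of the Python C API or of PyPy, and the sketch assumes valid, NUL-terminated UTF-8 input.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of any Python C API.
   Returns the byte offset of the code point at `index` in a valid,
   NUL-terminated UTF-8 buffer, or -1 if the index is out of range.
   Cost is O(n): every byte before the target has to be inspected. */
static ptrdiff_t utf8_offset_of(const unsigned char *s, size_t index)
{
    size_t seen = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        /* Continuation bytes have the form 10xxxxxx; they do not
           start a new code point, so skip them. */
        if ((s[i] & 0xC0) == 0x80)
            continue;
        if (seen == index)
            return (ptrdiff_t)i;
        seen++;
    }
    return -1;
}

/* With fixed-width code units, indexing is a single array access: O(1). */
static uint32_t ucs4_char_at(const uint32_t *s, size_t index)
{
    return s[index];
}
```

A UTF-8 based implementation can reduce this cost with an auxiliary index, but the naive lookup shown here has to walk the buffer.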
The PyUnicode C API is not only an issue for PyPy, it's also an issue for CPython. When PEP 393 was implemented, most of the PyUnicode API was suddenly deprecated: all functions using the now-legacy Py_UNICODE* type...
Python 3.7 still has to support both the legacy Py_UNICODE* API and the new "compact string" API. This makes the CPython code base way more complex than it should be: any function accepting a string is supposed to call PyUnicode_Ready() and handle errors properly. I would prefer to be able to remove the legacy PyUnicodeObject type and use only compact strings everywhere.
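For readers unfamiliar with the "compact string" idea, here is a rough sketch of how the code unit width can be chosen from the string content, using the PEP 393 thresholds. The function code_unit_width is a hypothetical name for illustration only; it is not a CPython function and it says nothing about the actual object layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a CPython function: returns the number of
   bytes per code unit (1, 2 or 4) needed to store every code point in
   `cp` with a fixed width, following the PEP 393 thresholds. */
static int code_unit_width(const uint32_t *cp, size_t len)
{
    uint32_t max = 0;
    for (size_t i = 0; i < len; i++) {
        if (cp[i] > max)
            max = cp[i];
    }
    if (max < 0x100)
        return 1;   /* every code point fits in Py_UCS1 (Latin-1 range) */
    if (max < 0x10000)
        return 2;   /* every code point fits in Py_UCS2 (BMP) */
    return 4;       /* at least one code point needs Py_UCS4 */
}
```

Because the width depends on the largest code point, the storage format is a property of each individual string, which is exactly the kind of detail that macros such as PyUnicode_2BYTE_DATA() expose to extensions.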
Let me elaborate on which PyUnicode functions are good and which are bad.
Examples of bad APIs:
- PyUnicode_IS_COMPACT(): this API really relies on the *current* implementation
- PyUnicode_2BYTE_DATA(): should only be used internally, there is no need to export it
- PyUnicode_READ()
- Py_UNICODE_strcmp(): uses Py_UNICODE, which is an implementation detail
Good APIs:
- PyUnicode_Concat(): C API for str + str
- PyUnicode_Split()
- PyUnicode_FindChar()
Borderline:
- PyUnicode_IS_ASCII(op): it's an O(1) operation on CPython, but it can be O(n) on other implementations (like PyPy, which uses UTF-8). But we also added str.isascii() in Python 3.7...
- PyUnicode_READ_CHAR()
- PyUnicode_CompareWithASCIIString(): the function name announces ASCII but decodes the byte string from Latin1 :-)
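To illustrate the PyUnicode_IS_ASCII() point: on a UTF-8 buffer, and assuming the answer is not cached anywhere, an ASCII check has to look at every byte. A minimal sketch in plain C follows; utf8_is_ascii is a hypothetical helper name, not a CPython or PyPy function.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of any Python C API.
   Returns true if all `len` bytes are below 0x80, i.e. the UTF-8
   buffer contains only ASCII. Cost is O(n) in the buffer length. */
static bool utf8_is_ascii(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        /* Any byte with the high bit set belongs to a multi-byte
           UTF-8 sequence, so the string is not pure ASCII. */
        if (s[i] & 0x80)
            return false;
    }
    return true;
}
```

CPython avoids this scan because the answer is recorded in the string object when it is created, which is why the macro is O(1) there.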
Victor