On Fri, Sep 7, 2018 at 14:25, Stefan Behnel <python_capi@behnel.de> wrote:
PyPy3 uses UTF-8 internally
Well, it's not like anyone asked them to do that. ;)
Please don't focus so much on PyPy. There are other implementations of Python, such as MicroPython, IronPython, RustPython, etc. Why should everybody mimic the *exact* implementation of CPython? What's the point of having a different implementation of Python if you *must* make exactly the same design choices as CPython?
I repeat, the C API is already a threat to *CPython*. (See what I wrote about all the Py_UNICODE* APIs.)
Python 3.7 still has to support both the legacy Py_UNICODE* API and the new "compact string" API. It makes the CPython code base way more complex than it should be: any function accepting a string is supposed to call PyUnicode_Ready() and handle errors properly. I would prefer to be able to remove the legacy PyUnicodeObject type and use only compact strings everywhere.
Since we control all paths that can lead to the creation of a Unicode object, why not make sure we always call the internal equivalent to PyType_Ready() before even returning it to the user, and replace the public PyType_Ready() function by a no-op macro?
It's PyUnicode_Ready(), not PyType_Ready() :-)
No, we don't control anything: PyUnicode_FromStringAndSize(NULL, size) remains accessible in the C API and I'm sure that *many* C extensions use it. This function creates a string using Py_UNICODE* internally. There are also PyUnicode_FromUnicode() and many other functions. We only recently started to deprecate them in the documentation. They are deprecated only in the documentation; Py_DEPRECATED() is not used in the C headers yet:
    PyAPI_FUNC(PyObject*) PyUnicode_FromUnicode(
        const Py_UNICODE *u,        /* Unicode buffer */
        Py_ssize_t size             /* size of buffer */
        ) /* Py_DEPRECATED(3.3) */;
That's why we must keep PyUnicode_Ready() in almost all functions accepting strings...
- PyUnicode_IS_COMPACT(): this API really relies on the *current* implementation
Is that function really used?
Used or not, it's currently part of the C API. I propose to kick it out of the public exported API, and really make it private. For example, just add a _ prefix: _PyUnicode_IS_COMPACT().
This function is heavily used internally in unicodeobject.c. I'm not sure why anyone would use it outside Python :-)
It's just one example of a public API which should not be public.
- PyUnicode_READ()
That should be easy to emulate with a different internal representation, though. Just use an arbitrary value for "kind" and ignore it on the way in.
This function doesn't make sense on Python 3.2 and older, on PyPy, etc. It should just be made private.
That reminds me that the current C-API lacks a way to calculate the size of the character buffer in bytes. You have to use "length * kind", which really relies on internals then. There should be a macro for this.
I don't think that it's a good idea to provide such a function, since the size of one character depends on the size of other characters in a string when using compact strings...
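To illustrate the point, here is a small Python model of how a compact string picks its per-character width. The helper names char_width() and buffer_size() are hypothetical and not part of any CPython API; the sketch only assumes the 1/2/4-byte rule that PEP 393 describes for CPython 3.3 and newer.

```python
def char_width(s: str) -> int:
    """Bytes per character a PEP 393 compact string would use for s (model only)."""
    # The widest code point in the string decides the width for *all* characters.
    max_cp = max(map(ord, s), default=0)
    if max_cp < 0x100:      # fits in Latin-1: 1 byte per character
        return 1
    if max_cp < 0x10000:    # fits in the BMP: 2 bytes per character
        return 2
    return 4                # anything above the BMP: 4 bytes per character


def buffer_size(s: str) -> int:
    """Size in bytes of the character buffer, i.e. "length * kind" (model only)."""
    return len(s) * char_width(s)


if __name__ == "__main__":
    # Same length, three different buffer sizes.
    for text in ("abc", "ab\u20ac", "ab\U0001F600"):
        print(ascii(text), char_width(text), buffer_size(text))
```

Replacing a single character changes the width of every other character in the string, so a "size in bytes" value is tied to this particular representation. An implementation that stores UTF-8 internally would give different numbers.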
- Py_UNICODE_strcmp(): uses Py_UNICODE, which is an implementation detail
But given that "Py_UNICODE" is already deprecated … :)
Well, "deprecated"...
Victor