[capi-sig] Re: PyUnicode C API

Sept. 7, 2018

On Fri, Sept. 7, 2018 at 14:25, Stefan Behnel <python_capi@behnel.de> wrote:
...
>> PyPy3 uses UTF-8 internally
>
> Well, it's not like anyone asked them to do that. ;)

Please don't be so focused on PyPy. There are other implementations of
Python like MicroPython, IronPython, RustPython, etc. Why should
everybody mimic the *exact* implementation of CPython? What's the
point of having a different implementation of Python if you *must* make
exactly the same design choices as CPython?

I repeat, the C API is already a threat to *CPython*. (See what I
wrote about all Py_UNICODE* APIs.)
...
>> Python 3.7 still has to support both the legacy Py_UNICODE* API and
>> the new "compact string" API. It makes the CPython code base way more
>> complex than it should be: any function accepting a string is supposed
>> to call PyUnicode_Ready() and handle errors properly. I would prefer to
>> be able to remove the legacy PyUnicodeObject type, to only use compact
>> strings everywhere.
>
> Since we control all paths that can lead to the creation of a Unicode
> object, why not make sure we always call the internal equivalent to
> PyType_Ready() before even returning it to the user, and replace the public
> PyType_Ready() function by a no-op macro?

It's PyUnicode_Ready(), not PyType_Ready() :-)

No, we don't control anything: PyUnicode_FromStringAndSize(NULL, size)
remains accessible in the C API and I'm sure that *many* C extensions
use it. This function creates a string using Py_UNICODE* internally.
There is also PyUnicode_FromUnicode() and many other functions. We
only started to deprecate them in the documentation recently. They
are only deprecated in the documentation; Py_DEPRECATED() is not used
in C headers yet:
    PyAPI_FUNC(PyObject*) PyUnicode_FromUnicode(
        const Py_UNICODE *u,        /* Unicode buffer */
        Py_ssize_t size             /* size of buffer */
        ) /* Py_DEPRECATED(3.3) */;
That's why we must keep PyUnicode_Ready() in almost all functions
accepting strings...
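
To illustrate why the legacy creation path forces a ready check into every consumer, here is a self-contained toy model in plain C. It does not use the CPython headers, and it is not how unicodeobject.c is written: `ToyStr`, `toy_from_legacy()`, `toy_ready()`, `toy_first_char()` and `toy_free()` are hypothetical names invented for this sketch, the layout is simplified, and surrogate pairs are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy model only: NOT the real PyUnicodeObject layout. */
typedef struct {
    wchar_t  *wstr;   /* legacy buffer, filled in by the caller after creation */
    uint32_t *data;   /* canonical buffer, valid only once ready != 0 */
    size_t    length; /* number of characters */
    int       ready;  /* 0 = legacy form, 1 = canonical form */
} ToyStr;

/* Models a legacy constructor: the object is handed out before its
   content is known, so it cannot be converted at creation time. */
ToyStr *toy_from_legacy(size_t size)
{
    ToyStr *s = calloc(1, sizeof(*s));
    if (s == NULL)
        return NULL;
    s->wstr = calloc(size + 1, sizeof(wchar_t));
    if (s->wstr == NULL) {
        free(s);
        return NULL;
    }
    s->length = size;
    return s;
}

/* Models the "ready" step: convert the legacy buffer to the canonical one.
   Returns 0 on success, -1 on allocation failure. */
int toy_ready(ToyStr *s)
{
    assert(s != NULL);
    if (s->ready)
        return 0;
    uint32_t *data = malloc((s->length + 1) * sizeof(uint32_t));
    if (data == NULL)
        return -1;
    for (size_t i = 0; i < s->length; i++)
        data[i] = (uint32_t)s->wstr[i];
    data[s->length] = 0;
    free(s->wstr);
    s->wstr = NULL;
    s->data = data;
    s->ready = 1;
    return 0;
}

/* Every function accepting a string has to start like this for as long
   as legacy objects can still be created. */
int toy_first_char(ToyStr *s, uint32_t *out)
{
    if (toy_ready(s) < 0)
        return -1;
    if (s->length == 0)
        return -1;
    *out = s->data[0];
    return 0;
}

void toy_free(ToyStr *s)
{
    if (s == NULL)
        return;
    free(s->wstr);
    free(s->data);
    free(s);
}
```

In this model the caller writes into `wstr` after `toy_from_legacy()` returns, so the conversion cannot happen inside the constructor; it has to be deferred to the first consumer, which is the situation described above.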
...

>> PyUnicode_IS_COMPACT(): this API really relies on the *current*
>> implementation
>
> Is that function really used?

Used or not, it's currently part of the C API. I propose to kick it
out of the public exported API, and really make it private. For
example, just add a _ prefix: _PyUnicode_IS_COMPACT().

This function is heavily used internally in unicodeobject.c. I'm not
sure why anyone would use it outside Python :-)

It's just one example of a public API which should not be public.
...

>> PyUnicode_READ()
>
> That should be easy to emulate with a different internal representation,
> though. Just use an arbitrary value for "kind" and ignore it on the way in.

This function doesn't make sense on Python 3.2 and older, on PyPy,
etc. It should just be made private.
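
To make the two positions concrete, here is a small standalone sketch in plain C (no CPython headers). `toy_read()` models a fixed-width read driven by a "kind" argument, in the spirit of PyUnicode_READ(kind, data, index); `toy_read_utf8()` models the emulation suggested in the quote, where the storage is UTF-8 and "kind" is ignored. All names are hypothetical, and the UTF-8 version assumes valid input and an in-range index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes per character, mirroring the 1/2/4 "kind" values of compact strings. */
enum toy_kind { TOY_1BYTE = 1, TOY_2BYTE = 2, TOY_4BYTE = 4 };

/* Fixed-width read: the caller must pass the kind matching the buffer. */
uint32_t toy_read(int kind, const void *data, size_t index)
{
    switch (kind) {
    case TOY_1BYTE:
        return ((const uint8_t *)data)[index];
    case TOY_2BYTE:
        return ((const uint16_t *)data)[index];
    default:
        assert(kind == TOY_4BYTE);
        return ((const uint32_t *)data)[index];
    }
}

/* Emulation on top of UTF-8 storage: "kind" is accepted and ignored.
   Finding the index-th code point means walking the buffer, so each
   read costs O(index) instead of O(1). */
uint32_t toy_read_utf8(int kind, const void *data, size_t index)
{
    const uint8_t *p = data;
    (void)kind;

    /* Skip `index` code points: one lead byte plus its continuation bytes. */
    for (size_t i = 0; i < index; i++) {
        p++;
        while ((*p & 0xC0) == 0x80)
            p++;
    }

    /* Decode the code point starting at p. */
    uint32_t c = *p;
    int extra;
    if (c < 0x80) {
        extra = 0;
    } else if ((c & 0xE0) == 0xC0) {
        c &= 0x1F;
        extra = 1;
    } else if ((c & 0xF0) == 0xE0) {
        c &= 0x0F;
        extra = 2;
    } else {
        c &= 0x07;
        extra = 3;
    }
    while (extra-- > 0)
        c = (c << 6) | (uint32_t)(*++p & 0x3F);
    return c;
}
```

The emulation is possible, but the signature still exposes a "kind" that means nothing for that representation, and indexed reads stop being constant time.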
...
> That reminds me that the current C-API lacks a way to calculate the size of
> the character buffer in bytes. You have to use "length * kind", which
> really relies on internals then. There should be a macro for this.

I don't think that it's a good idea to provide such a function, since
the size of one character depends on the size of other characters in a
string when using compact strings...
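
A minimal standalone sketch of that dependency, in plain C without the CPython headers: the per-character width is chosen from the largest code point in the whole string, so "length * kind" changes for every character as soon as one wide character appears. `toy_kind_for()` and `toy_buffer_bytes()` are hypothetical helpers written for this illustration, not existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest width (1, 2 or 4 bytes) able to hold every code point of the string. */
int toy_kind_for(const uint32_t *cps, size_t length)
{
    assert(cps != NULL || length == 0);
    uint32_t max = 0;
    for (size_t i = 0; i < length; i++) {
        if (cps[i] > max)
            max = cps[i];
    }
    if (max < 0x100)
        return 1;
    if (max < 0x10000)
        return 2;
    return 4;
}

/* The "length * kind" computation from the quote above. */
size_t toy_buffer_bytes(const uint32_t *cps, size_t length)
{
    return length * (size_t)toy_kind_for(cps, length);
}
```

With three characters, this gives 3 bytes for "abc", 6 bytes once the last character is U+20AC, and 12 bytes once it is U+1F600, although the first two characters never changed.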
...

>> Py_UNICODE_strcmp(): uses Py_UNICODE which is an implementation detail
>
> But given that "Py_UNICODE" is already deprecated … :)

Well, "deprecated"...

Victor

Victor Stinner