[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

June 30, 2020

On Tue, Jun 30, 2020 at 13:53, M.-A. Lemburg <mal@egenix.com> wrote:
...
...
>> I would prefer to analyze the list on a case by case basis. I don't
>> think that it's useful to expose every single encoding supported by
>> Python as a C function.
>> (...)
>
> This does not mean we have to give up the symmetry in the C API,
> or that the encoding APIs are now suddenly useless. It only means
> that we have to replace Py_UNICODE with one of the supported data
> types for storing Unicode.

Let's agree to disagree :-)

I don't think that completeness is a good rationale for designing the C API.

The C API is too large, we have to make it smaller. A specialized
function, like PyUnicode_AsUTF8String(), can be justified by different
reasons:

* It is a very common use case and so it helps to write C extensions
* It is significantly faster than the alternative generic function

In C, you can execute arbitrary Python code by calling methods on
Python str objects. For example, "abc".encode("utf-8",
"surrogateescape") in Python becomes PyObject_CallMethod(obj,
"encode", "ss", "utf-8", "surrogateescape") in C. Well, there is already
a more specialized, yet still generic, PyUnicode_AsEncodedObject() function.
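
To illustrate the point, here is a minimal sketch that drives the C API
from Python through ctypes.pythonapi, so it runs without compiling an
extension. It uses PyUnicode_AsEncodedString(), the generic encoder
discussed later in this message, and compares it with the str.encode()
method call. The sample string and the variable names are made up for
the example; the argtypes/restype declarations are my reading of the
documented prototype.

```python
import ctypes

# Assumed prototype, from the C API documentation:
#   PyObject *PyUnicode_AsEncodedString(PyObject *unicode,
#                                       const char *encoding,
#                                       const char *errors)
# It returns a new reference to a bytes object.
as_encoded = ctypes.pythonapi.PyUnicode_AsEncodedString
as_encoded.argtypes = [ctypes.py_object, ctypes.c_char_p, ctypes.c_char_p]
as_encoded.restype = ctypes.py_object

# Hypothetical sample: "abc" followed by a lone surrogate, the kind of
# character that surrogateescape decoding produces for an undecodable byte.
text = "abc\udcff"

# The generic route: call the method, which is what
# PyObject_CallMethod(obj, "encode", "ss", ...) does from C.
via_method = text.encode("utf-8", "surrogateescape")

# The specialized route: one C API function for every codec.
via_c_api = as_encoded(text, b"utf-8", b"surrogateescape")

print(via_method, via_c_api)
```

Both routes accept any codec name and any error handler, which is why
a per-codec C function adds little beyond speed.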

We must not add a C API function for every single Python feature,
otherwise it would be too expensive to maintain, and it would become
impossible for other Python implementations to implement the full C
API. Well, even today, PyPy already only implements a small subset of
the C API.
...
> Since the C world has adopted wchar_t for this purpose, it's the
> natural choice.

In my experience, in C extensions, there are two kinds of data:

* bytes is used as a "char*": array of bytes
* Unicode is used as a Python object

For the very rare cases involving wchar_t*, PyUnicode_FromWideChar()
can be used. I don't think that performance justifies duplicating
each function, once for a Python str object, once for wchar_t*. I
mostly saw code involving wchar_t* to initialize Python. But this code
was wrong since it used PyUnicode functions *before* Python was
initialized. That's bad and can now crash in recent Python versions.
The new PEP 587 has a different design and avoids Python objects and
anything related to the Python runtime:
https://docs.python.org/dev/c-api/init_config.html#c.PyConfig_SetString

Moreover, CPython implements functions taking a wchar_t* string by
calling PyUnicode_FromWideChar() internally...
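
As a rough sketch of that two-step route, again using ctypes.pythonapi
rather than a compiled extension: a wchar_t buffer is first turned into
a str with PyUnicode_FromWideChar(), then encoded with
PyUnicode_AsEncodedString(). The buffer content and the variable names
are invented for the example, and the prototypes in the comments are my
reading of the C API documentation.

```python
import ctypes

# Assumed prototypes, from the C API documentation:
#   PyObject *PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
#   PyObject *PyUnicode_AsEncodedString(PyObject *unicode,
#                                       const char *encoding,
#                                       const char *errors)
from_wide = ctypes.pythonapi.PyUnicode_FromWideChar
from_wide.argtypes = [ctypes.c_wchar_p, ctypes.c_ssize_t]
from_wide.restype = ctypes.py_object

as_encoded = ctypes.pythonapi.PyUnicode_AsEncodedString
as_encoded.argtypes = [ctypes.py_object, ctypes.c_char_p, ctypes.c_char_p]
as_encoded.restype = ctypes.py_object

# Stand-in for wchar_t* data owned by C code (hypothetical sample value).
wide_buffer = ctypes.create_unicode_buffer("h\u00e9llo")

# Step 1: wchar_t* -> str. A size of -1 asks the function to locate the
# terminating NUL itself.
text = from_wide(wide_buffer, -1)

# Step 2: str -> bytes with the generic encoder.
encoded = as_encoded(text, b"utf-8", b"strict")

print(repr(text), encoded)
```

The intermediate str object is the only extra cost compared with a
dedicated wchar_t* variant of each encoder.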
...
> PyUnicode_AsEncodedString() converts Unicode objects to a
> bytes object. This is not a symmetric replacement for the
> PyUnicode_Encode*() APIs, since those go from Py_UNICODE to
> a bytes object.

I don't see which feature is missing from PyUnicode_AsEncodedString().
If it's about parameters specific to some encodings like UTF-7, I
already replied in another email.
...
> Since the C API is not only meant to be used by the CPython interpreter,
> we should stick to standards rather than expecting the world to adapt
> to our implementations. This also makes the APIs future proof, e.g.
> in case we make another transition from the current hybrid internal
> data type for Unicode towards UTF-8 buffers as internal data type.

Do you know of C extensions in the wild which are using wchar_t* on
purpose? I haven't seen such a C extension yet.

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.

Victor Stinner