
On Tue, 30 Jun 2020 at 13:53, M.-A. Lemburg <mal@egenix.com> wrote:
I would prefer to analyze the list on a case by case basis. I don't think that it's useful to expose every single encoding supported by Python as a C function.
(...) This does not mean we have to give up the symmetry in the C API, or that the encoding APIs are now suddenly useless. It only means that we have to replace Py_UNICODE with one of the supported data types for storing Unicode.
Let's agree to disagree :-) I don't think that completeness is a good rationale for designing the C API. The C API is too large; we have to make it smaller.

A specialized function, like PyUnicode_AsUTF8String(), can be justified for different reasons:

* It covers a very common use case, and so it helps to write C extensions
* It is significantly faster than the alternative generic function

In C, you can execute arbitrary Python code by calling methods on Python str objects. For example, "abc".encode("utf-8", "surrogateescape") in Python becomes PyObject_CallMethod(obj, "encode", "ss", "utf-8", "surrogateescape") in C. Well, there is already a more specialized, yet still generic, PyUnicode_AsEncodedObject() function.

We must not add a C API function for every single Python feature; otherwise the C API would be too expensive to maintain, and it would become impossible for other Python implementations to implement the full C API. Even today, PyPy implements only a small subset of the C API.
Since the C world has adopted wchar_t for this purpose, it's the natural choice.
In my experience, in C extensions, there are two kinds of data:

* bytes, used as a "char*": an array of bytes
* Unicode, used as a Python object

For the very rare cases involving wchar_t*, PyUnicode_FromWideChar() can be used. I don't think that performance justifies duplicating each function, once for a Python str object and once for wchar_t*.

I mostly saw code involving wchar_t* used to initialize Python. But this code was wrong, since it called PyUnicode functions *before* Python was initialized. That's bad and can now crash in recent Python versions. The new PEP 587 has a different design and avoids Python objects and anything related to the Python runtime:
https://docs.python.org/dev/c-api/init_config.html#c.PyConfig_SetString

Moreover, CPython implements functions taking wchar_t* strings by calling PyUnicode_FromWideChar() internally...
PyUnicode_AsEncodedString() converts Unicode objects to a bytes object. This is not a symmetric replacement for the PyUnicode_Encode*() APIs, since those go from Py_UNICODE to a bytes object.
I don't see which feature is missing from PyUnicode_AsEncodedString(). If it's about parameters specific to some encodings like UTF-7, I already replied in another email.
Since the C API is not only meant to be used by the CPython interpreter, we should stick to standards rather than expecting the world to adapt to our implementations. This also makes the APIs future proof, e.g. in case we make another transition from the current hybrid internal data type for Unicode towards UTF-8 buffers as internal data type.
Do you know of any C extensions in the wild which use wchar_t* on purpose? I haven't seen such a C extension yet.

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.