
On 30.06.2020 15:17, Victor Stinner wrote:
On Tue, 30 Jun 2020 at 13:53, M.-A. Lemburg <mal@egenix.com> wrote:
I would prefer to analyze the list on a case by case basis. I don't think that it's useful to expose every single encoding supported by Python as a C function.
(...) This does not mean we have to give up the symmetry in the C API, or that the encoding APIs are now suddenly useless. It only means that we have to replace Py_UNICODE with one of the supported data types for storing Unicode.
Let's agree to disagree :-)
I don't think that completeness is a good rationale to design the C API.
Oh, if that's your opinion, then we definitely disagree :-) I strongly believe that the success of Python was in large part built on the fact that Python has a complete and easily usable C API. Without this, Python would never have convinced the "Python is slow" advocates that you can actually build fast applications in Python by using Python to orchestrate and integrate with low-level C libraries, and we'd be regarded as yet another Tcl.
The C API is too large, we have to make it smaller.
That's a different discussion, but I disagree with that perspective as well: we have to refactor parts of the Python C API to make it more consistent and remove hacks which developers sometimes added as helper functions without considering the big picture. The Unicode API has grown a lot of such helpers over the years and there's certainly room for improvement, but simply ripping things out is not always the right answer, especially not when you touch the very core of the design.
A specialized function, like PyUnicode_AsUTF8String(), can be justified by different reasons:
* It is a very common use case and so it helps to write C extensions
* It is significantly faster than the alternative generic function
In C, you can execute arbitrary Python code by calling methods on Python str objects. For example, "abc".encode("utf-8", "surrogateescape") in Python becomes PyObject_CallMethod(obj, "encode", "ss", "utf-8", "surrogateescape") in C. Well, there is already the more generic PyUnicode_AsEncodedObject() function.
You know as well as I do that the Python call mechanism is by far the slowest part of the Python C API. Telling developers to use it as the main way to run tasks which could be run much faster, more easily, and with less memory overhead or data copying by directly calling a simple C API is not a good way to advocate for a useful Python C API.
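To make the difference between the two paths concrete, here is a minimal sketch, CPython-only and using ctypes.pythonapi purely for illustration; in a real C extension you would call these functions directly from C:

```python
import ctypes

# CPython-only: ctypes.pythonapi exposes the C API of the running interpreter.
api = ctypes.pythonapi

# Specialized path: PyUnicode_AsUTF8String(PyObject *unicode) -> bytes
api.PyUnicode_AsUTF8String.restype = ctypes.py_object
api.PyUnicode_AsUTF8String.argtypes = [ctypes.py_object]

text = "abc"
fast = api.PyUnicode_AsUTF8String(text)            # direct C call, no Python-level dispatch
generic = text.encode("utf-8", "surrogateescape")  # what PyObject_CallMethod(obj, "encode", ...) ends up doing

assert fast == generic == b"abc"
```

Both paths produce the same bytes object; the difference is that the generic path goes through attribute lookup and the full Python call machinery, while the specialized function is a single C call.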
We must not add a C API function for every single Python feature, otherwise it would be too expensive to maintain, and it would become impossible for other Python implementations to fully implement the C API. Well, even today, PyPy only implements a small subset of the C API.
I honestly don't think that other Python implementations should even try to implement the Python C API. Instead, they should build a bridge to use the CPython runtime and integrate this into their system.
Since the C world has adopted wchar_t for this purpose, it's the natural choice.
In my experience, in C extensions, there are two kind of data:
* bytes is used as a "char*": array of bytes
* Unicode is used as a Python object
Uhm, what about all those applications, libraries and OS calls producing Unicode data ? It is not always feasible or even desired to first convert this into a Python Unicode object.
For the very rare cases involving wchar_t*, PyUnicode_FromWideChar() can be used. I don't think that performance justifies duplicating each function, once for a Python str object and once for wchar_t*. I mostly saw code involving wchar_t* to initialize Python. But this code was wrong, since it used PyUnicode functions *before* Python was initialized. That's bad and can now crash in recent Python versions.
But that's an entirely unrelated issue, right? The C library has full support for wchar_t and provides plenty of APIs for using it. The main() invocation is just one small part of the libc Unicode system.
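For reference, the wchar_t* -> Python str path under discussion can be sketched as follows; again a CPython-only illustration via ctypes.pythonapi, where a C extension would call PyUnicode_FromWideChar() directly:

```python
import ctypes

api = ctypes.pythonapi

# PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size) -> str
api.PyUnicode_FromWideChar.restype = ctypes.py_object
api.PyUnicode_FromWideChar.argtypes = [ctypes.c_wchar_p, ctypes.c_ssize_t]

# A wchar_t[] buffer, as a libc or OS API might hand back.
buf = ctypes.create_unicode_buffer("héllo")
obj = api.PyUnicode_FromWideChar(buf, 5)
assert obj == "héllo"

# The platform-dependent width of wchar_t: typically 4 bytes on Linux/glibc
# (UCS-4) and 2 bytes on Windows (UTF-16 code units).
print(ctypes.sizeof(ctypes.c_wchar))
```

Note that the wchar_t width printed at the end is exactly the platform dependence mentioned later in this thread: glibc made wchar_t a 4-byte UCS4 type, while Windows uses 2-byte units.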
The new PEP 587 has a different design and avoids Python objects and anything related to the Python runtime: https://docs.python.org/dev/c-api/init_config.html#c.PyConfig_SetString
Moreover, CPython implements functions taking wchar_t* string by calling PyUnicode_FromWideChar() internally...
I mentioned wchar_t as a buffer input replacement for the PyUnicode_Encode*() APIs, as an alternative to the deprecated Py_UNICODE. Of course, you can convert all wchar_t data into a Python Unicode object first and then apply operations on this, but the point of the encode APIs is to have low-level access to the Python codecs which works directly on a data buffer - not a Unicode object. Again, with the main intent of avoiding the unnecessary copying, scanning, preparing, etc. of data that is needed for PyUnicode_FromWideChar().
PyUnicode_AsEncodedString() converts Unicode objects to a bytes object. This is not a symmetric replacement for the PyUnicode_Encode*() APIs, since those go from a Py_UNICODE buffer to a bytes object.
I don't see which feature is missing from PyUnicode_AsEncodedString(). If it's about parameters specific to some encodings like UTF-7, I already replied in another email.
The symmetry is about buffer -> Python object. Decoding takes a byte stream data buffer and converts it into a Python Unicode object. Encoding takes a Unicode data buffer and converts it into a Python bytes object. There's nothing missing in PyUnicode_AsEncodedString() (except perhaps some extra encoding parameters), but it's not a proper replacement for the buffer -> Python object APIs I'm talking about.
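The asymmetry described above can be sketched like this (another CPython-only illustration via ctypes.pythonapi): PyUnicode_Decode() consumes a raw char* buffer directly, while its encoding counterpart PyUnicode_AsEncodedString() requires an existing Python str object as input:

```python
import ctypes

api = ctypes.pythonapi

# buffer -> object: PyUnicode_Decode(const char *s, Py_ssize_t size,
#                                    const char *encoding, const char *errors)
api.PyUnicode_Decode.restype = ctypes.py_object
api.PyUnicode_Decode.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t,
                                 ctypes.c_char_p, ctypes.c_char_p]

# object -> object: PyUnicode_AsEncodedString(PyObject *unicode,
#                                             const char *encoding, const char *errors)
api.PyUnicode_AsEncodedString.restype = ctypes.py_object
api.PyUnicode_AsEncodedString.argtypes = [ctypes.py_object,
                                          ctypes.c_char_p, ctypes.c_char_p]

raw = b"caf\xe9"  # a raw latin-1 byte buffer, no Python object needed on input
s = api.PyUnicode_Decode(raw, len(raw), b"latin-1", b"strict")
assert s == "café"

# Encoding has no buffer-taking counterpart: it starts from a str object.
b = api.PyUnicode_AsEncodedString(s, b"utf-8", b"strict")
assert b == "café".encode("utf-8")
```

The decode direction accepts any C buffer; going the other way, the data must first be wrapped in a Python str object, which is the buffer -> object gap being discussed.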
Since the C API is not only meant to be used by the CPython interpreter, we should stick to standards rather than expecting the world to adapt to our implementations. This also makes the APIs future proof, e.g. in case we make another transition from the current hybrid internal data type for Unicode towards UTF-8 buffers as internal data type.
Do you know C extensions in the wild which are using wchar_t* on purpose? I haven't seen such a C extension yet.
Yes, of course. Any library which supports standards will have to deal with wchar_t, since it is the standard :-) Whether wchar_t and its representations on various platforms are a good choice is a different discussion (and one we had many, many times in the past). The main reason for Python to adopt UCS4 was that the Linux glibc used it for wchar_t.

Cheers,

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts
Python Projects, Coaching and Consulting ... http://www.egenix.com/
Python Database Interfaces ... http://products.egenix.com/
Plone/Zope Database Interfaces ... http://zope.egenix.com/

::: We implement business ideas - efficiently in both time and costs :::

eGenix.com Software, Skills and Services GmbH
Pastor-Loeh-Str. 48, D-40764 Langenfeld, Germany
CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/
http://www.malemburg.com/