
On 30.06.2020 15:17, Victor Stinner wrote:
On Tue, 30 Jun 2020 at 13:53, M.-A. Lemburg <mal@egenix.com> wrote:
I would prefer to analyze the list on a case by case basis. I don't think that it's useful to expose every single encoding supported by Python as a C function.
(...) This does not mean we have to give up the symmetry in the C API, or that the encoding APIs are now suddenly useless. It only means that we have to replace Py_UNICODE with one of the supported data types for storing Unicode.
Let's agree to disagree :-)
I don't think that completeness is a good rationale to design the C API.
Oh, if that's your opinion, then we definitely disagree :-) I strongly believe that the success of Python was in large part built on the fact that Python has a complete and easily usable C API. Without this, Python would never have convinced the "Python is slow" advocates that you can actually build fast applications in Python by using Python to orchestrate and integrate with low-level C libraries, and we'd be regarded as yet another Tcl.
The C API is too large, we have to make it smaller.
That's a different discussion, but I disagree with that perspective as well: we have to refactor parts of the Python C API to make it more consistent and remove hacks which developers sometimes added as helper functions without considering the big picture. The Unicode API has grown a lot of such helpers over the years and there's certainly room for improvement, but simply ripping things out is not always the right answer, especially not when you touch the very core of the design.
A specialized function, like PyUnicode_AsUTF8String(), can be justified by different reasons:
* It is a very common use case and so it helps to write C extensions
* It is significantly faster than the alternative generic function
In C, you can execute arbitrary Python code by calling methods on Python str objects. For example, "abc".encode("utf-8", "surrogateescape") in Python becomes PyObject_CallMethod(obj, "encode", "ss", "utf-8", "surrogateescape") in C. Well, there is already the more generic PyUnicode_AsEncodedObject() function.
You know as well as I do that the Python call mechanism is by far the slowest part of the Python C API. Telling developers to use it as the main way to run tasks which could be run much faster, more easily, and with less memory overhead or data copying by directly calling a simple C API is not a good way to advocate for a useful Python C API.
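To make the difference between the two paths concrete, here is a minimal sketch, CPython-only and using ctypes.pythonapi purely for illustration; in a real C extension you would call these functions directly from C:

```python
import ctypes

# CPython-only: ctypes.pythonapi exposes the C API of the running interpreter.
api = ctypes.pythonapi

# Specialized path: PyUnicode_AsUTF8String(PyObject *unicode) -> bytes
api.PyUnicode_AsUTF8String.restype = ctypes.py_object
api.PyUnicode_AsUTF8String.argtypes = [ctypes.py_object]

text = "abc"
fast = api.PyUnicode_AsUTF8String(text)            # direct C call, no Python-level dispatch
generic = text.encode("utf-8", "surrogateescape")  # what PyObject_CallMethod(obj, "encode", ...) ends up doing

assert fast == generic == b"abc"
```

Both paths produce the same bytes object; the difference is that the generic path goes through attribute lookup and the full Python call machinery, while the specialized function is a single C call.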
We must not add a C API function for every single Python feature, otherwise it would be too expensive to maintain, and it would become impossible for other Python implementations to fully implement the C API. Well, even today, PyPy only implements a small subset of the C API.
I honestly don't think that other Python implementations should even try to implement the Python C API. Instead, they should build a bridge to use the CPython runtime and integrate this into their system.
Since the C world has adopted wchar_t for this purpose, it's the natural choice.
In my experience, in C extensions, there are two kind of data:
* bytes is used as a "char*": array of bytes
* Unicode is used as a Python object
Uhm, what about all those applications, libraries and OS calls producing Unicode data ? It is not always feasible or even desired to first convert this into a Python Unicode object.
For the very rare cases involving wchar_t*, PyUnicode_FromWideChar() can be used. I don't think that performance justifies duplicating each function, once for a Python str object and once for wchar_t*. I mostly saw code involving wchar_t* to initialize Python. But this code was wrong, since it used PyUnicode functions *before* Python was initialized. That's bad and can now crash in recent Python versions.
But that's an entirely unrelated issue, right? The C library has full support for wchar_t and provides plenty of APIs for using it. The main() invocation is just one small part of the libc Unicode system.
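For reference, the wchar_t* -> Python str path under discussion can be sketched as follows; again a CPython-only illustration via ctypes.pythonapi, where a C extension would call PyUnicode_FromWideChar() directly:

```python
import ctypes

api = ctypes.pythonapi

# PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size) -> str
api.PyUnicode_FromWideChar.restype = ctypes.py_object
api.PyUnicode_FromWideChar.argtypes = [ctypes.c_wchar_p, ctypes.c_ssize_t]

# A wchar_t[] buffer, as a libc or OS API might hand back.
buf = ctypes.create_unicode_buffer("héllo")
obj = api.PyUnicode_FromWideChar(buf, 5)
assert obj == "héllo"

# The platform-dependent width of wchar_t: typically 4 bytes on Linux/glibc
# (UCS-4) and 2 bytes on Windows (UTF-16 code units).
print(ctypes.sizeof(ctypes.c_wchar))
```

Note that the wchar_t width printed at the end is exactly the platform dependence mentioned later in this thread: glibc made wchar_t a 4-byte UCS4 type, while Windows uses 2-byte units.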
The new PEP 587 has a different design and avoids Python objects and anything related to the Python runtime: https://docs.python.org/dev/c-api/init_config.html#c.PyConfig_SetString
Moreover, CPython implements functions taking wchar_t* string by calling PyUnicode_FromWideChar() internally...
I mentioned wchar_t as a buffer input replacement for the PyUnicode_Encode*() APIs, as an alternative to the deprecated Py_UNICODE. Of course, you can convert all wchar_t data into a Python Unicode object first and then apply operations on this, but the point of the encode APIs is to have low-level access to the Python codecs which works directly on a data buffer - not a Unicode object. Again, with the main intent of avoiding the unnecessary copying, scanning, preparing, etc. of data that is needed for PyUnicode_FromWideChar().
PyUnicode_AsEncodedString() converts Unicode objects to a bytes object. This is not a symmetric replacement for the PyUnicode_Encode*() APIs, since those go from a Py_UNICODE buffer to a bytes object.
I don't see which feature is missing from PyUnicode_AsEncodedString(). If it's about parameters specific to some encodings like UTF-7, I already replied in another email.
The symmetry is about buffer -> Python object. Decoding takes a byte stream data buffer and converts it into a Python Unicode object. Encoding takes a Unicode data buffer and converts it into a Python bytes object. There's nothing missing in PyUnicode_AsEncodedString() (except perhaps some extra encoding parameters), but it's not a proper replacement for the buffer -> Python object APIs I'm talking about.
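The asymmetry described above can be sketched like this (another CPython-only illustration via ctypes.pythonapi): PyUnicode_Decode() consumes a raw char* buffer directly, while its encoding counterpart PyUnicode_AsEncodedString() requires an existing Python str object as input:

```python
import ctypes

api = ctypes.pythonapi

# buffer -> object: PyUnicode_Decode(const char *s, Py_ssize_t size,
#                                    const char *encoding, const char *errors)
api.PyUnicode_Decode.restype = ctypes.py_object
api.PyUnicode_Decode.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t,
                                 ctypes.c_char_p, ctypes.c_char_p]

# object -> object: PyUnicode_AsEncodedString(PyObject *unicode,
#                                             const char *encoding, const char *errors)
api.PyUnicode_AsEncodedString.restype = ctypes.py_object
api.PyUnicode_AsEncodedString.argtypes = [ctypes.py_object,
                                          ctypes.c_char_p, ctypes.c_char_p]

raw = b"caf\xe9"  # a raw latin-1 byte buffer, no Python object needed on input
s = api.PyUnicode_Decode(raw, len(raw), b"latin-1", b"strict")
assert s == "café"

# Encoding has no buffer-taking counterpart: it starts from a str object.
b = api.PyUnicode_AsEncodedString(s, b"utf-8", b"strict")
assert b == "café".encode("utf-8")
```

The decode direction accepts any C buffer; going the other way, the data must first be wrapped in a Python str object, which is the buffer -> object gap being discussed.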
Since the C API is not only meant to be used by the CPython interpreter, we should stick to standards rather than expecting the world to adapt to our implementations. This also makes the APIs future proof, e.g. in case we make another transition from the current hybrid internal data type for Unicode towards UTF-8 buffers as internal data type.
Do you know C extensions in the wild which are using wchar_t* on purpose? I haven't seen such a C extension yet.
Yes, of course. Any library which supports standards will have to deal with wchar_t, since it is the standard :-) Whether wchar_t and its representations on various platforms are a good choice is a different discussion (and one we had many, many times in the past). The main reason for Python to adopt UCS4 was that the Linux glibc used it for wchar_t.

Cheers,

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts
Python Projects, Coaching and Consulting ... http://www.egenix.com/
Python Database Interfaces ... http://products.egenix.com/
Plone/Zope Database Interfaces ... http://zope.egenix.com/

::: We implement business ideas - efficiently in both time and costs :::

eGenix.com Software, Skills and Services GmbH
Pastor-Loeh-Str. 48, D-40764 Langenfeld, Germany
CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/
http://www.malemburg.com/