[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

July 1, 2020

      On 28.06.2020 16:24, Inada Naoki wrote:
...
Hi, Lamburg.
Thank you for quick response.
...
We can't just remove access to one half of a codec (the decoding
part) without at least providing an alternative for C extensions
to use.
Py_UNICODE can be removed from the API, but only if there are
alternative APIs which C extensions can use to the same effect.
Given PEP 393, this would be APIs which use wchar_t instead of
Py_UNICODE.
Decoding part is implemented as `const char *` -> `PyObject*` (Unicode object).
I think this is reasonable since `const char *` is perfect to abstract
the encoded string,
In case of encoding part, `wchar_t *` is not perfect abstraction for
(decoded) unicode
string.
Note that the PyUnicode_Encode*() APIs are meant to be make the
codec's encoding machinery available to C extensions, so that they
don't have to implement this again.

In that sense, their purpose is not to encode an existing Unicode
object, but instead, to provide access to the low-level buffer to
bytes object encoding.

The reasoning here is the same as for decoding: you have the original
data you want to process available in some array and want to turn
this into the Python object.

The path Victor suggested requires always going via a Python Unicode
object, but that it very expensive and not really an appropriate
way to address the use case.

As an example application, think of a database module which provides
the Unicode data as Py_UNICODE buffer. You want to write this as UTF-8
data to a file or a socket, so you have the PyUnicode_EncodeUTF8() API
decode this for you into a bytes object which you can then write out
using the Python C APIs for this.
...
Converting from Unicode object into `wchar_t *` is not zero-cost.
I think `PyObject *` (Unicode object) -> `PyObject *` (bytes object)
looks better signature than
`wchar_t *` -> `Pyobject *` (bytes object) because for encoders.
See above. The motivation for these APIs is different. They are
not about taking a Unicode object and converting it into bytes,
they are deliberately about taking a data buffer as input and
producing the Python bytes object as output (to implement symmetry
between decoding and encoding).
...
* Unicode object is more important than `wchar_t *` in Python.
Right, but as I tried to explain in my reply to Victor, I designed
the Unicode API in Python to be a rich API, which provides all
necessary tools to easily work with Unicode in C extensions as
well as in the CPython interpreter.

The API is not only focused on what the CPython interpreter needs.
It's an API which implements a concise interface to Unicode as
used in Python.
...
* All PyUnicode_EncodeXXX APIs are implemented with PyUnicode_FromWideChar.
For example, we have these private encode APIs:
* PyObject* _PyUnicode_AsAsciiString(PyObject *unicode, const char *errors)
* PyObject* _PyUnicode_AsLatin1String(PyObject *unicode, const char *errors)
* PyObject* _PyUnicode_AsUTF8String(PyObject *unicode, const char *errors)
* PyObject* _PyUnicode_EncodeUTF16(PyObject *unicode, const char
*errors, int byteorder)
...
So how about making them public, instead of undeprecate Py_UNICODE* encode APIs?
I'd be fine with keeping just a generic PyUnicode_Encode() API,
but this should then be encoding from a buffer to a bytes object.

The above all take Unicode objects as input and create the same
problem as I described above, with the temporary Unicode object being
created and all the associated malloc and scanning overhead needed
for this.

The reason I mention wchar_t as new basis for the PyUnicde_Encode()
API is that whcar_t has grown to be accepted as the standard for
Unicode buffers in C. If you don't believe that this is good enough,
we could also force Py_UCS4, but this would alienate Windows extension
writers.
...
1. Add PyUnicode_AsXXXBytes public APIs in Python 3.10.
   Current private APIs can become macro (e.g. #define
_PyUnicode_AsAsciiString PyUnicode_AsAsciiBytes),
   or deprecated static inline function.
2. Remove Py_UNICODE* encode APIs in Python 3.12.
FWIW: I don't object to deprecating Py_UNICODE. I just don't
want to lose the symmetry in decoding/encoding and add the cost
of having to go via a Python Unicode object just to decode
to bytes.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts
...
...
...
Python Projects, Coaching and Consulting ...  http://www.egenix.com/
Python Database Interfaces ...           http://products.egenix.com/
Plone/Zope Database Interfaces ...           http://zope.egenix.com/

::: We implement business ideas - efficiently in both time and costs :::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/
                      http://www.malemburg.com/

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

M.-A. Lemburg