
On 28.06.2020 16:24, Inada Naoki wrote:
Hi, Lemburg.
Thank you for the quick response.
We can't just remove access to one half of a codec (the decoding part) without at least providing an alternative for C extensions to use.
Py_UNICODE can be removed from the API, but only if there are alternative APIs which C extensions can use to the same effect.
Given PEP 393, this would be APIs which use wchar_t instead of Py_UNICODE.
The decoding part is implemented as `const char *` -> `PyObject *` (Unicode object). I think this is reasonable, since `const char *` is a perfect abstraction for the encoded string.
For the encoding part, `wchar_t *` is not a perfect abstraction for a (decoded) Unicode string.
Note that the PyUnicode_Encode*() APIs are meant to make the codec's encoding machinery available to C extensions, so that they don't have to implement this again. In that sense, their purpose is not to encode an existing Unicode object, but instead to provide access to the low-level buffer-to-bytes-object encoding.

The reasoning here is the same as for decoding: you have the original data you want to process available in some array and want to turn this into a Python object. The path Victor suggested requires always going via a Python Unicode object, but that is very expensive and not really an appropriate way to address the use case.

As an example application, think of a database module which provides the Unicode data as a Py_UNICODE buffer. You want to write this as UTF-8 data to a file or a socket, so you have the PyUnicode_EncodeUTF8() API encode this for you into a bytes object, which you can then write out using the Python C APIs for this.
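To make the "buffer in, bytes out" shape of the database example concrete, here is a minimal standalone sketch in plain C. It is not CPython's implementation and does not use the Python C API: `encode_utf8` is a hypothetical helper, `uint32_t` stands in for a Py_UCS4-style code point buffer, and the output goes into a caller-supplied byte array rather than a Python bytes object. The point is only that the input is a raw buffer and no intermediate string object is created.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (an assumption for illustration, not a CPython API):
 * encode `len` code points from `src` as UTF-8 into `dst`, which has room
 * for `cap` bytes. Returns the number of bytes written, or (size_t)-1 if a
 * code point is invalid (surrogate or above U+10FFFF) or `dst` is too small. */
size_t encode_utf8(const uint32_t *src, size_t len,
                   unsigned char *dst, size_t cap)
{
    size_t n = 0;  /* bytes written so far; always <= cap */

    for (size_t i = 0; i < len; i++) {
        uint32_t c = src[i];

        if (c < 0x80) {                       /* 1 byte: ASCII */
            if (cap - n < 1) return (size_t)-1;
            dst[n++] = (unsigned char)c;
        } else if (c < 0x800) {               /* 2 bytes */
            if (cap - n < 2) return (size_t)-1;
            dst[n++] = (unsigned char)(0xC0 | (c >> 6));
            dst[n++] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {             /* 3 bytes, surrogates rejected */
            if (c >= 0xD800 && c <= 0xDFFF) return (size_t)-1;
            if (cap - n < 3) return (size_t)-1;
            dst[n++] = (unsigned char)(0xE0 | (c >> 12));
            dst[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (c <= 0x10FFFF) {           /* 4 bytes */
            if (cap - n < 4) return (size_t)-1;
            dst[n++] = (unsigned char)(0xF0 | (c >> 18));
            dst[n++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            return (size_t)-1;                /* beyond the Unicode range */
        }
    }
    return n;
}
```

A caller holding a code point array from, say, a database driver would pass it straight to such a function and write the resulting bytes to a file or socket. The alternative discussed in this thread would first build a Unicode object from the buffer and then encode that object.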
Converting from a Unicode object into `wchar_t *` is not zero-cost. I think `PyObject *` (Unicode object) -> `PyObject *` (bytes object) is a better signature for encoders than `wchar_t *` -> `PyObject *` (bytes object), because:
See above. The motivation for these APIs is different. They are not about taking a Unicode object and converting it into bytes, they are deliberately about taking a data buffer as input and producing the Python bytes object as output (to implement symmetry between decoding and encoding).
* Unicode object is more important than `wchar_t *` in Python.
Right, but as I tried to explain in my reply to Victor, I designed the Unicode API in Python to be a rich API, which provides all necessary tools to easily work with Unicode in C extensions as well as in the CPython interpreter. The API is not only focused on what the CPython interpreter needs. It's an API which implements a concise interface to Unicode as used in Python.
* All PyUnicode_EncodeXXX APIs are implemented with PyUnicode_FromWideChar.
For example, we have these private encode APIs:
* PyObject* _PyUnicode_AsAsciiString(PyObject *unicode, const char *errors)
* PyObject* _PyUnicode_AsLatin1String(PyObject *unicode, const char *errors)
* PyObject* _PyUnicode_AsUTF8String(PyObject *unicode, const char *errors)
* PyObject* _PyUnicode_EncodeUTF16(PyObject *unicode, const char *errors, int byteorder)
...
So how about making them public, instead of undeprecating the Py_UNICODE* encode APIs?
I'd be fine with keeping just a generic PyUnicode_Encode() API, but this should then be encoding from a buffer to a bytes object. The above all take Unicode objects as input and create the same problem as I described above, with the temporary Unicode object being created and all the associated malloc and scanning overhead needed for this. The reason I mention wchar_t as the new basis for the PyUnicode_Encode() API is that wchar_t has grown to be accepted as the standard for Unicode buffers in C. If you don't believe that this is good enough, we could also force Py_UCS4, but this would alienate Windows extension writers.
1. Add PyUnicode_AsXXXBytes public APIs in Python 3.10. The current private APIs can become macros (e.g. #define _PyUnicode_AsAsciiString PyUnicode_AsAsciiBytes) or deprecated static inline functions.
2. Remove the Py_UNICODE* encode APIs in Python 3.12.
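The two compatibility options named in step 1 (a macro alias or a static inline wrapper) could look roughly like the following sketch. All names here are hypothetical stand-ins written in plain C, not the actual CPython functions, and the stand-in "encoder" only counts bytes so that the example stays self-contained.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for a newly public function. The name and signature are
 * assumptions for illustration only. */
size_t new_public_encode(const char *text)
{
    return strlen(text);
}

/* Option 1: the old name becomes a macro alias. Existing callers compile
 * unchanged and end up calling the new function directly. */
#define old_private_encode new_public_encode

/* Option 2: the old name stays a real (inline) function that forwards to
 * the new one. A real header would probably also mark it as deprecated,
 * for example with CPython's Py_DEPRECATED() macro, so callers get a
 * compiler warning. That marker is left out here to keep the sketch
 * portable. */
static inline size_t old_private_encode_inline(const char *text)
{
    return new_public_encode(text);
}
```

The macro costs nothing and needs no extra symbol, but it cannot carry a deprecation warning. The inline wrapper can, at the price of a little more header code.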
FWIW: I don't object to deprecating Py_UNICODE. I just don't want to lose the symmetry in decoding/encoding and add the cost of having to go via a Python Unicode object just to encode to bytes.

Thanks,
--
Marc-Andre Lemburg
eGenix.com