
On Thu, Jul 2, 2020 at 5:20 AM M.-A. Lemburg <mal@egenix.com> wrote:
> The reasoning here is the same as for decoding: you have the original data you want to process available in some array and want to turn this into the Python object.
>
> The path Victor suggested requires always going via a Python Unicode object, but that is very expensive and not really an appropriate way to address the use case.
But the current PyUnicode_Encode* APIs call PyUnicode_FromWideChar internally, so they are not direct APIs anymore. Additionally, pyodbc, the only known user of the encoder APIs, did PyUnicode_EncodeUTF16(PyUnicode_AsUnicode(unicode), ...). That is very inefficient: Unicode object -> Py_UNICODE* -> Unicode object -> bytes object. And as many others have already said, most of the C world uses UTF-8 for Unicode representation in C, not wchar_t. So I don't want to undeprecate the current API.
> As an example application, think of a database module which provides the Unicode data as Py_UNICODE buffer.
Py_UNICODE is deprecated, so I assume you are talking about wchar_t.
> You want to write this as UTF-8 data to a file or a socket, so you have the PyUnicode_EncodeUTF8() API encode this for you into a bytes object which you can then write out using the Python C APIs for this.
PyUnicode_FromWideChar + PyUnicode_AsUTF8AndSize is better than PyUnicode_EncodeUTF8. PyUnicode_EncodeUTF8 allocates a temporary Unicode object anyway, so it needs to allocate a Unicode object *and* a char* buffer for the UTF-8 data. On the other hand, PyUnicode_AsUTF8AndSize can just expose the internal data when the string is plain ASCII. Since ASCII strings are very common, this is an effective optimization.

Regards,
--
Inada Naoki <songofacandy@gmail.com>