
On Thu, Jul 2, 2020 at 5:20 AM M.-A. Lemburg <mal@egenix.com> wrote:
> The reasoning here is the same as for decoding: you have the original data you want to process available in some array and want to turn this into the Python object.
>
> The path Victor suggested requires always going via a Python Unicode object, but that is very expensive and not really an appropriate way to address the use case.
But the current PyUnicode_Encode* APIs call PyUnicode_FromWideChar internally, so they are not direct APIs anymore. Additionally, pyodbc, the only known user of the encoder APIs, did PyUnicode_EncodeUTF16(PyUnicode_AsUnicode(unicode), ...). That is very inefficient: Unicode object -> Py_UNICODE* -> Unicode object -> bytes object. And as many others have already said, most of the C world uses UTF-8 for Unicode representation in C, not wchar_t. So I don't want to undeprecate the current API.
> As an example application, think of a database module which provides the Unicode data as Py_UNICODE buffer.
Py_UNICODE is deprecated, so I assume you are talking about wchar_t.
> You want to write this as UTF-8 data to a file or a socket, so you have the PyUnicode_EncodeUTF8() API encode this for you into a bytes object which you can then write out using the Python C APIs for this.
PyUnicode_FromWideChar + PyUnicode_AsUTF8AndSize is better than PyUnicode_EncodeUTF8. PyUnicode_EncodeUTF8 allocates a temporary Unicode object anyway, so it needs to allocate a Unicode object *and* a char* buffer for the UTF-8 data. On the other hand, PyUnicode_AsUTF8AndSize can just expose the internal data when the string is plain ASCII. Since ASCII strings are very common, this is an effective optimization.

Regards,
--
Inada Naoki <songofacandy@gmail.com>