
On Thu, Jul 9, 2020 at 5:46 AM M.-A. Lemburg <mal@egenix.com> wrote:
- the fact that the encode APIs encode from a Unicode buffer to a bytes object; this is an important fact, since the removal takes away access to this codec functionality for extensions
- PyUnicode_AsEncodedString() is not a proper alternative, since it requires creating a temporary PyUnicode object, which is inefficient and wastes memory
I wrote your points in the "Alternative Idea > Replace Py_UNICODE* with Py_UCS4*" section, which says: "User can encode UCS-4 string in C without creating Unicode object."

https://www.python.org/dev/peps/pep-0624/#replace-py-unicode-with-py-ucs4

Note that the current Py_UNICODE* encoder APIs already create temporary PyUnicode objects, so they are inefficient and waste memory today. Py_UNICODE* may be UTF-16 on some platforms (e.g. Windows), and the built-in codecs don't support UTF-16 input.
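To make "encode a UCS-4 string in C without creating a Unicode object" concrete, here is a minimal sketch of what such an encoder does for UTF-8. The function name ucs4_to_utf8 and its signature are hypothetical and are not part of the CPython C API or of the PEP; it only illustrates the idea of going straight from a UCS-4 buffer to bytes in a caller-supplied buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT a CPython API.
 * Encodes `len` UCS-4 code points from `in` as UTF-8 into `out`.
 * Returns the number of bytes written, or -1 if a code point is invalid
 * (a surrogate or above U+10FFFF) or `out_cap` is too small. */
static ptrdiff_t
ucs4_to_utf8(const uint32_t *in, size_t len,
             unsigned char *out, size_t out_cap)
{
    size_t n = 0;  /* bytes written so far; always <= out_cap */

    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < len; i++) {
        uint32_t c = in[i];

        if (c < 0x80) {                       /* 1 byte: ASCII */
            if (out_cap - n < 1) return -1;
            out[n++] = (unsigned char)c;
        }
        else if (c < 0x800) {                 /* 2 bytes */
            if (out_cap - n < 2) return -1;
            out[n++] = (unsigned char)(0xC0 | (c >> 6));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
        else if (c < 0x10000) {               /* 3 bytes */
            if (c >= 0xD800 && c <= 0xDFFF) return -1;  /* lone surrogate */
            if (out_cap - n < 3) return -1;
            out[n++] = (unsigned char)(0xE0 | (c >> 12));
            out[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
        else if (c <= 0x10FFFF) {             /* 4 bytes */
            if (out_cap - n < 4) return -1;
            out[n++] = (unsigned char)(0xF0 | (c >> 18));
            out[n++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            out[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
        else {
            return -1;                        /* outside the Unicode range */
        }
    }
    return (ptrdiff_t)n;
}
```

The sketch assumes one array element per code point, which is exactly what a UTF-16 Py_UNICODE* buffer does not guarantee: there, non-BMP characters arrive as surrogate pairs and would be rejected here.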
- the maintenance effect mentioned in the PEP does not really materialize, since the underlying functionality still exists in the codecs - only access to the functionality is removed
In the same section, I described the maintenance cost as follows:

* Other Python implementations may not have a built-in codec for UCS-4.
* If we change the Unicode internal representation to UTF-8, we need to keep UCS-4 support only for these APIs.
- keeping just the generic PyUnicode_Encode() API would be a compromise
- if we remove the codec-specific PyUnicode_Encode*() APIs, why are we still keeping the special PyUnicode_Decode*() APIs?
OK, I will add a "Discussions" section. (I don't like "FAQ" because some questions are important even if they are not "frequently" asked.)

Quick answer:

* They are stable ABI. (Py_UNICODE is excluded from the stable ABI.)
* Decoding from char* is a more common and generic use case than encoding from Py_UNICODE*.
* Other Python implementations using UTF-8 as their internal representation can implement them easily.

But I'm not opposed to removing them (especially for the minor UTF-7 codec). It is just out of scope for this PEP.
- the deprecations were just done because the Py_UNICODE data type was replaced by a hybrid type. Using this as an argument for removing functionality is not really good practice, when there are ways to continue exposing the functionality using other data types.
I hope the "Replace Py_UNICODE* with Py_UCS4*" section describes this.

Regards,
--
Inada Naoki <songofacandy@gmail.com>