
On 29.06.2020 11:57, Victor Stinner wrote:
On Sun, 28 June 2020 at 11:22, M.-A. Lemburg <mal@egenix.com> wrote:
as you may remember, I wasn't happy with the deprecations of the APIs in PEP 393, since there are no C API alternatives for the encoding APIs deprecated in the PEP, which allow direct encoding provided by these important codecs.
AFAIK, the situation hasn't changed since then.
I would prefer to analyze the list on a case by case basis. I don't think that it's useful to expose every single encoding supported by Python as a C function.
I designed the Unicode C API as a rich API, so that it's easy to use from C extensions and the interpreter as well. The main theme was to have a symmetric API for both encoding and decoding. The PEP now suggests removing the API on the basis of deprecating Py_UNICODE, which is a change in data type. This does not mean we have to give up the symmetry in the C API, or that the encoding APIs are now suddenly useless. It only means that we have to replace Py_UNICODE with one of the supported data types for storing Unicode. Since the C world has adopted wchar_t for this purpose, it's the natural choice.
I would prefer to only have a fast-path for the most common encodings: ASCII, Latin1, UTF-8, Windows ANSI code page. That's all.
For any other encodings, the general PyUnicode_AsEncodedString() and PyUnicode_Decode() functions are good enough.
PyUnicode_AsEncodedString() converts Unicode objects to a bytes object. This is not a symmetric replacement for the PyUnicode_Encode*() APIs, since those go from Py_UNICODE to a bytes object.
If someone expects an overhead of passing a string, please prove it with a benchmark. But IMO a small overhead is acceptable for rare encodings.
Note: PyUnicode_AsEncodedString() and PyUnicode_Decode() also have "fast paths" for the most common encodings: ASCII, UTF-8, "mbcs" (Python alias of the Windows ANSI code page), Latin1. But also UTF-16 and UTF-32: I'm not sure if it's really worth it to have these ones, but it was cheap to have them :-)
We can't just remove access to one half of a codec (the decoding part) without at least providing an alternative for C extensions
Sorry, I meant the "encoding part".
to use.
I disagree, we can. The alternative exists since Python 2: PyUnicode_AsEncodedString() and PyUnicode_Decode().
See above. If we remove the direct encoding/decoding C APIs we should at the very least provide generic alternatives which can be used as drop-in replacements for the PyUnicode_Encode*() APIs.
Given PEP 393, this would be APIs which use wchar_t instead of Py_UNICODE.
Using wchar_t is inefficient on all platforms using 16-bit wchar_t since surrogate pairs need a special code path. For example, PyUnicode_FromWideChar() has to scan the string twice: the first time to count the number of surrogate pairs, to allocate the exact buffer size.
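To make the two-pass scan mentioned above concrete, here is a minimal sketch in plain C. It is not CPython's implementation of PyUnicode_FromWideChar(); the helper names are hypothetical, and uint16_t stands in for a 16-bit wchar_t so that the sketch behaves the same on every platform. The first pass counts code points (a valid surrogate pair counts as one), the second pass fills a buffer of exactly that size.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helpers for illustration only; uint16_t models a
   16-bit wchar_t (an assumption, not the real platform type). */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Pass 1: count code points, treating each valid surrogate pair as one. */
static size_t count_code_points(const uint16_t *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (is_high_surrogate(s[i]) && i + 1 < len && is_low_surrogate(s[i + 1]))
            i++;  /* skip the low half of the pair */
        n++;
    }
    return n;
}

/* Pass 2: allocate the exact size, then decode into 32-bit code points.
   Lone surrogates are copied through unchanged. The caller frees the
   returned buffer; NULL is returned if the allocation fails. */
static uint32_t *utf16_to_ucs4(const uint16_t *s, size_t len, size_t *out_len)
{
    size_t n = count_code_points(s, len);
    uint32_t *out = malloc((n ? n : 1) * sizeof *out);
    if (out == NULL)
        return NULL;

    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t ch = s[i];
        if (is_high_surrogate(s[i]) && i + 1 < len && is_low_surrogate(s[i + 1])) {
            /* Combine the pair into one code point above U+FFFF. */
            ch = 0x10000 + (((uint32_t)s[i] - 0xD800) << 10)
                         + ((uint32_t)s[i + 1] - 0xDC00);
            i++;
        }
        out[j++] = ch;
    }
    *out_len = n;
    return out;
}
```

With a 32-bit wchar_t the counting pass is unnecessary, since every element is already a full code point. That is the extra cost being pointed at here.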
If you want full UCS4 compatibility, that's true, but those platforms suffer from this deficiency platform wide, so Python is in no way special.
The main point is that wchar_t is the standard in C to represent Unicode code points, so it's a natural choice as replacement for Py_UNICODE. Since the C API is not only meant to be used by the CPython interpreter, we should stick to standards rather than expecting the world to adapt to our implementations.
This also makes the APIs future proof, e.g. in case we make another transition from the current hybrid internal data type for Unicode towards UTF-8 buffers as internal data type.
Cheers,
--
Marc-Andre Lemburg
eGenix.com