[Python-Dev] Re: Plan to remove Py_UNICODE APIs except PEP 623.

June 30, 2020

On 29.06.2020 11:57, Victor Stinner wrote:
On Sun, 28 June 2020 at 11:22, M.-A. Lemburg <mal@egenix.com> wrote:
as you may remember, I wasn't happy with the deprecations of the
APIs in PEP 393, since there are no C API alternatives for
the encoding APIs deprecated in the PEP, which allow direct
encoding provided by these important codecs.
AFAIK, the situation hasn't changed since then.
I would prefer to analyze the list on a case by case basis. I don't
think that it's useful to expose every single encoding supported by
Python as a C function.
I designed the Unicode C API as a rich API, so that it's easy
to use from C extensions and the interpreter as well.

The main theme was to have a symmetric API for both encoding and
decoding. The PEP now suggests removing the APIs on the basis of
deprecating Py_UNICODE, which is a change in data type.

This does not mean we have to give up the symmetry in the C API,
or that the encoding APIs are now suddenly useless. It only means
that we have to replace Py_UNICODE with one of the supported data
types for storing Unicode.

Since the C world has adopted wchar_t for this purpose, it's the
natural choice.
...
I would prefer to only have a fast-path for the most common encodings:
ASCII, Latin1, UTF-8, Windows ANSI code page. That's all.
For any other encodings, the general PyUnicode_AsEncodedString() and
PyUnicode_Decode() functions are good enough.
PyUnicode_AsEncodedString() converts Unicode objects to a
bytes object. This is not a symmetric replacement for the
PyUnicode_Encode*() APIs, since those go from Py_UNICODE to
a bytes object.
...
If someone expects an overhead of passing a string, please prove it
with a benchmark. But IMO a small overhead is acceptable for rare
encodings.
Note: PyUnicode_AsEncodedString() and PyUnicode_Decode() also have
"fast paths" for most common encodings: ASCII, UTF-8, "mbcs" (Python
alias of the Windows ANSI code page), Latin1. But also UTF-16 and
UTF-32: I'm not sure if it's really worth it to have these ones, but it was
cheap to have them :-)
...
We can't just remove access to one half of a codec (the decoding
part) without at least providing an alternative for C extensions
to use.
Sorry, I meant the "encoding part".
I disagree, we can. The alternative exists since Python 2:
PyUnicode_AsEncodedString() and PyUnicode_Decode().
See above.

If we remove the direct encoding/decoding C APIs we should at the
very least provide generic alternatives which can be used as drop-in
replacement for the PyUnicode_Encode*() APIs.
...
...
Given PEP 393, this would be APIs which use wchar_t instead of
Py_UNICODE.
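
To illustrate the shape of API being asked for here, a buffer-based
encoder takes a raw wchar_t array and writes bytes, with no string
object in between. The following is only a standalone sketch, not
CPython code: the name encode_latin1() and its signature are
hypothetical, and Latin-1 was picked because it is the simplest codec
to show.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical buffer-based encoder (not a CPython API).
 * Encodes len wide characters from src as Latin-1 into out.
 * Returns the number of bytes written, or (size_t)-1 if out is too
 * small or a character does not fit into Latin-1. */
static size_t
encode_latin1(const wchar_t *src, size_t len,
              unsigned char *out, size_t out_size)
{
    assert(src != NULL && out != NULL);
    if (len > out_size) {
        return (size_t)-1;
    }
    for (size_t i = 0; i < len; i++) {
        /* The cast also maps negative values of a signed wchar_t
         * to large numbers, so they are rejected below. */
        unsigned long c = (unsigned long)src[i];
        if (c > 0xFF) {
            return (size_t)-1;
        }
        out[i] = (unsigned char)c;
    }
    return len;
}
```

A caller would pass its own wchar_t buffer and an output buffer of at
least len bytes; anything above U+00FF is reported as an error rather
than replaced.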
Using wchar_t is inefficient on all platforms using 16-bit wchar_t
since surrogate pairs need a special code path. For example,
PyUnicode_FromWideChar() has to scan the string twice: the first time
to count the number of surrogate pairs, to allocate the exact buffer
size.
If you want full UCS4 compatibility, that's true, but those platforms
suffer from this deficiency platform wide, so Python is in no way
special.
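
The two-pass scan mentioned above for 16-bit wchar_t can be sketched
roughly as follows. This is a standalone illustration, not the actual
PyUnicode_FromWideChar() code: it uses uint16_t instead of wchar_t so
that it behaves the same on every platform, and the function names
are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* High (lead) and low (trail) UTF-16 surrogate ranges. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Pass 1: count well-formed surrogate pairs in src[0..len). */
static size_t
count_surrogate_pairs(const uint16_t *src, size_t len)
{
    size_t pairs = 0;
    for (size_t i = 0; i + 1 < len; i++) {
        if (is_high_surrogate(src[i]) && is_low_surrogate(src[i + 1])) {
            pairs++;
            i++;  /* skip the low half of the pair */
        }
    }
    return pairs;
}

/* Pass 2: allocate exactly len - pairs code points and decode.
 * Lone surrogates are copied through unchanged.
 * Returns a malloc()ed buffer (caller frees) or NULL on failure. */
static uint32_t *
utf16_to_ucs4(const uint16_t *src, size_t len, size_t *out_len)
{
    assert(src != NULL && out_len != NULL);
    size_t n = len - count_surrogate_pairs(src, len);
    uint32_t *dst = malloc((n ? n : 1) * sizeof *dst);
    if (dst == NULL) {
        return NULL;
    }
    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t c = src[i];
        if (is_high_surrogate(src[i]) && i + 1 < len
            && is_low_surrogate(src[i + 1])) {
            c = 0x10000 + (((uint32_t)src[i] - 0xD800) << 10)
                        + ((uint32_t)src[i + 1] - 0xDC00);
            i++;
        }
        dst[j++] = c;
    }
    *out_len = j;
    return dst;
}
```

With a 32-bit wchar_t the first pass is unnecessary, since every
element already is one code point; that is the extra cost being
discussed for 16-bit platforms.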

The main point is that wchar_t is the standard in C to represent
Unicode code points, so it's a natural choice as replacement for
Py_UNICODE.

Since the C API is not only meant to be used by the CPython interpreter,
we should stick to standards rather than expecting the world to adapt
to our implementations. This also makes the APIs future proof, e.g.
in case we make another transition from the current hybrid internal
data type for Unicode towards UTF-8 buffers as internal data type.

Cheers,
-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts
Python Projects, Coaching and Consulting ...  http://www.egenix.com/
Python Database Interfaces ...           http://products.egenix.com/
Plone/Zope Database Interfaces ...           http://zope.egenix.com/

::: We implement business ideas - efficiently in both time and costs :::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/
                      http://www.malemburg.com/