PEP 624: Remove Py_UNICODE encoder APIs
Hi, folks.

Since the previous discussion was suspended without consensus, I wrote a new PEP for it. (Thank you Victor for reviewing it!)

This PEP looks very similar to PEP 623 "Remove wstr from Unicode", but it covers the encoder APIs, not the Unicode object APIs.

URL (not available yet): https://www.python.org/dev/peps/pep-0624/

---

PEP: 624
Title: Remove Py_UNICODE encoder APIs
Author: Inada Naoki <songofacandy@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 06-Jul-2020
Python-Version: 3.11


Abstract
========

This PEP proposes to remove deprecated ``Py_UNICODE`` encoder APIs in Python 3.11:

* ``PyUnicode_Encode()``
* ``PyUnicode_EncodeASCII()``
* ``PyUnicode_EncodeLatin1()``
* ``PyUnicode_EncodeUTF7()``
* ``PyUnicode_EncodeUTF8()``
* ``PyUnicode_EncodeUTF16()``
* ``PyUnicode_EncodeUTF32()``
* ``PyUnicode_EncodeUnicodeEscape()``
* ``PyUnicode_EncodeRawUnicodeEscape()``
* ``PyUnicode_EncodeCharmap()``
* ``PyUnicode_TranslateCharmap()``
* ``PyUnicode_EncodeDecimal()``
* ``PyUnicode_TransformDecimalToASCII()``

.. note::

   `PEP 623 <https://www.python.org/dev/peps/pep-0623/>`_ proposes to remove
   Unicode object APIs relating to ``Py_UNICODE``. This PEP, on the other hand,
   is not related to the Unicode object. The two PEPs are split because they
   have different motivations and need separate discussions.


Motivation
==========

In general, reducing the number of APIs that have been deprecated for a long time and have few users is a good idea: it improves the maintainability of CPython, and it also helps API users and other Python implementations.


Rationale
=========

Deprecated since Python 3.3
---------------------------

``Py_UNICODE`` and the APIs using it have been deprecated since Python 3.3.


Inefficient
-----------

All of these APIs are implemented using ``PyUnicode_FromWideChar``. So these APIs are inefficient when the user wants to encode a Unicode object.


Not used widely
---------------

When searching the top 4000 PyPI packages [1]_, only pyodbc uses these APIs:

* ``PyUnicode_EncodeUTF8()``
* ``PyUnicode_EncodeUTF16()``

pyodbc uses these APIs to encode a Unicode object into a bytes object, so it is easy to fix. [2]_


Alternative APIs
================

There are alternative APIs that accept ``PyObject *unicode`` instead of ``Py_UNICODE *``. Users can migrate to them.

========================================= ==========================================
Deprecated API                            Alternative APIs
========================================= ==========================================
``PyUnicode_Encode()``                    ``PyUnicode_AsEncodedString()``
``PyUnicode_EncodeASCII()``               ``PyUnicode_AsASCIIString()`` \(1)
``PyUnicode_EncodeLatin1()``              ``PyUnicode_AsLatin1String()`` \(1)
``PyUnicode_EncodeUTF7()``                \(2)
``PyUnicode_EncodeUTF8()``                ``PyUnicode_AsUTF8String()`` \(1)
``PyUnicode_EncodeUTF16()``               ``PyUnicode_AsUTF16String()`` \(3)
``PyUnicode_EncodeUTF32()``               ``PyUnicode_AsUTF32String()`` \(3)
``PyUnicode_EncodeUnicodeEscape()``       ``PyUnicode_AsUnicodeEscapeString()``
``PyUnicode_EncodeRawUnicodeEscape()``    ``PyUnicode_AsRawUnicodeEscapeString()``
``PyUnicode_EncodeCharmap()``             ``PyUnicode_AsCharmapString()`` \(1)
``PyUnicode_TranslateCharmap()``          ``PyUnicode_Translate()``
``PyUnicode_EncodeDecimal()``             \(4)
``PyUnicode_TransformDecimalToASCII()``   \(4)
========================================= ==========================================

Notes:

(1)
   The ``const char *errors`` parameter is missing.

(2)
   There is no public alternative API, but users can use the generic
   ``PyUnicode_AsEncodedString()`` instead.

(3)
   The ``const char *errors, int byteorder`` parameters are missing.

(4)
   There is no direct replacement, but ``Py_UNICODE_TODECIMAL`` can be used
   instead. CPython uses ``_PyUnicode_TransformDecimalAndSpaceToASCII`` for
   converting from Unicode to numbers instead.


Plan
====

Python 3.9
----------

Add ``Py_DEPRECATED(3.3)`` to the following APIs.
This change is committed already [3]_. All other APIs have been marked ``Py_DEPRECATED(3.3)`` already.

* ``PyUnicode_EncodeDecimal()``
* ``PyUnicode_TransformDecimalToASCII()``

Document all APIs as "will be removed in version 3.11".


Python 3.11
-----------

These APIs are removed.

* ``PyUnicode_Encode()``
* ``PyUnicode_EncodeASCII()``
* ``PyUnicode_EncodeLatin1()``
* ``PyUnicode_EncodeUTF7()``
* ``PyUnicode_EncodeUTF8()``
* ``PyUnicode_EncodeUTF16()``
* ``PyUnicode_EncodeUTF32()``
* ``PyUnicode_EncodeUnicodeEscape()``
* ``PyUnicode_EncodeRawUnicodeEscape()``
* ``PyUnicode_EncodeCharmap()``
* ``PyUnicode_TranslateCharmap()``
* ``PyUnicode_EncodeDecimal()``
* ``PyUnicode_TransformDecimalToASCII()``


Alternative ideas
=================

Instead of just removing deprecated APIs, we may be able to reuse their names with different signatures.


Make some private APIs public
------------------------------

``PyUnicode_EncodeUTF7()`` doesn't have a public alternative API.

Some APIs have public alternatives, but those are missing the ``const char *errors`` or ``int byteorder`` parameters.

We can rename some private APIs and make them public to cover the missing APIs and parameters.

============================= ================================
Rename to                     Rename from
============================= ================================
``PyUnicode_EncodeASCII()``   ``_PyUnicode_AsASCIIString()``
``PyUnicode_EncodeLatin1()``  ``_PyUnicode_AsLatin1String()``
``PyUnicode_EncodeUTF7()``    ``_PyUnicode_EncodeUTF7()``
``PyUnicode_EncodeUTF8()``    ``_PyUnicode_AsUTF8String()``
``PyUnicode_EncodeUTF16()``   ``_PyUnicode_EncodeUTF16()``
``PyUnicode_EncodeUTF32()``   ``_PyUnicode_EncodeUTF32()``
============================= ================================

Pros:

* We have a more consistent API set.

Cons:

* We have more public APIs to maintain.
* Existing public APIs are enough for most use cases, and
  ``PyUnicode_AsEncodedString()`` can be used in other cases.


Replace ``Py_UNICODE*`` with ``Py_UCS4*``
-----------------------------------------

We can replace ``Py_UNICODE`` (typedef of ``wchar_t``) with ``Py_UCS4``. Since builtin codecs support UCS-4, we don't need to convert a ``Py_UCS4*`` string to a Unicode object.

Pros:

* We have a more consistent API set.
* Users can encode a UCS-4 string in C without creating a Unicode object.

Cons:

* We have more public APIs to maintain.
* Applications which use UTF-8 or UTF-32 cannot use these APIs anyway.
* Other Python implementations may not have a builtin codec for UCS-4.
* If we change the Unicode internal representation to UTF-8, we need to keep
  UCS-4 support only for these APIs.


Replace ``Py_UNICODE*`` with ``wchar_t*``
-----------------------------------------

We can replace ``Py_UNICODE`` with ``wchar_t``.

Pros:

* We have a more consistent API set.
* Backward compatible.

Cons:

* We have more public APIs to maintain.
* They are inefficient on platforms where ``wchar_t*`` is UTF-16, because
  the built-in codecs support only UCS-1, UCS-2, and UCS-4 input.


Rejected ideas
==============

Using runtime warning
---------------------

These APIs don't release the GIL for now. Emitting a warning from such APIs is not safe. See this example.

.. code-block::

   PyObject *u = PyList_GET_ITEM(list, i);  // u is borrowed reference.
   PyObject *b = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(u),
           PyUnicode_GET_SIZE(u), NULL);
   // Assumes u is still living reference.
   PyObject *t = PyTuple_Pack(2, u, b);
   Py_DECREF(b);
   return t;

If we emit a Python warning from ``PyUnicode_EncodeUTF8()``, warning filters and other threads may change the ``list``, and ``u`` can be a dangling reference after ``PyUnicode_EncodeUTF8()`` has returned.

Additionally, since we are not changing behavior but removing C APIs, a runtime ``DeprecationWarning`` might not be helpful for Python developers. We should warn extension developers instead.


Discussions
===========

* `Plan to remove Py_UNICODE APis except PEP 623
  <https://mail.python.org/archives/list/python-dev@python.org/thread/S7KW2U6IGXZFBMGS6WSJB26NZIBW4OLE/#S7KW2U6IGXZFBMGS6WSJB26NZIBW4OLE>`_
* `bpo-41123: Remove Py_UNICODE APIs except PEP 623:
  <https://bugs.python.org/issue41123>`_


References
==========

.. [1] Source package list chosen from top 4000 PyPI packages.
       (https://github.com/methane/notes/blob/master/2020/wchar-cache/package_list.t...)

.. [2] pyodbc -- Don't use PyUnicode_Encode API #792
       (https://github.com/mkleehammer/pyodbc/pull/792)

.. [3] Uncomment Py_DEPRECATED for Py_UNICODE APIs (GH-21318)
       (https://github.com/python/cpython/commit/9c3840870814493fed62e140cfa43c2883e...)


Copyright
=========

This document has been placed in the public domain.

--
Inada Naoki <songofacandy@gmail.com>
On Tue, Jul 7, 2020 at 17:21, Inada Naoki <songofacandy@gmail.com> wrote:
This PEP proposes to remove deprecated ``Py_UNICODE`` encoder APIs in Python 3.11:
Overall, I like the plan. IMHO 3.11 is a reasonable target version, since on the top 4000 projects, only 2 are affected and it is easy to fix them.
I guess that if the release manager is not ok to add the two remaining Py_DEPRECATED() warnings, they can be added to 3.10 instead.
If needed, new functions can be added independently of this PEP.
DeprecationWarning is hidden by default: users would not be impacted. I don't think that encoding functions are special enough to skip these warnings. I think that it's reasonable to change the behavior on these deprecated functions to emit a warning. Victor -- Night gathers, and now my watch begins. It shall not end until my death.
Hi Inada-san,

I am currently too busy with EuroPython to participate in longer discussions. FWIW: I intend to continue after EuroPython.

In any case, thanks for writing up the PEP. Could you please add my points about:

- the fact that the encode APIs encode from a Unicode buffer to a bytes object; this is an important fact, since the removal removes access to this codec functionality for extensions
- PyUnicode_AsEncodedString() is not a proper alternative, since it requires creating a temporary PyUnicode object, which is inefficient and wastes memory
- the maintenance effect mentioned in the PEP does not really materialize, since the underlying functionality still exists in the codecs; only access to the functionality is removed
- keeping just the generic PyUnicode_Encode() API would be a compromise
- if we remove the codec specific PyUnicode_Encode*() APIs, why are we still keeping the special PyUnicode_Decode*() APIs ?
- the deprecations were just done because the Py_UNICODE data type was replaced by a hybrid type. Using this as an argument for removing functionality is not really good practice, when there are ways to continue exposing the functionality using other data types.

I am still strongly -1 on removing all encoding APIs without at least some upgrade path for existing code to use and keeping the API symmetric.

Cheers,
--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ On 07.07.2020 17:17, Inada Naoki wrote:
On Thu, Jul 9, 2020 at 5:46 AM M.-A. Lemburg <mal@egenix.com> wrote:
I wrote your points in the "Alternative ideas > Replace Py_UNICODE* with Py_UCS4*" section, where I wrote "User can encode UCS-4 string in C without creating Unicode object." https://www.python.org/dev/peps/pep-0624/#replace-py-unicode-with-py-ucs4

Note that the current Py_UNICODE* encoder APIs create temporary PyUnicode objects, so they are already inefficient and waste memory. Py_UNICODE* may be UTF-16 on some platforms (e.g. Windows), and the builtin codecs don't support UTF-16 input.
In the same section, I described the maintenance cost as below. * Other Python implementations may not have builtin codec for UCS-4. * If we change the Unicode internal representation to UTF-8, we need to keep UCS-4 support only for these APIs.
OK, I will add a "Discussions" section. (I don't like "FAQ" because some questions are important even if they are not "frequently" asked.)

The quick answer is:

* They are stable ABI. (Py_UNICODE is excluded from the stable ABI.)
* Decoding from char* is a more common and generic use case than encoding from Py_UNICODE*.
* Other Python implementations using UTF-8 as their internal representation can implement it easily.

But I'm not opposed to removing them (especially for the minor UTF-7 codec). It is just out of scope of this PEP.
I hope the "Replace Py_UNICODE* with Py_UCS4*" section describes this. Regards, -- Inada Naoki <songofacandy@gmail.com>
Unless I'm missing something, part of M.-A. Lemburg's objection is:

1. The wchar_t type is itself an important interoperability story in C. (I'm not sure if this includes the ability, at compile time, to define wchar_t as either of two widths.)
2. The ability to work directly with wchar_t without a round-trip in/out of python format is an important feature that CPython has provided for C integrators.
3. The above support can be kept even without the wchar_t* member ... so saving the extra space on each string instance does not require dropping this support.

-jJ
On Thu, Jul 9, 2020 at 10:13 PM Jim J. Jewett <jimjjewett@gmail.com> wrote:
Unless I'm missing something, part of M.-A. Lemburg's objection is:
1. The wchar_t type is itself an important interoperability story in C. (I'm not sure if this includes the ability, at compile time, to define wchar_t as either of two widths.)
Of course. But wchar_t* is not the only way to use Unicode in C. These days UTF-8 is the most common way to use Unicode in C (except for Java, .NET, and the Windows API). So the importance of the wchar_t* APIs is relative, not absolute. In other words, why don't we have an encode API with direct UTF-8 input? Is there any evidence that wchar_t* is much more important than UTF-8?
2. The ability to work directly with wchar_t without a round-trip in/out of python format is an important feature that CPython has provided for C integrators.
Note that the current API *does* the round-trip. For example: https://github.com/python/cpython/blob/61bb24a270d15106decb1c7983bf4c2831671... Users cannot use the API without initializing the Python VM, and they cannot avoid the time and space cost of the round-trip. So removing these APIs doesn't take away any ability.
3. The above support can be kept even without the wchar_t* member ... so saving the extra space on each string instance does not require dropping this support.
This is why I split PEP 623 and PEP 624. I never said removing the wchar_t* member is the motivation for PEP 624. Regards, -- Inada Naoki <songofacandy@gmail.com>
Hi, Lemburg. Thank you for organizing EuroPython 2020. I enjoyed watching some sessions from home. I think the current PEP 624 covers all your points and is ready for Steering Council discussion. Would you like to review the PEP before then? Regards, On Thu, Jul 9, 2020 at 8:19 AM Inada Naoki <songofacandy@gmail.com> wrote:
-- Inada Naoki <songofacandy@gmail.com>
Hi Inada-san, thanks for attending EuroPython. I won't be back online until next Wednesday. Would it be possible to wait until then to continue the discussion ? Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts
Hi Inada-san,

thank you for adding some comments, but they are not really capturing what I think is missing:

"""
Removing these APIs removes ability to use codec without temporary Unicode.

Codecs can not encode Unicode buffer directly without temporary Unicode object since Python 3.3. All these APIs creates temporary Unicode object for now. So removing them doesn't reduce any abilities.
"""

The point is that while the decoders allow going from a C object to a Python object directly, we are missing a way to do the same for the encoders, since the Python 3.3 change in the Unicode internals.

At the very least, we should have such APIs for going from wchar_t* to a Python object.

The alternatives you provide all require creating an intermediate Python object for this purpose. The APIs you want to remove do that as well, but that's not the point. The point is to expose the codecs' encode mechanism which is available in the C code, but currently not exposed via C APIs, e.g. ucs4lib_utf8_encode().

It would be a breaking change, but those APIs in your list could simply be changed from using Py_UNICODE to using wchar_t instead and then interface directly to the internal functions we have for the encoders.

That would keep extensions working after a recompile, since Py_UNICODE is already a typedef to wchar_t.

Thanks,
--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Feb 01 2021)
On Mon, Feb 1, 2021 at 4:47 PM M.-A. Lemburg <mal@egenix.com> wrote:
We cannot optimize all use cases. IMO we should only optimize conversions between char* and Python objects. I don't see the need for two conversions (char* => Python and then Python => wchar_t*) as an issue if you need wchar_t*. Objects/unicodeobject.c is already very complex, with specializations for the ASCII, Py_UCS1 (latin1), Py_UCS2 and Py_UCS4 kinds: 16k lines of C code. I would prefer to make it simpler rather than more complex. Internally, functions like PyUnicode_EncodeLatin1() already do the two conversions. So it's not like the PEP has any impact on performance.
That would keep extensions working after a recompile, since Py_UNICODE is already a typedef to wchar_t.
Extensions should not use Py_UNICODE*/wchar_t*. Can you explain where wchar_t* type is appropriate and how two conversions is a performance bottleneck? Victor -- Night gathers, and now my watch begins. It shall not end until my death.
On 01.02.2021 17:10, Victor Stinner wrote:
The C code is already there, but it got hidden away in the Python 3.3 change to new internals. All that needs to be done is remove the intermediate Python Unicode object creation and have those encoder APIs again interface to the native C code.
Before Python 3.3 all those APIs interfaced directly to the C codec functions. The introduction of an intermediate Python Unicode object was just done as a quick work-around, even though it was not really needed, since Python 3.3 did not remove the C code of the encoders.
They should not use Py_UNICODE. wchar_t is standard C and is in wide spread use in C code for storing Unicode data. This was one of the main reasons for introducing UCS4 Python versions for Linux in the mid 2000s, since Linux apps used 4 byte wchar_t as native storage format. My point is that extensions would just need a recompile with the change from Py_UNICODE to wchar_t, since Py_UNICODE and wchar_t are already the same thing in Python 3.3+.
Can you explain where wchar_t* type is appropriate and how two conversions is a performance bottleneck?
If an extension has a wchar_t* string, it should be easy to convert this into a Python bytes object for use in Python. Just like it should be easy to go from a char* string to a Python str object. The PEP breaks this symmetry by removing access to the encoder implementations. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Feb 01 2021)
On Mon, Feb 1, 2021 at 5:39 PM M.-A. Lemburg <mal@egenix.com> wrote:
The C code is already there, but it got hidden away in the Python 3.3 change to new internals.
Well, we are not in agreement and it's ok. Your objection is written in the PEP. IMO it's now up to the Steering Council to decide if the overall PEP is ok or not. The PEP itself is now complete and lists advantages and drawbacks. Victor -- Night gathers, and now my watch begins. It shall not end until my death.
On 01.02.2021 17:51, Victor Stinner wrote:
Please read my reply to Inada-san. If the PEP were complete and ok, I would not have written the email. The fix is pretty simple, doesn't add a lot more code and gets us the symmetry back that I had put into the Unicode C API when I created this back in 2000. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Feb 01 2021)
On Mon, Feb 1, 2021 at 5:58 PM M.-A. Lemburg <mal@egenix.com> wrote:
This sounds like a completely different PEP than PEP 624 (which aims to remove code, not add code). I suggest that you propose your own PEP. Victor -- Night gathers, and now my watch begins. It shall not end until my death.
On Mon, 1 Feb 2021 17:39:16 +0100 "M.-A. Lemburg" <mal@egenix.com> wrote:
Do you have any data points about "wide spread use"? I work in C++ daily and don't see any "wide spread use" of wchar_t (or its C++ cousin std::wstring). Modern APIs assume bytestrings and UTF-8 encoding. Regards Antoine.
On 01/02/2021 17.39, M.-A. Lemburg wrote:
How much software actually uses wchar_t these days and interfaces with Python? Do you have examples for software that uses wchar_t and would benefit from wchar_t support in Python?

I did a quick search for wcslen in all shared libraries and binaries on my system. It's a good indicator how many programs actually use wchar_t. 126 out of more than 9,000 shared libraries and binaries contain the string "wcslen". The only hit for PyUnicode_AsWideCharString was libpypy3-c.so...

(Fedora has unified /usr and /lib64, e.g. /bin -> /usr/bin)

$ ls /usr/bin/ /usr/sbin/ | grep -v python | wc -l
4264
$ grep -R wcslen /usr/bin/ /usr/sbin/ | grep -v python | wc -l
92
$ find /usr/lib64/ -name '*.so' -not -name '*python*' | wc -l
5478
$ find /usr/lib64/ -name '*.so' -not -name '*python*' | xargs grep wcslen | wc -l
34

Christian
On Mon, 1 Feb 2021 at 17:19, Christian Heimes <christian@python.org> wrote:
This is very much a drive-by comment (I haven't been following this thread) so ignore me if this is already covered, but Windows APIs use wchar_t extensively. I routinely work with wchar_t when interfacing Windows API code and Python. But I have no idea what this PEP is proposing to drop, so as long as someone has ensured that the PEP won't adversely affect working with Windows APIs, I'm happy. Paul
On 2/1/2021 5:16 PM, Christian Heimes wrote:
Yeah, you searched the wrong kind of system ;) Pick up a Windows machine, cross-platform code that originated on Windows, anything that interoperates with Java or .NET as well, or uses wxWidgets. I'm not defending the choice of wchar_t over UTF-8 (but I can: most of these systems chose Unicode before UTF-8 was invented and never took the backwards-incompatible change because they were so popular), but if we want to pragmatically weigh the needs of our users above our desire for purity, then we should try and support both equally wherever possible. Cheers, Steve
On Tue, Feb 2, 2021 at 4:28 AM Steve Dower <steve.dower@python.org> wrote:
Note that we don't have a direct "UTF-8 (char*) to Python bytes object" encoder API either. If PEP 624 is accepted, UTF-8 and wchar_t* are treated equally. So please don't think that PEP 624 neglects only wchar_t*. Regards, -- Inada Naoki <songofacandy@gmail.com>
On Tue, Feb 2, 2021 at 12:43 AM M.-A. Lemburg <mal@egenix.com> wrote:
We already have PyUnicode_FromWideChar(). So I assume you mean "wchar_t* to Python bytes object".
OK, I see codecs.h has three encoders.

* utf8_encode
* utf16_encode
* utf32_encode

But there are 13 encoders in my PEP:

* PyUnicode_Encode()
* PyUnicode_EncodeASCII()
* PyUnicode_EncodeLatin1()
* PyUnicode_EncodeUTF7()
* PyUnicode_EncodeUTF8()
* PyUnicode_EncodeUTF16()
* PyUnicode_EncodeUTF32()
* PyUnicode_EncodeUnicodeEscape()
* PyUnicode_EncodeRawUnicodeEscape()
* PyUnicode_EncodeCharmap()
* PyUnicode_TranslateCharmap()
* PyUnicode_EncodeDecimal()
* PyUnicode_TransformDecimalToASCII()

Do you want to keep all encoders? Or 3 encoders?
That would keep extensions working after a recompile, since Py_UNICODE is already a typedef to wchar_t.
That idea is written in the PEP already. https://www.python.org/dev/peps/pep-0624/#replace-py-unicode-with-wchar-t Regards, -- Inada Naoki <songofacandy@gmail.com>
On 02.02.2021 00:33, Inada Naoki wrote:
Yes, that's what I meant. Encoding from wchar_t* to a Python bytes object. This is what the encoder APIs all implement. They have become less efficient with Python 3.3, but this can be resolved, while at the same time removing Py_UNICODE and replacing it with wchar_t in those encoder APIs.
We could keep all encoders, replacing Py_UNICODE with wchar_t in the API. For the ones where we have separate implementations as private functions, we can move back to direct encoding. For the others, we can keep using the temporary Unicode object or refactor the code to expose the native encoders working directly on the internal buffers as private functions and then use those in the same way for direct encoding. The Unicode API was meant and designed as a rich API, making it easy to use and providing a complete set for extension writers and CPython to use. I believe we should keep it that way.
Right and I think this is a more workable approach than removing APIs. BTW: I don't understand this comment: "They are inefficient on platforms wchar_t* is UTF-16. It is because built-in codecs supports only UCS-1, UCS-2, and UCS-4 input." Windows is one such platform. Java (indirectly) is another. They both store UTF-16LE in those arrays and Python's codecs handle this just fine. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Feb 02 2021)
On Tue, Feb 2, 2021 at 7:37 PM M.-A. Lemburg <mal@egenix.com> wrote:
I'm sorry that the section is not clear. For example, if wchar_t* is UCS4, ucs4_utf8_encoder() can encode wchar_t* into UTF-8. But when wchar_t* is UTF-16, ucs2_utf8_encoder() cannot handle surrogate pairs, so we need to use a temporary Unicode object. That is what "inefficient" means. I will elaborate on this in the section. Regards, -- Inada Naoki <songofacandy@gmail.com>
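To make the surrogate problem above concrete, here is a minimal sketch in plain C. It is not CPython code: `is_high_surrogate()`, `is_low_surrogate()` and `combine_surrogates()` are hypothetical helpers, and `uint16_t` stands in for a 16-bit `wchar_t`. A code point above U+FFFF occupies two 16-bit units in UTF-16, so an encoder that treats every unit as a complete code point (as a UCS-2 encoder does) would see two separate surrogate values instead of one character.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; these are not CPython functions. */

/* True if `u` is the first (high) half of a UTF-16 surrogate pair. */
static int is_high_surrogate(uint16_t u)
{
    return u >= 0xD800 && u <= 0xDBFF;
}

/* True if `u` is the second (low) half of a UTF-16 surrogate pair. */
static int is_low_surrogate(uint16_t u)
{
    return u >= 0xDC00 && u <= 0xDFFF;
}

/* Combine a valid high/low surrogate pair into a single code point. */
static uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    assert(is_high_surrogate(high) && is_low_surrogate(low));
    return 0x10000u + ((((uint32_t)high - 0xD800u) << 10)
                       | ((uint32_t)low - 0xDC00u));
}
```

A UTF-16 aware encoder would need this kind of pairing step for every unit it reads, which a UCS-2 or UCS-4 encoder never has to do.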
On Tue, Feb 2, 2021 at 3:47 AM Inada Naoki <songofacandy@gmail.com> wrote:
Since real UCS-2 is effectively dead, maybe it should be flipped around: Make UTF-16 be the efficient path and UCS-2 be the path that needs to round-trip through Unicode. But I suppose that's out of scope for this PEP. -Em
On Tue, Feb 2, 2021 at 9:40 PM Emily Bowman <silverbacknet@gmail.com> wrote:
Note that ucs2_utf8_encoder() is currently used only for encoding Python Unicode objects. A Unicode object is latin1, UCS2, or UCS4. It is never UTF-16. So if we add UTF-16 support to ucs2_utf8_encoder(), we need to add and maintain code only for PyUnicode_EncodeUTF8 (encoding from wchar_t* into char*). I don't think it is a good deal. As described in the PEP, the encoder APIs are used very rarely. We must not add any maintenance costs for them. Regards, -- Inada Naoki <songofacandy@gmail.com>
On Tue, Feb 2, 2021 at 11:47 PM Inada Naoki <songofacandy@gmail.com> wrote:
Before PEP 393 (compact strings), I fixed tons of bugs in the Python 2.7 and Python 3 codecs related to properly handling 16-bit wchar_t, that is, properly handling surrogate characters. The implementation was complex and slow. I would prefer not to move backwards to that :-( If you are curious, look into the PyUnicode_FromWideChar() implementation and search for find_maxchar_surrogates() to get an idea of the cost of handling UTF-16 surrogate pairs. For a full codec, it's way more complex, painful to write and to maintain. I'm happy that we were able to remove that thanks to PEP 393! Victor -- Night gathers, and now my watch begins. It shall not end until my death.
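As a rough illustration of the extra work described above, the following is a simplified model in plain C of a pre-scan over a 16-bit buffer. It is not the CPython implementation of `find_maxchar_surrogates()`: the `scan_result` type and `scan_utf16()` function are hypothetical names, and `uint16_t` stands in for a 16-bit `wchar_t`. The sketch only shows that the input has to be walked once, pairing surrogates as it goes, before the size and width of the output are known.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Result of one pass over a UTF-16 buffer (hypothetical type). */
typedef struct {
    uint32_t max_char;   /* largest code point seen, after pairing */
    size_t   num_pairs;  /* number of valid surrogate pairs found */
} scan_result;

/* Simplified model for illustration; not the CPython implementation. */
static scan_result scan_utf16(const uint16_t *s, size_t len)
{
    scan_result r = {0, 0};
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        uint32_t ch = s[i];
        /* A high surrogate followed by a low surrogate forms one code point. */
        if (ch >= 0xD800 && ch <= 0xDBFF && i + 1 < len
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ch = 0x10000u + (((ch - 0xD800u) << 10)
                             | ((uint32_t)s[i + 1] - 0xDC00u));
            r.num_pairs++;
            i++;  /* skip the low half that was just consumed */
        }
        if (ch > r.max_char)
            r.max_char = ch;
    }
    return r;
}
```

With such a result, the number of code points is `len - num_pairs`, and `max_char` tells the caller how wide each stored character has to be.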
On Tue, Feb 2, 2021 at 8:40 PM Inada Naoki <songofacandy@gmail.com> wrote:
I updated the "Alternative Ideas" section of the PEP. https://www.python.org/dev/peps/pep-0624/#alternative-ideas The alternatives replace `Py_UNICODE*` with `PyObject*`, `Py_UCS4*`, and `wchar_t*`. I explicitly noted that some codecs can bypass temporary Unicode objects:

"""
UTF-8, UTF-16, UTF-32 encoders support Py_UCS4 internally. So PyUnicode_EncodeUTF8(), PyUnicode_EncodeUTF16(), and PyUnicode_EncodeUTF32() can avoid to create a temporary Unicode object.
"""

-- Inada Naoki <songofacandy@gmail.com>
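For readers wondering what "encoding directly from a UCS-4 buffer" means in practice, here is a minimal standalone sketch in plain C. It is an assumption-laden illustration, not CPython's encoder and not a proposed API: `ucs4_to_utf8()` is a hypothetical name, `uint32_t` stands in for `Py_UCS4`, and there is no `errors` handler, so invalid input simply makes the function fail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode `len` UCS-4 code points from `in` as UTF-8 into `out`.
 * Returns the number of bytes written, or -1 if `out_cap` is too small
 * or a value cannot be encoded (a surrogate, or above U+10FFFF).
 * Hypothetical helper for illustration; not a CPython API.
 */
static ptrdiff_t ucs4_to_utf8(const uint32_t *in, size_t len,
                              unsigned char *out, size_t out_cap)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        uint32_t ch = in[i];
        if (ch < 0x80) {                       /* 1 byte: ASCII */
            if (n + 1 > out_cap) return -1;
            out[n++] = (unsigned char)ch;
        } else if (ch < 0x800) {               /* 2 bytes */
            if (n + 2 > out_cap) return -1;
            out[n++] = (unsigned char)(0xC0 | (ch >> 6));
            out[n++] = (unsigned char)(0x80 | (ch & 0x3F));
        } else if (ch < 0x10000) {             /* 3 bytes */
            if (ch >= 0xD800 && ch <= 0xDFFF) return -1;  /* surrogate */
            if (n + 3 > out_cap) return -1;
            out[n++] = (unsigned char)(0xE0 | (ch >> 12));
            out[n++] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (ch & 0x3F));
        } else if (ch <= 0x10FFFF) {           /* 4 bytes */
            if (n + 4 > out_cap) return -1;
            out[n++] = (unsigned char)(0xF0 | (ch >> 18));
            out[n++] = (unsigned char)(0x80 | ((ch >> 12) & 0x3F));
            out[n++] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (ch & 0x3F));
        } else {
            return -1;                         /* outside Unicode range */
        }
    }
    return (ptrdiff_t)n;
}
```

Because every input element is already a full code point, the loop needs no pairing step, which is why a UCS-4 input path is cheaper to support than a UTF-16 one.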
![](https://secure.gravatar.com/avatar/15b1cd41a4c23e7dc10893777afb4281.jpg?s=120&d=mm&r=g)
On Tue, Jul 7, 2020 at 17:21, Inada Naoki <songofacandy@gmail.com> wrote:
This PEP proposes to remove deprecated ``Py_UNICODE`` encoder APIs in Python 3.11:
Overall, I like the plan. IMHO 3.11 is a reasonable target version, since among the top 4000 projects only 2 are affected, and they are easy to fix.
I guess that if the release manager is not ok to add the two remaining Py_DEPRECATED() warnings, they can be added to 3.10 instead.
If needed, new functions can be added independently of this PEP.
DeprecationWarning is hidden by default: users would not be impacted. I don't think that encoding functions are special enough to skip these warnings. I think that it's reasonable to change the behavior on these deprecated functions to emit a warning. Victor -- Night gathers, and now my watch begins. It shall not end until my death.
![](https://secure.gravatar.com/avatar/0a2191a85455df6d2efdb22c7463c304.jpg?s=120&d=mm&r=g)
Hi Inada-san,

I am currently too busy with EuroPython to participate in longer discussions. FWIW: I intend to continue after EuroPython.

In any case, thanks for writing up the PEP. Could you please add my points about:

- the fact that the encode APIs encode from a Unicode buffer to a bytes object; this is an important fact, since the removal removes access to this codec functionality for extensions
- PyUnicode_AsEncodedString() is not a proper alternative, since it requires creating a temporary PyUnicode object, which is inefficient and wastes memory
- the maintenance effect mentioned in the PEP does not really materialize, since the underlying functionality still exists in the codecs - only access to the functionality is removed
- keeping just the generic PyUnicode_Encode() API would be a compromise
- if we remove the codec specific PyUnicode_Encode*() APIs, why are we still keeping the special PyUnicode_Decode*() APIs ?
- the deprecations were just done because the Py_UNICODE data type was replaced by a hybrid type. Using this as an argument for removing functionality is not really good practice, when there are ways to continue exposing the functionality using other data types.

I am still strongly -1 on removing all encoding APIs without at least some upgrade path for existing code to use and keeping the API symmetric.

Cheers, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ On 07.07.2020 17:17, Inada Naoki wrote:
![](https://secure.gravatar.com/avatar/351a10f392414345ed67a05e986dc4dd.jpg?s=120&d=mm&r=g)
On Thu, Jul 9, 2020 at 5:46 AM M.-A. Lemburg <mal@egenix.com> wrote:
I wrote your points in the "Alternative Idea > Replace Py_UNICODE* with Py_UCS4* " section. I wrote "User can encode UCS-4 string in C without creating Unicode object." in it. https://www.python.org/dev/peps/pep-0624/#replace-py-unicode-with-py-ucs4 Note that the current Py_UNICODE* encoder APIs create temporary PyUnicode objects. They are inefficient and waste memory now. Py_UNICODE* may be UTF-16 on some platforms (e.g. Windows), and the builtin codecs don't support UTF-16 input.
In the same section, I described the maintenance cost as below.
* Other Python implementations may not have a builtin codec for UCS-4.
* If we change the Unicode internal representation to UTF-8, we need to keep UCS-4 support only for these APIs.
OK, I will add a "Discussions" section. (I don't like "FAQ" because some questions are important even if they are not "frequently" asked.) The quick answer is:
* They are stable ABI. (Py_UNICODE is excluded from the stable ABI.)
* Decoding from char* is a more common and generic use case than encoding from Py_UNICODE*.
* Other Python implementations using UTF-8 as the internal representation can implement it easily.
But I'm not opposed to removing them (especially the minor UTF-7 codec). It is just out of scope of this PEP.
I hope the "Replace Py_UNICODE* with Py_UCS4* " section describes this. Regards, -- Inada Naoki <songofacandy@gmail.com>
![](https://secure.gravatar.com/avatar/d91ce240d2445584e295b5406d12df70.jpg?s=120&d=mm&r=g)
Unless I'm missing something, part of M.-A. Lemburg's objection is:

1. The wchar_t type is itself an important interoperability story in C. (I'm not sure if this includes the ability, at compile time, to define wchar_t as either of two widths.)
2. The ability to work directly with wchar_t without a round-trip in/out of python format is an important feature that CPython has provided for C integrators.
3. The above support can be kept even without the wchar_t* member ... so saving the extra space on each string instance does not require dropping this support.

-jJ
![](https://secure.gravatar.com/avatar/351a10f392414345ed67a05e986dc4dd.jpg?s=120&d=mm&r=g)
On Thu, Jul 9, 2020 at 10:13 PM Jim J. Jewett <jimjjewett@gmail.com> wrote:
Unless I'm missing something, part of M.-A. Lemburg's objection is:
1. The wchar_t type is itself an important interoperability story in C. (I'm not sure if this includes the ability, at compile time, to define wchar_t as either of two widths.)
Of course. But wchar_t* is not the only way to use Unicode in C. UTF-8 is the most common way to use Unicode in C these days (except for Java, .NET, and the Windows API). So the importance of wchar_t* APIs is relative, not absolute. In other words, why don't we have an encode API with direct UTF-8 input? Is there any evidence that wchar_t* is much more important than UTF-8?
2. The ability to work directly with wchar_t without a round-trip in/out of python format is an important feature that CPython has provided for C integrators.
Note that current API *does* the round-trip: For example: https://github.com/python/cpython/blob/61bb24a270d15106decb1c7983bf4c2831671... Users can not use the API without initializing Python VM. Users can not avoid time and space for the round-trip. So removing these APIs doesn't reduce any ability.
3. The above support can be kept even without the wchar_t* member ... so saving the extra space on each string instance does not require dropping this support.
This is why I split PEP 623 and PEP 624. I never said removing the wchar_t* member is motivation for PEP 624. Regards, -- Inada Naoki <songofacandy@gmail.com>
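The round-trip discussed under point 2 can be illustrated from Python with ctypes. This is only a sketch, assuming CPython (where `ctypes.pythonapi` exposes the C API); `encode_wide_buffer_utf8` is a hypothetical helper name, not part of any API. It takes a `wchar_t*` buffer, builds a temporary Unicode object with `PyUnicode_FromWideChar()`, and then encodes that object with `PyUnicode_AsUTF8String()`, which is the two-step path that remains available regardless of the PEP.

```python
import ctypes

# Assumption: running on CPython, where ctypes.pythonapi exposes the C API.
api = ctypes.pythonapi
api.PyUnicode_FromWideChar.argtypes = [ctypes.c_wchar_p, ctypes.c_ssize_t]
api.PyUnicode_FromWideChar.restype = ctypes.py_object
api.PyUnicode_AsUTF8String.argtypes = [ctypes.py_object]
api.PyUnicode_AsUTF8String.restype = ctypes.py_object


def encode_wide_buffer_utf8(wide_buf, length):
    """Hypothetical helper: wchar_t* buffer -> bytes via a temporary str."""
    # Step 1: wchar_t* -> temporary Unicode object.
    temp = api.PyUnicode_FromWideChar(wide_buf, length)
    # Step 2: Unicode object -> UTF-8 encoded bytes object.
    return api.PyUnicode_AsUTF8String(temp)


# A NUL-terminated wchar_t array, as a C extension might hold one.
wide = ctypes.create_unicode_buffer("caf\u00e9")
encoded = encode_wide_buffer_utf8(wide, len(wide) - 1)  # exclude the NUL
print(encoded)
```

The temporary object is created in both the deprecated encoder APIs and this replacement, so the sketch does not show any difference in cost between them.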
![](https://secure.gravatar.com/avatar/351a10f392414345ed67a05e986dc4dd.jpg?s=120&d=mm&r=g)
Hi, Lemburg. Thank you for organizing EuroPython 2020. I enjoyed watching some sessions from home. I think the current PEP 624 covers all your points and is ready for Steering Council discussion. Would you like to review the PEP before that? Regards, On Thu, Jul 9, 2020 at 8:19 AM Inada Naoki <songofacandy@gmail.com> wrote:
-- Inada Naoki <songofacandy@gmail.com>
![](https://secure.gravatar.com/avatar/0a2191a85455df6d2efdb22c7463c304.jpg?s=120&d=mm&r=g)
Hi Inada-san, thanks for attending EuroPython. I won't be back online until next Wednesday. Would it be possible to wait until then to continue the discussion ? Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts
On 04.08.2020 05:13, Inada Naoki wrote:
![](https://secure.gravatar.com/avatar/0a2191a85455df6d2efdb22c7463c304.jpg?s=120&d=mm&r=g)
Hi Inada-san,

thank you for adding some comments, but they are not really capturing what I think is missing:

""" Removing these APIs removes ability to use codec without temporary Unicode. Codecs can not encode Unicode buffer directly without temporary Unicode object since Python 3.3. All these APIs creates temporary Unicode object for now. So removing them doesn't reduce any abilities. """

The point is that while the decoders allow going from a C object to a Python object directly, we are missing a way to do the same for the encoders, since the Python 3.3 change in the Unicode internals. At the very least, we should have such APIs for going from wchar_t* to a Python object.

The alternatives you provide all require creating an intermediate Python object for this purpose. The APIs you want to remove do that as well, but that's not the point. The point is to expose the codecs' encode mechanism which is available in the C code, but currently not exposed via C APIs, e.g. ucs4lib_utf8_encode().

It would be a breaking change, but those APIs in your list could simply be changed from using Py_UNICODE to using wchar_t instead and then interface directly to the internal functions we have for the encoders. That would keep extensions working after a recompile, since Py_UNICODE is already a typedef to wchar_t.

Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Feb 01 2021)
On 22.01.2021 07:47, Inada Naoki wrote:
![](https://secure.gravatar.com/avatar/15b1cd41a4c23e7dc10893777afb4281.jpg?s=120&d=mm&r=g)
On Mon, Feb 1, 2021 at 4:47 PM M.-A. Lemburg <mal@egenix.com> wrote:
We cannot optimize all use cases. IMO we should only optimize conversions between char* and Python object. I don't see the need for two conversions (char* => Python and then Python => wchar_t*) as an issue if you need wchar_t*. Objects/unicodeobject.c is already very complex with specialization for ASCII, Py_UCS1 (latin1), Py_UCS2 and Py_UCS4 kinds: 16k lines of C code. I would prefer to make it simpler rather than more complex. Internally, functions like PyUnicode_EncodeLatin1() already do the two conversions. So it's not like the PEP has any impact on performance.
That would keep extensions working after a recompile, since Py_UNICODE is already a typedef to wchar_t.
Extensions should not use Py_UNICODE*/wchar_t*. Can you explain where wchar_t* type is appropriate and how two conversions is a performance bottleneck? Victor -- Night gathers, and now my watch begins. It shall not end until my death.
![](https://secure.gravatar.com/avatar/0a2191a85455df6d2efdb22c7463c304.jpg?s=120&d=mm&r=g)
On 01.02.2021 17:10, Victor Stinner wrote:
The C code is already there, but it got hidden away in the Python 3.3 change to new internals. All that needs to be done is remove the intermediate Python Unicode object creation and have those encoder APIs again interface to the native C code.
Before Python 3.3 all those APIs interfaced directly to the C codec functions. The introduction of an intermediate Python Unicode object was just done as quick work-around, even though it was not really needed, since Python 3.3 did not remove the C code of the encoders.
They should not use Py_UNICODE. wchar_t is standard C and is in wide spread use in C code for storing Unicode data. This was one of the main reasons for introducing UCS4 Python versions for Linux in the mid 2000s, since Linux apps used 4 byte wchar_t as native storage format. My point is that extensions would just need a recompile with the change from Py_UNICODE to wchar_t, since Py_UNICODE and wchar_t are already the same thing in Python 3.3+.
Can you explain where wchar_t* type is appropriate and how two conversions is a performance bottleneck?
If an extension has a wchar_t* string, it should be easy to convert this into a Python bytes object for use in Python. Just like it should be easy to go from a char* string to a Python str object. The PEP breaks this symmetry by removing access to the encoder implementations. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Feb 01 2021)
![](https://secure.gravatar.com/avatar/15b1cd41a4c23e7dc10893777afb4281.jpg?s=120&d=mm&r=g)
On Mon, Feb 1, 2021 at 5:39 PM M.-A. Lemburg <mal@egenix.com> wrote:
The C code is already there, but it got hidden away in the Python 3.3 change to new internals.
Well, we are not in agreement and it's ok. Your objection is written in the PEP. IMO it's now up to the Steering Council to decide if the overall PEP is ok or not. The PEP itself is now complete and lists advantages and drawbacks. Victor -- Night gathers, and now my watch begins. It shall not end until my death.
![](https://secure.gravatar.com/avatar/0a2191a85455df6d2efdb22c7463c304.jpg?s=120&d=mm&r=g)
On 01.02.2021 17:51, Victor Stinner wrote:
Please read my reply to Inada-san. If the PEP were complete and ok, I would not have written the email. The fix is pretty simple, doesn't add a lot more code and gets us the symmetry back that I had put into the Unicode C API when I created this back in 2000. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Feb 01 2021)
![](https://secure.gravatar.com/avatar/15b1cd41a4c23e7dc10893777afb4281.jpg?s=120&d=mm&r=g)
On Mon, Feb 1, 2021 at 5:58 PM M.-A. Lemburg <mal@egenix.com> wrote:
This sounds like a completely different PEP than PEP 624 (which aims to remove code, not add code). I suggest you propose your own PEP. Victor -- Night gathers, and now my watch begins. It shall not end until my death.
![](https://secure.gravatar.com/avatar/1fee087d7a1ca17c8ad348271819a8d5.jpg?s=120&d=mm&r=g)
On Mon, 1 Feb 2021 17:39:16 +0100 "M.-A. Lemburg" <mal@egenix.com> wrote:
Do you have any data points about "wide spread use"? I work in C++ daily and don't see any "wide spread use" of wchar_t (or its C++ cousin std::wstring). Modern APIs assume bytestrings and UTF-8 encoding. Regards Antoine.
![](https://secure.gravatar.com/avatar/33bd15feb2558d0050e863875e0f5f60.jpg?s=120&d=mm&r=g)
On 01/02/2021 17.39, M.-A. Lemburg wrote:
How much software actually uses wchar_t these days and interfaces with Python? Do you have examples for software that uses wchar_t and would benefit from wchar_t support in Python?

I did a quick search for wcslen in all shared libraries and binaries on my system. It's a good indicator how many programs actually use wchar_t. 126 out of more than 9,000 shared libraries and binaries contain the string "wcslen". The only hit for PyUnicode_AsWideCharString was libpypy3-c.so...

(Fedora has unified /usr and /lib64, e.g. /bin -> /usr/bin)

$ ls /usr/bin/ /usr/sbin/ | grep -v python | wc -l
4264
$ grep -R wcslen /usr/bin/ /usr/sbin/ | grep -v python | wc -l
92
$ find /usr/lib64/ -name '*.so' -not -name '*python*' | wc -l
5478
$ find /usr/lib64/ -name '*.so' -not -name '*python*' | xargs grep wcslen | wc -l
34

Christian
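For anyone who wants to repeat this kind of survey on another system, the same count can be sketched in Python. This is a rough equivalent under stated assumptions, not the commands used above: `count_files_containing` is a hypothetical helper, it counts files rather than grep output lines, and it reads each file fully into memory. The demo runs against a temporary directory so that it does not depend on any particular system layout.

```python
import os
import tempfile


def count_files_containing(root, needle):
    """Return (hits, total): files under root whose bytes contain needle."""
    hits = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    data = handle.read()
            except OSError:
                # Unreadable files and broken symlinks are skipped.
                continue
            total += 1
            if needle in data:
                hits += 1
    return hits, total


# Small self-contained demo with three fake "binaries".
with tempfile.TemporaryDirectory() as demo_root:
    samples = {
        "a.bin": b"\x00wcslen\x00",
        "b.bin": b"\x00strlen\x00",
        "c.bin": b"xxwcslenyy",
    }
    for filename, payload in samples.items():
        with open(os.path.join(demo_root, filename), "wb") as out:
            out.write(payload)
    demo_hits, demo_total = count_files_containing(demo_root, b"wcslen")

print(demo_hits, demo_total)
```

On a real system one would call it with a directory such as `/usr/bin` and the needle `b"wcslen"`.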
![](https://secure.gravatar.com/avatar/d995b462a98fea412efa79d17ba3787a.jpg?s=120&d=mm&r=g)
On Mon, 1 Feb 2021 at 17:19, Christian Heimes <christian@python.org> wrote:
This is very much a drive-by comment (I haven't been following this thread) so ignore me if this is already covered, but Windows APIs use wchar_t extensively. I routinely work with wchar_t when interfacing Windows API code and Python. But I have no idea what this PEP is proposing to drop, so as long as someone has ensured that the PEP won't adversely affect working with Windows APIs, I'm happy. Paul
![](https://secure.gravatar.com/avatar/be200d614c47b5a4dbb6be867080e835.jpg?s=120&d=mm&r=g)
On 2/1/2021 5:16 PM, Christian Heimes wrote:
Yeah, you searched the wrong kind of system ;) Pick up a Windows machine, cross-platform code that originated on Windows, anything that interoperates with Java or .NET as well, or uses wxWidgets. I'm not defending the choice of wchar_t over UTF-8 (but I can: most of these systems chose Unicode before UTF-8 was invented and never took the backwards-incompatible change because they were so popular), but if we want to pragmatically weigh the needs of our users above our desire for purity, then we should try and support both equally wherever possible. Cheers, Steve
![](https://secure.gravatar.com/avatar/351a10f392414345ed67a05e986dc4dd.jpg?s=120&d=mm&r=g)
On Tue, Feb 2, 2021 at 4:28 AM Steve Dower <steve.dower@python.org> wrote:
Note that we don't have a direct "utf8 (char*) to Python bytes object" encoder API. If PEP 624 is accepted, utf8 and wchar_t* become equal. So please don't think PEP 624 neglects only wchar_t*. Regards, -- Inada Naoki <songofacandy@gmail.com>
![](https://secure.gravatar.com/avatar/351a10f392414345ed67a05e986dc4dd.jpg?s=120&d=mm&r=g)
On Tue, Feb 2, 2021 at 12:43 AM M.-A. Lemburg <mal@egenix.com> wrote:
We already have PyUnicode_FromWideChar(). So I assume you mean "wchar_t* to Python bytes object".
OK, I see codecs.h has three encoders.

* utf8_encode
* utf16_encode
* utf32_encode

But there are 13 encoders in my PEP:

* PyUnicode_Encode()
* PyUnicode_EncodeASCII()
* PyUnicode_EncodeLatin1()
* PyUnicode_EncodeUTF7()
* PyUnicode_EncodeUTF8()
* PyUnicode_EncodeUTF16()
* PyUnicode_EncodeUTF32()
* PyUnicode_EncodeUnicodeEscape()
* PyUnicode_EncodeRawUnicodeEscape()
* PyUnicode_EncodeCharmap()
* PyUnicode_TranslateCharmap()
* PyUnicode_EncodeDecimal()
* PyUnicode_TransformDecimalToASCII()

Do you want to keep all encoders? or 3 encoders?
That would keep extensions working after a recompile, since Py_UNICODE is already a typedef to wchar_t.
That idea is written in the PEP already. https://www.python.org/dev/peps/pep-0624/#replace-py-unicode-with-wchar-t Regards, -- Inada Naoki <songofacandy@gmail.com>
![](https://secure.gravatar.com/avatar/0a2191a85455df6d2efdb22c7463c304.jpg?s=120&d=mm&r=g)
On 02.02.2021 00:33, Inada Naoki wrote:
Yes, that's what I meant. Encoding from wchar_t* to a Python bytes object. This is what the encoder APIs all implement. They have become less efficient with Python 3.3, but this can be resolved, while at the same time removing Py_UNICODE and replacing it with wchar_t in those encoder APIs.
We could keep all encoders, replacing Py_UNICODE with wchar_t in the API. For the ones where we have separate implementations as private functions, we can move back to direct encoding. For the others, we can keep using the temporary Unicode object or refactor the code to expose the native encoders working directly on the internal buffers as private functions and then use those in the same way for direct encoding. The Unicode API was meant and designed as a rich API, making it easy to use and providing a complete set for extension writers and CPython to use. I believe we should keep it that way.
Right and I think this is a more workable approach than removing APIs. BTW: I don't understand this comment: "They are inefficient on platforms wchar_t* is UTF-16. It is because built-in codecs supports only UCS-1, UCS-2, and UCS-4 input." Windows is one such platform. Java (indirectly) is another. They both store UTF-16LE in those arrays and Python's codecs handle this just fine. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Feb 02 2021)
![](https://secure.gravatar.com/avatar/351a10f392414345ed67a05e986dc4dd.jpg?s=120&d=mm&r=g)
On Tue, Feb 2, 2021 at 7:37 PM M.-A. Lemburg <mal@egenix.com> wrote:
I'm sorry that the section is not clear. For example, if wchar_t* is UCS4, ucs4_utf8_encoder() can encode wchar_t* into UTF-8. But when wchar_t* is UTF-16, ucs2_utf8_encoder() cannot handle surrogate pairs. We need to use a temporary Unicode object. That is what "inefficient" means. I will elaborate on this in the section. Regards, -- Inada Naoki <songofacandy@gmail.com>
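The difference between UCS-2 and UTF-16 input can be shown with a small Python sketch. It does not call the internal encoders named above; both helper functions are hypothetical names used only for illustration. The first treats every 16-bit unit as a standalone code point, which is roughly what an encoder written for UCS-2 input sees. The second interprets the same units as UTF-16 and combines surrogate pairs before encoding.

```python
def utf8_from_ucs2_units(units):
    """Hypothetical: encode each 16-bit unit as its own code point."""
    text = "".join(chr(unit) for unit in units)
    # "surrogatepass" lets lone surrogates through instead of raising.
    return text.encode("utf-8", "surrogatepass")


def utf8_from_utf16_units(units):
    """Hypothetical: interpret the units as UTF-16, then encode as UTF-8."""
    raw = b"".join(unit.to_bytes(2, "little") for unit in units)
    return raw.decode("utf-16-le").encode("utf-8")


# U+1F600 is stored in UTF-16 as the surrogate pair D83D DE00.
pair = [0xD83D, 0xDE00]
as_ucs2 = utf8_from_ucs2_units(pair)
as_utf16 = utf8_from_utf16_units(pair)
print(as_ucs2.hex(), as_utf16.hex())
```

The per-unit result is six bytes that are not valid UTF-8 for U+1F600, while the pair-aware result is the expected four-byte sequence. With the default strict error handler the per-unit version would raise `UnicodeEncodeError` instead.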
![](https://secure.gravatar.com/avatar/fd079755731ad59ebedad02b690340e8.jpg?s=120&d=mm&r=g)
On Tue, Feb 2, 2021 at 3:47 AM Inada Naoki <songofacandy@gmail.com> wrote:
Since real UCS-2 is effectively dead, maybe it should be flipped around: Make UTF-16 be the efficient path and UCS-2 be the path that needs to round-trip through Unicode. But I suppose that's out of scope for this PEP. -Em
![](https://secure.gravatar.com/avatar/351a10f392414345ed67a05e986dc4dd.jpg?s=120&d=mm&r=g)
On Tue, Feb 2, 2021 at 9:40 PM Emily Bowman <silverbacknet@gmail.com> wrote:
Note that ucs2_utf8_encoder() is currently used only for encoding Python Unicode objects. A Unicode object is latin1, UCS2, or UCS4. It is never UTF-16. So if we add UTF-16 support to ucs2_utf8_encoder(), we need to add and maintain code only for PyUnicode_EncodeUTF8 (encoding from wchar_t* into char*). I don't think it is a good deal. As described in the PEP, encoder APIs are used very rarely. We must not add any maintenance costs for them. Regards, -- Inada Naoki <songofacandy@gmail.com>
![](https://secure.gravatar.com/avatar/15b1cd41a4c23e7dc10893777afb4281.jpg?s=120&d=mm&r=g)
On Tue, Feb 2, 2021 at 11:47 PM Inada Naoki <songofacandy@gmail.com> wrote:
I fixed tons of bugs in the Python 2.7 and Python 3 codecs before PEP 393 (compact strings) to properly handle 16-bit wchar_t, that is, to properly handle surrogate characters. The implementation was complex and slow. I would prefer to not move backwards to that :-( If you are curious, look into the PyUnicode_FromWideChar() implementation and search for find_maxchar_surrogates() to get an idea of the cost of handling UTF-16 surrogate pairs. For a full codec, it's way more complex, painful to write and to maintain. I'm happy that we were able to remove that thanks to PEP 393! Victor -- Night gathers, and now my watch begins. It shall not end until my death.
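To give a rough idea of the extra pass involved, here is a minimal Python sketch of scanning 16-bit code units for the largest code point and the number of surrogate pairs. It is not the CPython implementation of find_maxchar_surrogates(); `scan_utf16_units` is a hypothetical name and the logic is only an assumption about what such a scan has to compute before a string of the right width can be allocated.

```python
def scan_utf16_units(units):
    """Return (max_code_point, surrogate_pair_count) for 16-bit units."""
    max_code_point = 0
    pair_count = 0
    index = 0
    length = len(units)
    while index < length:
        unit = units[index]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = index + 1 < length and 0xDC00 <= units[index + 1] <= 0xDFFF
        if is_high and has_low:
            # Combine a high and a low surrogate into one code point.
            low = units[index + 1]
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            pair_count += 1
            index += 2
        else:
            # BMP character or a lone surrogate, kept as-is.
            code_point = unit
            index += 1
        max_code_point = max(max_code_point, code_point)
    return max_code_point, pair_count


result = scan_utf16_units([0x48, 0xD83D, 0xDE00])
print(hex(result[0]), result[1])
```

Every encoder that accepted UTF-16 input directly would need this kind of pair detection in its inner loop, in addition to the existing error handling for lone surrogates.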
![](https://secure.gravatar.com/avatar/351a10f392414345ed67a05e986dc4dd.jpg?s=120&d=mm&r=g)
On Tue, Feb 2, 2021 at 8:40 PM Inada Naoki <songofacandy@gmail.com> wrote:
I updated the "Alternative Ideas" section of the PEP. https://www.python.org/dev/peps/pep-0624/#alternative-ideas They replace `Py_UNICODE*` with `PyObject*`, `Py_UCS4*`, and `wchar_t*`. I explicitly noted that some codecs can bypass temporary Unicode objects: """ UTF-8, UTF-16, UTF-32 encoders support Py_UCS4 internally. So PyUnicode_EncodeUTF8(), PyUnicode_EncodeUTF16(), and PyUnicode_EncodeUTF32() can avoid to create a temporary Unicode object. """ -- Inada Naoki <songofacandy@gmail.com>
participants (9)
- Antoine Pitrou
- Christian Heimes
- Emily Bowman
- Inada Naoki
- Jim J. Jewett
- M.-A. Lemburg
- Paul Moore
- Steve Dower
- Victor Stinner