Plan to remove Py_UNICODE APIs except PEP 623.
Hi, all. I proposed PEP 623 to remove Unicode APIs deprecated by PEP 393. In this thread, I am proposing the removal of the Py_UNICODE (not Unicode object) APIs deprecated by PEP 393. Please reply with any comments.

## Undocumented, have Py_DEPRECATED

There is no problem with removing them in Python 3.10. I will just do it.

* Py_UNICODE_str*** functions -- already removed in https://github.com/python/cpython/pull/21164
* PyUnicode_GetMax()

## Documented and have Py_DEPRECATED

* PyLong_FromUnicode
* PyUnicode_AsUnicodeCopy
* PyUnicode_Encode
* PyUnicode_EncodeUTF7
* PyUnicode_EncodeUTF8
* PyUnicode_EncodeUTF16
* PyUnicode_EncodeUTF32
* PyUnicode_EncodeUnicodeEscape
* PyUnicode_EncodeRawUnicodeEscape
* PyUnicode_EncodeLatin1
* PyUnicode_EncodeASCII
* PyUnicode_EncodeCharmap
* PyUnicode_TranslateCharmap
* PyUnicode_EncodeMBCS

These APIs are documented. The documentation has a ``.. deprecated:: 3.3 4.0`` directive, and they have also carried `Py_DEPRECATED` since Python 3.6.

Plan: Change the documentation to ``.. deprecated:: 3.3 3.10`` and remove them in Python 3.10.

## PyUnicode_EncodeDecimal

It is not documented. It has not been deprecated by Py_DEPRECATED.

Plan: Add Py_DEPRECATED in Python 3.9 and remove it in 3.11.

## PyUnicode_TransformDecimalToASCII

It is documented, but doesn't have a ``deprecated`` directive. It is not deprecated by Py_DEPRECATED.

Plan: Add Py_DEPRECATED and a ``deprecated 3.3 3.11`` directive in 3.9, and remove it in 3.11.

## _PyUnicode_ToLowercase, _PyUnicode_ToUppercase

They are deprecated not by PEP 393, but by bpo-12736. They are documented as deprecated, but don't have ``Py_DEPRECATED``.

Plan: Add Py_DEPRECATED in 3.9, and remove them in 3.11.

Note: _PyUnicode_ToTitlecase has Py_DEPRECATED. It can be removed in 3.10.

-- Inada Naoki <songofacandy@gmail.com>
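As a minimal sketch of the kind of migration this plan implies (an editorial illustration, not code from the thread; the helper names are made up):

    #include <Python.h>

    /* Before: the deprecated Py_UNICODE path scheduled for removal. */
    static PyObject *
    encode_old(PyObject *text)
    {
        Py_ssize_t size;
        Py_UNICODE *buf = PyUnicode_AsUnicodeAndSize(text, &size); /* deprecated */
        if (buf == NULL)
            return NULL;
        return PyUnicode_EncodeUTF8(buf, size, "strict");          /* deprecated */
    }

    /* After: operate on the str object directly. */
    static PyObject *
    encode_new(PyObject *text)
    {
        return PyUnicode_AsEncodedString(text, "utf-8", "strict");
    }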
Hi Inada-san,

as you may remember, I wasn't happy with the deprecations of the APIs in PEP 393, since there are no C API alternatives for the encoding APIs deprecated in the PEP, which allow direct encoding provided by these important codecs. AFAIK, the situation hasn't changed since then.

We can't just remove access to one half of a codec (the decoding part) without at least providing an alternative for C extensions to use. Py_UNICODE can be removed from the API, but only if there are alternative APIs which C extensions can use to the same effect. Given PEP 393, this would be APIs which use wchar_t instead of Py_UNICODE.

Thanks,
-- Marc-Andre Lemburg
On 28.06.2020 04:35, Inada Naoki wrote:
Hi, all.
I proposed PEP 623 to remove Unicode APIs deprecated by PEP 393.
In this thread, I am proposing the removal of the Py_UNICODE (not Unicode object) APIs deprecated by PEP 393. Please reply with any comments.
## Undocumented, have Py_DEPRECATED
There is no problem with removing them in Python 3.10. I will just do it.
* Py_UNICODE_str*** functions -- already removed in https://github.com/python/cpython/pull/21164
* PyUnicode_GetMax()
## Documented and have Py_DEPRECATED
* PyLong_FromUnicode
* PyUnicode_AsUnicodeCopy
* PyUnicode_Encode
* PyUnicode_EncodeUTF7
* PyUnicode_EncodeUTF8
* PyUnicode_EncodeUTF16
* PyUnicode_EncodeUTF32
* PyUnicode_EncodeUnicodeEscape
* PyUnicode_EncodeRawUnicodeEscape
* PyUnicode_EncodeLatin1
* PyUnicode_EncodeASCII
* PyUnicode_EncodeCharmap
* PyUnicode_TranslateCharmap
* PyUnicode_EncodeMBCS
These APIs are documented. The documentation has a ``.. deprecated:: 3.3 4.0`` directive, and they have also carried `Py_DEPRECATED` since Python 3.6.
Plan: Change the documentation to ``.. deprecated:: 3.3 3.10`` and remove them in Python 3.10.
## PyUnicode_EncodeDecimal
It is not documented. It has not been deprecated by Py_DEPRECATED.
Plan: Add Py_DEPRECATED in Python 3.9 and remove it in 3.11.
## PyUnicode_TransformDecimalToASCII
It is documented, but doesn't have a ``deprecated`` directive. It is not deprecated by Py_DEPRECATED.
Plan: Add Py_DEPRECATED and a ``deprecated 3.3 3.11`` directive in 3.9, and remove it in 3.11.
## _PyUnicode_ToLowercase, _PyUnicode_ToUppercase
They are deprecated not by PEP 393, but by bpo-12736. They are documented as deprecated, but don't have ``Py_DEPRECATED``.
Plan: Add Py_DEPRECATED in 3.9, and remove them in 3.11.
Note: _PyUnicode_ToTitlecase has Py_DEPRECATED. It can be removed in 3.10.
Hi, Lemburg. Thank you for the quick response.
We can't just remove access to one half of a codec (the decoding part) without at least providing an alternative for C extensions to use.
Py_UNICODE can be removed from the API, but only if there are alternative APIs which C extensions can use to the same effect.
Given PEP 393, this would be APIs which use wchar_t instead of Py_UNICODE.
The decoding part is implemented as `const char *` -> `PyObject*` (Unicode object). I think this is reasonable, since `const char *` is a perfect abstraction for the encoded string.

In the case of the encoding part, `wchar_t *` is not a perfect abstraction for the (decoded) Unicode string. Converting from a Unicode object into `wchar_t *` is not zero-cost. I think `PyObject *` (Unicode object) -> `PyObject *` (bytes object) is a better signature than `wchar_t *` -> `PyObject *` (bytes object) for encoders, because:

* The Unicode object is more important than `wchar_t *` in Python.
* All PyUnicode_EncodeXXX APIs are implemented with PyUnicode_FromWideChar.

For example, we have these private encode APIs:

* PyObject* _PyUnicode_AsASCIIString(PyObject *unicode, const char *errors)
* PyObject* _PyUnicode_AsLatin1String(PyObject *unicode, const char *errors)
* PyObject* _PyUnicode_AsUTF8String(PyObject *unicode, const char *errors)
* PyObject* _PyUnicode_EncodeUTF16(PyObject *unicode, const char *errors, int byteorder)
...

So how about making them public, instead of undeprecating the Py_UNICODE* encode APIs?

1. Add PyUnicode_AsXXXBytes public APIs in Python 3.10. The current private APIs can become macros (e.g. #define _PyUnicode_AsASCIIString PyUnicode_AsASCIIBytes) or deprecated static inline functions.
2. Remove the Py_UNICODE* encode APIs in Python 3.12.

Regards,
-- Inada Naoki <songofacandy@gmail.com>
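As a minimal sketch of what such a public wrapper could look like (PyUnicode_AsASCIIBytes is the hypothetical name from the proposal above, not an existing API):

    #include <Python.h>

    /* Hypothetical public object->bytes encoder, delegating to the
       existing private helper; the name comes from the proposal above. */
    PyObject *
    PyUnicode_AsASCIIBytes(PyObject *unicode, const char *errors)
    {
        return _PyUnicode_AsASCIIString(unicode, errors);
    }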
On Sun, Jun 28, 2020 at 11:24 PM Inada Naoki <songofacandy@gmail.com> wrote:
So how about making them public, instead of undeprecating the Py_UNICODE* encode APIs?
1. Add PyUnicode_AsXXXBytes public APIs in Python 3.10. The current private APIs can become macros (e.g. #define _PyUnicode_AsASCIIString PyUnicode_AsASCIIBytes) or deprecated static inline functions.
2. Remove the Py_UNICODE* encode APIs in Python 3.12.
A more aggressive idea: override the current PyUnicode_EncodeXXX() APIs, changing them from `Py_UNICODE *object` to `PyObject *unicode`.

This idea might look crazy. But the PyUnicode_EncodeXXX APIs have been deprecated for a long time, and there are only a few users. I grepped 3874 source packages from the top 4000 downloaded packages (126 packages are wheel-only):

$ rg -w PyUnicode_EncodeASCII
Cython-0.29.20/Cython/Includes/cpython/unicode.pxd
424: bytes PyUnicode_EncodeASCII(Py_UNICODE *s, Py_ssize_t size, char *errors)

$ rg -w PyUnicode_EncodeLatin1
Cython-0.29.20/Cython/Includes/cpython/unicode.pxd
406: bytes PyUnicode_EncodeLatin1(Py_UNICODE *s, Py_ssize_t size, char *errors)

$ rg -w PyUnicode_EncodeUTF7
(no output)

$ rg -w PyUnicode_EncodeUTF8
subprocess32-3.5.4/_posixsubprocess_helpers.c
38: return PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(unicode),
pyodbc-4.0.30/src/params.cpp
1932: bytes = PyUnicode_EncodeUTF8(source, cb, "strict");
pyodbc-4.0.30/src/cnxninfo.cpp
45: Object bytes(PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(p), PyUnicode_GET_SIZE(p), 0));
50: Object bytes(PyUnicode_Check(p) ? PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(p), PyUnicode_GET_SIZE(p), 0) : 0);
Cython-0.29.20/Cython/Includes/cpython/unicode.pxd
304: bytes PyUnicode_EncodeUTF8(Py_UNICODE *s, Py_ssize_t size, char *errors)

Note that subprocess32 is a Python 2-only project, so only pyodbc-4.0.30 uses this API.
https://github.com/mkleehammer/pyodbc/blob/b4ea03220dd8243e452c91689bef34823...
https://github.com/mkleehammer/pyodbc/blob/master/src/cnxninfo.cpp#L45

Anyway, the current PyUnicode_EncodeXXX APIs are not commonly used. I don't think they are worth undeprecating.

Regards,
-- Inada Naoki <songofacandy@gmail.com>
On Mon, Jun 29, 2020 at 12:17 AM Inada Naoki <songofacandy@gmail.com> wrote:
A more aggressive idea: override the current PyUnicode_EncodeXXX() APIs, changing them from `Py_UNICODE *object` to `PyObject *unicode`.
This is a list of PyUnicode_EncodeXXXX usage in the top 4000 packages:
https://gist.github.com/methane/0f97391c9dbf5b53a818aa39a8285a29

Scandir uses PyUnicode_EncodeMBCS only in an `#if PY_MAJOR_VERSION < 3 && defined(MS_WINDOWS)` block, so it is a false positive. Cython has prototypes of these APIs. pyodbc uses PyUnicode_EncodeUTF16 and PyUnicode_EncodeUTF8, but pyodbc is converting a Unicode object into a bytes object, so the current API is very inefficient for it.

That's all. Now I think it is safe to override the deprecated APIs with the private APIs that accept a Unicode object:

* _PyUnicode_EncodeUTF7 -> PyUnicode_EncodeUTF7
* _PyUnicode_AsUTF8String -> PyUnicode_EncodeUTF8
* _PyUnicode_EncodeUTF16 -> PyUnicode_EncodeUTF16
* _PyUnicode_EncodeUTF32 -> PyUnicode_EncodeUTF32
* _PyUnicode_AsLatin1String -> PyUnicode_EncodeLatin1
* _PyUnicode_AsASCIIString -> PyUnicode_EncodeASCII
* _PyUnicode_EncodeCharmap -> PyUnicode_EncodeCharmap

-- Inada Naoki <songofacandy@gmail.com>
Le dim. 28 juin 2020 à 04:39, Inada Naoki <songofacandy@gmail.com> a écrit :
## Documented and have Py_DEPRECATED
* PyLong_FromUnicode
* PyUnicode_AsUnicodeCopy
* PyUnicode_Encode
* PyUnicode_EncodeUTF7
* PyUnicode_EncodeUTF8
* PyUnicode_EncodeUTF16
* PyUnicode_EncodeUTF32
* PyUnicode_EncodeUnicodeEscape
* PyUnicode_EncodeRawUnicodeEscape
* PyUnicode_EncodeLatin1
* PyUnicode_EncodeASCII
* PyUnicode_EncodeCharmap
* PyUnicode_TranslateCharmap
* PyUnicode_EncodeMBCS
These APIs are documented. The documentation has a ``.. deprecated:: 3.3 4.0`` directive, and they have also carried `Py_DEPRECATED` since Python 3.6.
Plan: Change the documentation to ``.. deprecated:: 3.3 3.10`` and remove them in Python 3.10.

".. deprecated" markups are nice, but not easy to discover. It would help to add a "Deprecated" section to the C API Changes and list the functions scheduled for removal in the next Python version: https://docs.python.org/dev/whatsnew/3.9.html#c-api-changes

I understand that these ".. deprecated" markups will be added to 3.8 and 3.9 documentation, right?

For each function, it would be nice to suggest a replacement function. For example, PyUnicode_EncodeMBCS() (Py_UNICODE*) can be replaced with PyUnicode_EncodeCodePage() using code_page=CP_ACP (PyObject*).
## PyUnicode_EncodeDecimal
It is not documented. It has not been deprecated by Py_DEPRECATED. Plan: Add Py_DEPRECATED in Python 3.9 and remove it in 3.11.
I understood that the replacement function is the private _PyUnicode_TransformDecimalAndSpaceToASCII() function. This function is used by complex, float and int types to convert a string into a number.
## PyUnicode_TransformDecimalToASCII
It is documented, but doesn't have a ``deprecated`` directive. It is not deprecated by Py_DEPRECATED.
Plan: Add Py_DEPRECATED and a ``deprecated 3.3 3.11`` directive in 3.9, and remove it in 3.11.
I don't think that we need to expose such a function as part of the public C API. IMHO it was only exposed to be consumed by Python itself, so I don't think that we need to provide a replacement function. After the function is removed, if someone complains, we can design a new replacement function. But I prefer not to *guess* what the exact use case is.
## _PyUnicode_ToLowercase, _PyUnicode_ToUppercase
They are deprecated not by PEP 393, but by bpo-12736.
They are documented as deprecated, but don't have ``Py_DEPRECATED``.
Plan: Add Py_DEPRECATED in 3.9, and remove them in 3.11.
Note: _PyUnicode_ToTitlecase has Py_DEPRECATED. It can be removed in 3.10.
bpo-12736 is "Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation". IMHO the replacement is to call the lower() and upper() methods of a Python str object. If you change the 3.9 documentation, please also update 3.8 doc. Victor -- Night gathers, and now my watch begins. It shall not end until my death.
Le dim. 28 juin 2020 à 11:22, M.-A. Lemburg <mal@egenix.com> a écrit :
as you may remember, I wasn't happy with the deprecations of the APIs in PEP 393, since there are no C API alternatives for the encoding APIs deprecated in the PEP, which allow direct encoding provided by these important codecs.
AFAIK, the situation hasn't changed since then.
I would prefer to analyze the list on a case by case basis. I don't think that it's useful to expose every single encoding supported by Python as a C function. I would prefer to only have a fast-path for the most common encodings: ASCII, Latin1, UTF-8, Windows ANSI code page. That's all.

For any other encodings, the general PyUnicode_AsEncodedString() and PyUnicode_Decode() functions are good enough. If someone expects an overhead of passing a string, please prove it with a benchmark. But IMO a small overhead is acceptable for rare encodings.

Note: PyUnicode_AsEncodedString() and PyUnicode_Decode() also have "fast paths" for the most common encodings: ASCII, UTF-8, "mbcs" (the Python alias of the Windows ANSI code page), Latin1. But also UTF-16 and UTF-32: I'm not sure it's really worth it to have these ones, but it was cheap to have them :-)
We can't just remove access to one half of a codec (the decoding part) without at least providing an alternative for C extensions to use.
I disagree, we can. The alternative exists since Python 2: PyUnicode_AsEncodedString() and PyUnicode_Decode().
Given PEP 393, this would be APIs which use wchar_t instead of Py_UNICODE.
Using wchar_t is inefficient on all platforms using 16-bit wchar_t since surrogate pairs need a special code path. For example, PyUnicode_FromWideChar() has to scan the string twice: the first time to count the number of surrogate pairs, to allocate the exact buffer size. Victor -- Night gathers, and now my watch begins. It shall not end until my death.
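For reference, the generic pair Victor points to covers any registered codec; a minimal sketch (the codec name here is just an example):

    #include <Python.h>

    /* Decode raw bytes to str and encode back, using the generic
       codec entry points instead of per-encoding functions. */
    static PyObject *
    recode(const char *raw, Py_ssize_t len)
    {
        PyObject *text = PyUnicode_Decode(raw, len, "utf-7", "strict");
        if (text == NULL)
            return NULL;
        PyObject *data = PyUnicode_AsEncodedString(text, "utf-7", "strict");
        Py_DECREF(text);
        return data;   /* NULL on error, with an exception set */
    }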
Le dim. 28 juin 2020 à 17:21, Inada Naoki <songofacandy@gmail.com> a écrit :
A more aggressive idea: override the current PyUnicode_EncodeXXX() APIs, changing them from `Py_UNICODE *object` to `PyObject *unicode`.
This idea might look crazy. But the PyUnicode_EncodeXXX APIs have been deprecated for a long time, and there are only a few users. I grepped 3874 source packages from the top 4000 downloaded packages (126 packages are wheel-only).
IMO it's a violation of the C API stability guarantee. I would prefer to use different function names to ensure that building an old C extension fails with a compiler error, rather than emit a compiler warning and crash at runtime. Victor -- Night gathers, and now my watch begins. It shall not end until my death.
Le lun. 29 juin 2020 à 08:41, Inada Naoki <songofacandy@gmail.com> a écrit :
That's all. Now I think it is safe to override the deprecated APIs with the private APIs that accept a Unicode object.
* _PyUnicode_EncodeUTF7 -> PyUnicode_EncodeUTF7
Use PyUnicode_AsEncodedString("UTF-7"). This encoding is not common enough to justify having to maintain a public C API just for it. Adding public C API functions has a cost in CPython, but also in other Python implementations, which then have to maintain it as well. The C API is too large, we have to make it smaller, not larger.
* _PyUnicode_AsUTF8String -> PyUnicode_EncodeUTF8
Use PyUnicode_AsUTF8String(), or PyUnicode_AsEncodedString() if you need to pass errors.
* _PyUnicode_EncodeUTF16 -> PyUnicode_EncodeUTF16
Use PyUnicode_AsUTF16String(), or PyUnicode_AsEncodedString() if you need to pass errors or the byte order.
* _PyUnicode_EncodeUTF32 -> PyUnicode_EncodeUTF32
Who uses UTF-32? There is PyUnicode_AsUTF32String().
* _PyUnicode_AsLatin1String -> PyUnicode_EncodeLatin1
PyUnicode_AsLatin1String()
* _PyUnicode_AsASCIIString -> PyUnicode_EncodeASCII
PyUnicode_AsASCIIString()
* _PyUnicode_EncodeCharmap -> PyUnicode_EncodeCharmap
PyUnicode_AsCharmapString() Victor -- Night gathers, and now my watch begins. It shall not end until my death.
Many existing public APIs don't have a `const char *errors` argument. As there are very few users, we can ignore that limitation. On the other hand, some encodings have special options:

* UTF-16 and UTF-32: the `int byteorder` parameter.
* UTF-7: int base64SetO, int base64WhiteSpace.

So PyUnicode_AsEncodedString cannot replace them.

Regards,
-- Inada Naoki <songofacandy@gmail.com>
Le lun. 29 juin 2020 à 12:36, Inada Naoki <songofacandy@gmail.com> a écrit :
* UTF-16 and UTF-32: the `int byteorder` parameter.
* byte_order=0 means the "UTF-16" encoding (native order, with BOM)
* byte_order<0 means the "UTF-16-LE" encoding
* byte_order>0 means the "UTF-16-BE" encoding

The same applies to UTF-32.
* UTF-7: int base64SetO, int base64WhiteSpace.
Does anyone use these parameters? I would prefer to ensure that they are used before continuing to maintain code to support these parameters. Victor -- Night gathers, and now my watch begins. It shall not end until my death.
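As a sketch of how that byteorder convention maps onto names the generic API accepts (an illustrative helper, not an existing CPython function):

    #include <Python.h>

    /* Map PyUnicode_EncodeUTF16's byteorder convention onto codec
       names accepted by the generic encoder. */
    static PyObject *
    encode_utf16(PyObject *text, int byteorder, const char *errors)
    {
        const char *codec = (byteorder == 0) ? "utf-16"      /* native + BOM */
                          : (byteorder < 0) ? "utf-16-le"
                                            : "utf-16-be";
        return PyUnicode_AsEncodedString(text, codec, errors);
    }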
On Mon, Jun 29, 2020 at 6:51 PM Victor Stinner <vstinner@python.org> wrote:
I understand that these ".. deprecated" markups will be added to 3.8 and 3.9 documentation, right?
They are documented as "Deprecated since version 3.3, will be removed in version 4.0" already. I am proposing s/4.0/3.10/ in 3.8 and 3.9 documents.
For each function, it would be nice to suggest a replacement function. For example, PyUnicode_EncodeMBCS() (Py_UNICODE*) can be replaced with PyUnicode_EncodeCodePage() using code_page=CP_ACP (PyObject*).
Of course.
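For illustration, that replacement could look like this (a sketch; PyUnicode_EncodeCodePage() exists on Windows only):

    #include <Python.h>
    #ifdef MS_WINDOWS
    #include <windows.h>            /* CP_ACP: the ANSI code page */

    /* Encode a str object to the Windows ANSI code page without
       touching any Py_UNICODE-based API. */
    static PyObject *
    encode_ansi(PyObject *text)
    {
        return PyUnicode_EncodeCodePage(CP_ACP, text, "strict");
    }
    #endif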
## PyUnicode_EncodeDecimal
It is not documented. It has not been deprecated by Py_DEPRECATED. Plan: Add Py_DEPRECATED in Python 3.9 and remove it in 3.11.
I understood that the replacement function is the private _PyUnicode_TransformDecimalAndSpaceToASCII() function. This function is used by complex, float and int types to convert a string into a number.
Should we make it public?
## _PyUnicode_ToLowercase, _PyUnicode_ToUppercase
They are deprecated not by PEP 393, but by bpo-12736.
They are documented as deprecated, but don't have ``Py_DEPRECATED``.
Plan: Add Py_DEPRECATED in 3.9, and remove them in 3.11.
Note: _PyUnicode_ToTitlecase has Py_DEPRECATED. It can be removed in 3.10.
bpo-12736 is "Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation". IMHO the replacement is to call the lower() and upper() methods of a Python str object.
We have private functions: _PyUnicode_ToTitleFull, _PyUnicode_ToLowerFull, and _PyUnicode_ToUpperFull. I am not sure we should make them public too.
If you change the 3.9 documentation, please also update 3.8 doc.
I see. -- Inada Naoki <songofacandy@gmail.com>
Le lun. 29 juin 2020 à 12:50, Inada Naoki <songofacandy@gmail.com> a écrit :
## PyUnicode_EncodeDecimal
It is not documented. It has not been deprecated by Py_DEPRECATED. Plan: Add Py_DEPRECATED in Python 3.9 and remove it in 3.11.
I understood that the replacement function is the private _PyUnicode_TransformDecimalAndSpaceToASCII() function. This function is used by complex, float and int types to convert a string into a number.
Should we make it public?
In the past, we exposed everything "just in case" someone would like to use it. 30 years later, the C API has hundreds of functions, we don't know which ones are used or not, the C API is not well tested, etc. Unless there is a clear user request with a strong use case which cannot be solved with existing functions, I suggest to *not* add any new C API function. Victor -- Night gathers, and now my watch begins. It shall not end until my death.
On 29.06.2020 11:57, Victor Stinner wrote:
Le dim. 28 juin 2020 à 11:22, M.-A. Lemburg <mal@egenix.com> a écrit :
as you may remember, I wasn't happy with the deprecations of the APIs in PEP 393, since there are no C API alternatives for the encoding APIs deprecated in the PEP, which allow direct encoding provided by these important codecs.
AFAIK, the situation hasn't changed since then.
I would prefer to analyze the list on a case by case basis. I don't think that it's useful to expose every single encoding supported by Python as a C function.
I designed the Unicode C API as a rich API, so that it's easy to use from C extensions and the interpreter as well. The main theme was to have a symmetric API for both encoding and decoding.

The PEP now suggests removing the API on the basis of deprecating Py_UNICODE, which is a change in data type. This does not mean we have to give up the symmetry in the C API, or that the encoding APIs are now suddenly useless. It only means that we have to replace Py_UNICODE with one of the supported data types for storing Unicode. Since the C world has adopted wchar_t for this purpose, it's the natural choice.
I would prefer to only have a fast-path for the most common encodings: ASCII, Latin1, UTF-8, Windows ANSI code page. That's all.
For any other encodings, the general PyUnicode_AsEncodedString() and PyUnicode_Decode() functions are good enough.
PyUnicode_AsEncodedString() converts Unicode objects to a bytes object. This is not a symmetric replacement for the PyUnicode_Encode*() APIs, since those go from Py_UNICODE to a bytes object.
If someone expects an overhead of passing a string, please prove it with a benchmark. But IMO a small overhead is acceptable for rare encodings.
Note: PyUnicode_AsEncodedString() and PyUnicode_Decode() also have "fast paths" for the most common encodings: ASCII, UTF-8, "mbcs" (the Python alias of the Windows ANSI code page), Latin1. But also UTF-16 and UTF-32: I'm not sure it's really worth it to have these ones, but it was cheap to have them :-)
We can't just remove access to one half of a codec (the decoding part) without at least providing an alternative for C extensions
Sorry, I meant the "encoding part".
to use.
I disagree, we can. The alternative exists since Python 2: PyUnicode_AsEncodedString() and PyUnicode_Decode().
See above. If we remove the direct encoding/decoding C APIs, we should at the very least provide generic alternatives which can be used as drop-in replacements for the PyUnicode_Encode*() APIs.
Given PEP 393, this would be APIs which use wchar_t instead of Py_UNICODE.
Using wchar_t is inefficient on all platforms using 16-bit wchar_t since surrogate pairs need a special code path. For example, PyUnicode_FromWideChar() has to scan the string twice: the first time to count the number of surrogate pairs, to allocate the exact buffer size.
If you want full UCS4 compatibility, that's true, but those platforms suffer from this deficiency platform-wide, so Python is in no way special.

The main point is that wchar_t is the standard in C to represent Unicode code points, so it's a natural choice as a replacement for Py_UNICODE. Since the C API is not only meant to be used by the CPython interpreter, we should stick to standards rather than expecting the world to adapt to our implementations. This also makes the APIs future-proof, e.g. in case we make another transition from the current hybrid internal data type for Unicode towards UTF-8 buffers as the internal data type.

Cheers,
-- Marc-Andre Lemburg
On 30.06.2020 13:16, Richard Damon wrote: (quoted below)
On 6/30/20 7:53 AM, M.-A. Lemburg wrote:
Since the C world has adopted wchar_t for this purpose, it's the natural choice.
I would disagree with this comment. Microsoft Windows has chosen to use 'wchar_t' for Unicode, because they adopted UCS-2 before it morphed into UTF-16 due to the expansion of Unicode above 16 bits. The *nix side of the world has chosen to use UTF-8 as the preferred way to store Unicode characters.

Also, in Windows, wchar_t doesn't really meet the requirements for what C defines wchar_t to mean, as wchar_t is supposed to represent every character as a single unit, and thus would need to be at least a 21 bit type (typically, it would be a 32 bit type), but Windows makes it a 16 bit type due to ABIs being locked before the Unicode expansion.

-- Richard Damon
On 30/06/2020 13:16, Richard Damon wrote:
On 6/30/20 7:53 AM, M.-A. Lemburg wrote:
Since the C world has adopted wchar_t for this purpose, it's the natural choice.
I would disagree with this comment. Microsoft Windows has chosen to use 'wchar_t' for Unicode, because they adopted UCS-2 before it morphed into UTF-16 due to the expansion of Unicode above 16 bits. The *nix side of the world has chosen to use UTF-8 as the preferred way to store Unicode characters.
Also, in Windows, wchar_t doesn't really meet the requirements for what C defines wchar_t to mean, as wchar_t is supposed to represent every character as a single unit, and thus would need to be at least a 21 bit type (typically, it would be a 32 bit type), but Windows makes it a 16 bit type due to ABIs being locked before the Unicode expansion.
Seconded. I've had to do cross-platform (Linux and Windows)* Unicode work in C. Using wchar_t was eventually rejected as infeasible.

* Sorry, I had a Blues Brothers moment.

-- Rhodri James *-* Kynesim Ltd
I completely agree with this: UTF-8 has become the One True Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside of the Win32 API. Nearly all basic emoji can't be represented in UCS-2 wchar_t, let alone composite emoji.

So how to make that C-compatible? Make everything a void* and it just comes back with as many bytes as it gets?

On Tue, Jun 30, 2020 at 5:22 AM Richard Damon <Richard@damon-family.org> wrote:
On 6/30/20 7:53 AM, M.-A. Lemburg wrote:
Since the C world has adopted wchar_t for this purpose, it's the natural choice.
I would disagree with this comment. Microsoft Windows has chosen to use 'wchar_t' for Unicode, because they adopted UCS-2 before it morphed into UTF-16 due to the expansion of Unicode above 16 bits. The *nix side of the world has chosen to use UTF-8 as the preferred way to store Unicode characters.
Also, in Windows, wchar_t doesn't really meet the requirements for what C defines wchar_t to mean, as wchar_t is supposed to represent every character as a single unit, and thus would need to be at least a 21 bit type (typically, it would be a 32 bit type), but Windows makes it a 16 bit type due to ABIs being locked before the Unicode expansion.
-- Richard Damon
On 6/30/20 8:43 AM, Emily Bowman wrote:
I completely agree with this, that UTF-8 has become the One True Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside of the Win32 API. Nearly all basic emoji can't be represented in UCS-2 wchar_t, let alone composite emoji.
So how to make that C-compatible? Make everything a void* and it just comes back with as many bytes as it gets?
Actually, in C you would tend to represent UTF-8 as a char* (or maybe an unsigned char*) type. This points out that straight 'ASCII' strings are also UTF-8, and that many of the string functions will actually work OK with UTF-8 strings; this was an intentional part of the design of UTF-8. Anything looking for specific character values will tend to 'just work', as long as those values really represent a character. The code also needs to take into account that bytes != characters now, so if you want to actually count how many characters are in a string, you need to be aware of this and avoid splitting a string in the middle of a code point. But a lot will still just work.

-- Richard Damon
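Richard's point can be shown in a few lines of C (a sketch, assuming well-formed UTF-8 input): counting code points only requires skipping continuation bytes.

    #include <stddef.h>

    /* Count code points in a UTF-8 buffer: every byte that is not of
       the form 10xxxxxx starts a new code point. */
    static size_t
    utf8_codepoints(const unsigned char *s, size_t nbytes)
    {
        size_t count = 0;
        for (size_t i = 0; i < nbytes; i++) {
            if ((s[i] & 0xC0) != 0x80)     /* not a continuation byte */
                count++;
        }
        return count;
    }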
Le mar. 30 juin 2020 à 13:53, M.-A. Lemburg <mal@egenix.com> a écrit :
I would prefer to analyze the list on a case by case basis. I don't think that it's useful to expose every single encoding supported by Python as a C function.
(...) This does not mean we have to give up the symmetry in the C API, or that the encoding APIs are now suddenly useless. It only means that we have to replace Py_UNICODE with one of the supported data for storing Unicode.
Let's agree to disagree :-) I don't think that completeness is a good rationale to design the C API. The C API is too large, we have to make it smaller.

A specialized function, like PyUnicode_AsUTF8String(), can be justified by different reasons:

* It is a very common use case and so it helps to write C extensions.
* It is significantly faster than the alternative generic function.

In C, you can execute arbitrary Python code by calling methods on Python str objects. For example, "abc".encode("utf-8", "surrogateescape") in Python becomes PyObject_CallMethod(obj, "encode", "ss", "utf-8", "surrogateescape") in C. Well, there is already the more generic PyUnicode_AsEncodedObject() function.

We must not add a C API function for every single Python feature, otherwise it would be too expensive to maintain, and it would become impossible for other Python implementations to implement the full C API. Well, even today, PyPy already only implements a small subset of the C API.
Since the C world has adopted wchar_t for this purpose, it's the natural choice.
In my experience, in C extensions there are two kinds of data:

* bytes is used as a "char*": an array of bytes
* Unicode is used as a Python object

For the very rare cases involving wchar_t*, PyUnicode_FromWideChar() can be used. I don't think that performance justifies duplicating each function, once for a Python str object and once for wchar_t*.

I mostly saw code involving wchar_t* to initialize Python. But this code was wrong, since it used PyUnicode functions *before* Python was initialized. That's bad and can now crash in recent Python versions. The new PEP 587 has a different design and avoids Python objects and anything related to the Python runtime: https://docs.python.org/dev/c-api/init_config.html#c.PyConfig_SetString

Moreover, CPython implements functions taking wchar_t* strings by calling PyUnicode_FromWideChar() internally...
PyUnicode_AsEncodedString() converts Unicode objects to a bytes object. This is not a symmetric replacement for the PyUnicode_Encode*() APIs, since those go from Py_UNICODE to a bytes object.
I don't see which feature is missing from PyUnicode_AsEncodedString(). If it's about parameters specific to some encodings like UTF-7, I already replied in another email.
Since the C API is not only meant to be used by the CPython interpreter, we should stick to standards rather than expecting the world to adapt to our implementations. This also makes the APIs future proof, e.g. in case we make another transition from the current hybrid internal data type for Unicode towards UTF-8 buffers as internal data type.
Do you know C extensions in the wild which are using wchar_t* on purpose? I haven't seen such a C extension yet. Victor -- Night gathers, and now my watch begins. It shall not end until my death.
On 30/06/2020 13:43, Emily Bowman wrote:
I completely agree with this: UTF-8 has become the One True Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside of the Win32 API. Nearly all basic emoji can't be represented in UCS-2 wchar_t, let alone composite emoji.
You say that as if it's a bad thing :-)
So how to make that C-compatible? Make everything a void* and it just comes back with as many bytes as it gets?
I'd be inclined to something like that. You really don't want people trying to roll their own UTF-8 handling if you can help it. That does imply the C API will need to be pretty comprehensive, though. (If you want nightmares, take a look at the parsing code in Expat. Multiple layers of macros and function tables make it a horror to comprehend.) -- Rhodri James *-* Kynesim Ltd
On 28.06.2020 16:24, Inada Naoki wrote:
Hi, Lemburg.
Thank you for quick response.
We can't just remove access to one half of a codec (the decoding part) without at least providing an alternative for C extensions to use.
Py_UNICODE can be removed from the API, but only if there are alternative APIs which C extensions can use to the same effect.
Given PEP 393, this would be APIs which use wchar_t instead of Py_UNICODE.
The decoding part is implemented as `const char *` -> `PyObject*` (Unicode object). I think this is reasonable, since `const char *` is a perfect abstraction for the encoded string.
In the case of the encoding part, `wchar_t *` is not a perfect abstraction for the (decoded) Unicode string.
Note that the PyUnicode_Encode*() APIs are meant to make the codec's encoding machinery available to C extensions, so that they don't have to implement this again. In that sense, their purpose is not to encode an existing Unicode object, but instead to provide access to the low-level buffer-to-bytes-object encoding. The reasoning here is the same as for decoding: you have the original data you want to process available in some array and want to turn this into the Python object.

The path Victor suggested requires always going via a Python Unicode object, but that is very expensive and not really an appropriate way to address the use case.

As an example application, think of a database module which provides the Unicode data as a Py_UNICODE buffer. You want to write this as UTF-8 data to a file or a socket, so you have the PyUnicode_EncodeUTF8() API encode this for you into a bytes object which you can then write out using the Python C APIs for this.
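Spelled out, that use case looks like this (illustrative only; it uses exactly the deprecated API under discussion):

    #include <Python.h>

    /* A database driver hands us a Py_UNICODE buffer; we want UTF-8
       bytes without building an intermediate str object first. */
    static PyObject *
    utf8_from_driver(const Py_UNICODE *buf, Py_ssize_t len)
    {
        return PyUnicode_EncodeUTF8(buf, len, "strict");  /* deprecated since 3.3 */
    }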
Converting from a Unicode object into `wchar_t *` is not zero-cost. I think `PyObject *` (Unicode object) -> `PyObject *` (bytes object) is a better signature than `wchar_t *` -> `PyObject *` (bytes object) for encoders.
See above. The motivation for these APIs is different. They are not about taking a Unicode object and converting it into bytes, they are deliberately about taking a data buffer as input and producing the Python bytes object as output (to implement symmetry between decoding and encoding).
* The Unicode object is more important than `wchar_t *` in Python.
Right, but as I tried to explain in my reply to Victor, I designed the Unicode API in Python to be a rich API, which provides all necessary tools to easily work with Unicode in C extensions as well as in the CPython interpreter. The API is not only focused on what the CPython interpreter needs. It's an API which implements a concise interface to Unicode as used in Python.
* All PyUnicode_EncodeXXX APIs are implemented with PyUnicode_FromWideChar.
For example, we have these private encode APIs:
* PyObject* _PyUnicode_AsASCIIString(PyObject *unicode, const char *errors)
* PyObject* _PyUnicode_AsLatin1String(PyObject *unicode, const char *errors)
* PyObject* _PyUnicode_AsUTF8String(PyObject *unicode, const char *errors)
* PyObject* _PyUnicode_EncodeUTF16(PyObject *unicode, const char *errors, int byteorder)
...
So how about making them public, instead of undeprecating the Py_UNICODE* encode APIs?
I'd be fine with keeping just a generic PyUnicode_Encode() API, but this should then encode from a buffer to a bytes object. The above all take Unicode objects as input and create the same problem as I described above, with the temporary Unicode object being created and all the associated malloc and scanning overhead needed for this.

The reason I mention wchar_t as the new basis for the PyUnicode_Encode() API is that wchar_t has grown to be accepted as the standard for Unicode buffers in C. If you don't believe that this is good enough, we could also force Py_UCS4, but this would alienate Windows extension writers.
1. Add PyUnicode_AsXXXBytes public APIs in Python 3.10. The current private APIs can become macros (e.g. #define _PyUnicode_AsASCIIString PyUnicode_AsASCIIBytes) or deprecated static inline functions.
2. Remove the Py_UNICODE* encode APIs in Python 3.12.
FWIW: I don't object to deprecating Py_UNICODE. I just don't want to lose the symmetry in decoding/encoding and add the cost of having to go via a Python Unicode object just to encode to bytes.

Thanks,
-- Marc-Andre Lemburg
On 30.06.2020 15:17, Victor Stinner wrote:
Le mar. 30 juin 2020 à 13:53, M.-A. Lemburg <mal@egenix.com> a écrit :
I would prefer to analyze the list on a case by case basis. I don't think that it's useful to expose every single encoding supported by Python as a C function.
(...) This does not mean we have to give up the symmetry in the C API, or that the encoding APIs are now suddenly useless. It only means that we have to replace Py_UNICODE with one of the supported data for storing Unicode.
Let's agree to disagree :-)
I don't think that completeness is a good rationale to design the C API.
Oh, if that's your opinion, then we definitely disagree :-) I strongly believe that the success of Python was in major parts built on the fact that Python does have a complete and easily usable C API. Without this, Python would never have convinced the "Python is slow" advocates that you can actually build fast applications in Python by using Python to orchestrate and integrate with low level C libraries, and we'd be regarded as yet another Tcl.
The C API is too large, we have to make it smaller.
That's a different discussion, but I disagree with that perspective as well: we have to refactor parts of the Python C API to make it more consistent and remove hacks which developers sometimes added as helper functions without considering the big-picture approach. The Unicode API has over the years grown a lot of such helpers and there's certainly room for improvement, but simply ripping out things is not always the right answer, esp. not when you touch the very core of the design.
A specialized function, like PyUnicode_AsUTF8String(), can be justified by different reasons:
* It is a very common use case and so it helps to write C extensions.
* It is significantly faster than the alternative generic function.
In C, you can execute arbitrary Python code by calling methods on Python str objects. For example, "abc".encode("utf-8", "surrogateescape") in Python becomes PyObject_CallMethod(obj, "encode", "ss", "utf-8", "surrogateescape") in C. Well, there is already the more generic PyUnicode_AsEncodedObject() function.
You know as well as I do that the Python call mechanism is by far the slowest part of the Python C API, so telling developers to use this as the main way to run tasks which can be run much faster, more easily and with less memory overhead or copying of data by directly calling a simple C API is not a good way to advocate for a useful Python C API.
We must not add a C API function for every single Python feature, otherwise it would be too expensive to maintain, and it would become impossible for other Python implementations to implement the full C API. Well, even today, PyPy already only implements a small subset of the C API.
I honestly don't think that other Python implementations should even try to implement the Python C API. Instead, they should build a bridge to use the CPython runtime and integrate this into their system.
Since the C world has adopted wchar_t for this purpose, it's the natural choice.
In my experience, in C extensions there are two kinds of data:
* bytes is used as a "char*": an array of bytes
* Unicode is used as a Python object
Uhm, what about all those applications, libraries and OS calls producing Unicode data? It is not always feasible or even desired to first convert this into a Python Unicode object.
For the very rare cases involving wchar_t*, PyUnicode_FromWideChar() can be used. I don't think that performance justifies duplicating each function, once for a Python str object and once for wchar_t*. I mostly saw code involving wchar_t* to initialize Python. But this code was wrong, since it used PyUnicode functions *before* Python was initialized. That's bad and can now crash in recent Python versions.
But that's an entirely unrelated issue, right? The C lib has full support for wchar_t and provides plenty of APIs for using it. The main() invocation is just one small part of the libc Unicode system.
The new PEP 587 has a different design and avoids Python objects and anything related to the Python runtime: https://docs.python.org/dev/c-api/init_config.html#c.PyConfig_SetString
Moreover, CPython implements functions taking wchar_t* string by calling PyUnicode_FromWideChar() internally...
I mentioned wchar_t as a buffer input replacement for the PyUnicode_Encode*() APIs, as an alternative to the deprecated Py_UNICODE. Of course, you can convert all wchar_t data into a Python Unicode object first and then apply operations on this, but the point of the encode APIs is to have low-level access to the Python codecs which works directly on a data buffer - not a Unicode object. Again, with the main intent to avoid unnecessary copying of data, scanning, preparing, etc. as is needed for PyUnicode_FromWideChar().
PyUnicode_AsEncodedString() converts Unicode objects to a bytes object. This is not a symmetric replacement for the PyUnicode_Encode*() APIs, since those go from Py_UNICODE to a bytes object.
I don't see which feature is missing from PyUnicode_AsEncodedString(). If it's about parameters specific to some encodings like UTF-7, I already replied in another email.
The symmetry is about buffer -> Python object. Decoding takes a byte stream data buffer and converts it into a Python Unicode object. Encoding takes a Unicode data buffer and converts it into a Python bytes object.

There's nothing missing in PyUnicode_AsEncodedString() (except perhaps for some extra encoding parameters), but it's not a proper replacement for the buffer -> Python object APIs I'm talking about.
Since the C API is not only meant to be used by the CPython interpreter, we should stick to standards rather than expecting the world to adapt to our implementations. This also makes the APIs future proof, e.g. in case we make another transition from the current hybrid internal data type for Unicode towards UTF-8 buffers as internal data type.
Do you know C extensions in the wild which are using wchar_t* on purpose? I haven't seen such a C extension yet.
Yes, of course. Any library which supports standards will have to deal with wchar_t, since it is the standard :-) Whether wchar_t and its representations on various platforms are a good choice is a different discussion (and one we had many many times in the past). The main reason for Python to adopt UCS4 was that the Linux glibc used it for wchar_t.

Cheers,
-- Marc-Andre Lemburg
On 7/1/2020 1:20 PM, M.-A. Lemburg wrote:

As an example application, think of a database module which provides the Unicode data as a Py_UNICODE buffer. You want to write this as UTF-8 data to a file or a socket, so you have the PyUnicode_EncodeUTF8() API encode this for you into a bytes object which you can then write out using the Python C APIs for this.

But based on Victor's survey of usages in Python extensions, which found few to no uses of these APIs, it would seem that hypothetical applications are insufficient to justify the continued provision and maintenance of these APIs. After all, Python extensions are written as a way to interface "other stuff" to Python, and converting data to/from Python objects seems far more likely than converting data from one non-Python format to a different non-Python format. Not that such applications couldn't be written as Python extensions, but ... are they? ... and why? A rich interface is nice, but an unused interface is a burden.
On Thu, Jul 2, 2020 at 5:20 AM M.-A. Lemburg <mal@egenix.com> wrote:
The reasoning here is the same as for decoding: you have the original data you want to process available in some array and want to turn this into the Python object.
The path Victor suggested requires always going via a Python Unicode object, but that is very expensive and not really an appropriate way to address the use case.
But the current PyUnicode_Encode* APIs call `PyUnicode_FromWideChar` internally. They are no longer direct APIs. Additionally, pyodbc, the only user of the encoder APIs, did:

PyUnicode_EncodeUTF16(PyUnicode_AsUnicode(unicode), ...)

This is very inefficient: Unicode object -> Py_UNICODE* -> Unicode object -> bytes object. And as many others have already said, most of the C world uses UTF-8 for Unicode representation in C, not wchar_t. So I don't want to undeprecate the current API.
As an example application, think of a database module which provides the Unicode data as a Py_UNICODE buffer.
Py_UNICODE is deprecated. So I assume you are talking about wchar_t.
You want to write this as UTF-8 data to a file or a socket, so you have the PyUnicode_EncodeUTF8() API encode this for you into a bytes object which you can then write out using the Python C APIs for this.
PyUnicode_FromWideChar + PyUnicode_AsUTF8AndSize is better than PyUnicode_EncodeUTF8. PyUnicode_EncodeUTF8 allocates a temporary Unicode object anyway, so it needs to allocate a Unicode object *and* a char* buffer for the UTF-8. On the other hand, PyUnicode_AsUTF8AndSize can just expose the internal data when it is plain ASCII. Since ASCII strings are very common, this is an effective optimization.

Regards,
-- Inada Naoki <songofacandy@gmail.com>
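A sketch of that recommended path (the fd-writing part is illustrative):

    #include <Python.h>
    #include <unistd.h>

    /* wchar_t* -> str object -> internal UTF-8 buffer. For ASCII data,
       PyUnicode_AsUTF8AndSize can return the object's internal
       representation without an extra encoding pass. */
    static Py_ssize_t
    write_utf8(int fd, const wchar_t *buf, Py_ssize_t len)
    {
        PyObject *text = PyUnicode_FromWideChar(buf, len);
        if (text == NULL)
            return -1;
        Py_ssize_t size;
        const char *utf8 = PyUnicode_AsUTF8AndSize(text, &size);
        if (utf8 == NULL) {
            Py_DECREF(text);
            return -1;
        }
        Py_ssize_t written = write(fd, utf8, (size_t)size);
        Py_DECREF(text);   /* the utf8 pointer dies with the object */
        return written;
    }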
On 29 Jun 2020, at 10:57, Victor Stinner <vstinner@python.org> wrote:
I would prefer to only have a fast-path for the most common encodings: ASCII, Latin1, UTF-8, Windows ANSI code page. That's all.
It's not obvious to me why the latin1 encoding is in this list, as it's just one of the many 8-bit charsets. Why is it needed? Barry
Latin-1 is the encoding that maps every byte (0-255) to the Unicode character with the same number. So it's special in that sense, and it gets used when mapping 8-bit bytes via Unicode "without encoding". Excuse my imprecise language here; I don't know the correct Unicode terms without going & looking them up.

Paul

On Thu, 2 Jul 2020 at 13:48, Barry Scott <barry@barrys-emacs.org> wrote:
On 29 Jun 2020, at 10:57, Victor Stinner <vstinner@python.org> wrote:
I would prefer to only have a fast-path for the most common encodings: ASCII, Latin1, UTF-8, Windows ANSI code page. That's all.
It's not obvious to me why the latin1 encoding is in this list, as it's just one of the many 8-bit charsets. Why is it needed?
Barry
On 30 Jun 2020, at 13:43, Emily Bowman <silverbacknet@gmail.com> wrote:
I completely agree with this: UTF-8 has become the One True Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside of the Win32 API. Nearly all basic emoji can't be represented in UCS-2 wchar_t, let alone composite emoji.
I use UCS-32 in my extensions, but never persist UCS-32 for which I use UTF-8. If you are calling WIN32 "unicode" APIs then you need UCS-16. My plan with PyCXX is to replace Py_UNICODE with UCS-32. I think all the UCS-32 APIs will still be present. Once I add that support to PyCXX all my users should easily port to a non-Py_UNICODE world. Barry
Le jeu. 2 juil. 2020 à 14:44, Barry Scott <barry@barrys-emacs.org> a écrit :
It's not obvious to me why the latin1 encoding is in this list as its just one of all the 8-bit char sets. Why is it needed?
The Latin-1 (ISO 8859-1) charset is kind of special: it maps bytes 0x00-0xFF to Unicode characters U+0000-U+00FF and decoding from latin1 cannot fail. It was commonly used as the locale encoding in Europe 10 years ago, but nowadays most Linux distributions use UTF-8 as the locale encoding. I'm also fine with restricting the list to 3 encodings: ASCII, UTF-8 and Windows ANSI code page. Victor -- Night gathers, and now my watch begins. It shall not end until my death.
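That property is easy to see in code (a minimal sketch): any byte buffer decodes successfully.

    #include <Python.h>

    /* Latin-1 decoding cannot hit invalid input, since every byte
       value 0x00-0xFF maps to U+0000-U+00FF; only memory allocation
       can fail here. */
    static PyObject *
    bytes_to_str(const char *buf, Py_ssize_t len)
    {
        return PyUnicode_DecodeLatin1(buf, len, NULL);
    }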
UCS-2 means units of 16 bits, so it's limited to the Unicode BMP: U+0000-U+FFFF. UCS-4 means units of 32 bits and so gives access to the whole (current) Unicode character set. Do you mean UTF-16 and UTF-32? UTF-16 supports the whole Unicode character set but uses the annoying surrogate pairs for characters outside the BMP. UTF-32 is UCS-4 in practice.

Victor

Le jeu. 2 juil. 2020 à 15:08, Barry Scott <barry@barrys-emacs.org> a écrit :
On 30 Jun 2020, at 13:43, Emily Bowman <silverbacknet@gmail.com> wrote:
I completely agree with this, that UTF-8 has become the One True Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside of the Win32 API. Nearly all basic emoji can't be represented in UCS-2 wchar_t, let alone composite emoji.
I use UCS-32 in my extensions, but never persist UCS-32 for which I use UTF-8.
If you are calling WIN32 "unicode" APIs then you need UCS-16.
My plan with PyCXX is to replace Py_UNICODE with UCS-32. I think all the UCS-32 APIs will still be present.
Once I add that support to PyCXX all my users should easily port to a non-Py_UNICODE world.
Barry
-- Night gathers, and now my watch begins. It shall not end until my death.
On 2020-07-02 14:57, Victor Stinner wrote:
Le jeu. 2 juil. 2020 à 14:44, Barry Scott <barry@barrys-emacs.org> a écrit :
It's not obvious to me why the latin1 encoding is in this list as its just one of all the 8-bit char sets. Why is it needed?
The Latin-1 (ISO 8859-1) charset is kind of special: it maps bytes 0x00-0xFF to Unicode characters U+0000-U+00FF and decoding from latin1 cannot fail.
This apparently makes it useful for not-quite-text, not-quite-bytes protocols like HTTP. In particular, WSGI (PEP 3333) uses latin-1 for headers.
It was commonly used as the locale encoding in Europe 10 years ago, but nowadays most Linux distributions use UTF-8 as the locale encoding.
I'm also fine with restricting the list to 3 encodings: ASCII, UTF-8 and Windows ANSI code page.
On 7/2/20 10:19 AM, Victor Stinner wrote:
Do you mean UTF-16 and UTF-32? UTF-16 supports the whole Unicode character set but uses the annoying surrogate pairs for characters outside the BMP.
A minor quibble: UTF-16 handles all of the CURRENTLY defined Unicode set, and there is currently a promise not to extend Unicode past that, but at some point that promise may need to be broken. UTF-8, as previously defined (and as it could be again), easily handles U+00000000 to U+7FFFFFFF. UTF-16 can handle, via the surrogate pairs, U+00000000 to U+0010FFFF and stops there. To extend past that would require some form of heroics, which is the reason that U+0010FFFF is currently defined as the highest possible code point: allowing a higher value breaks UTF-16, and there currently isn't a desire to do so. At some point in the distant future, we may run out of 'valid' code points, and this promise will need to be broken.

UTF-16 grew out of a need to fix what has become UCS-2, which is the encoding used for earlier Unicode standards, before the need for code points above U+0000FFFF (now the BMP) was seen.

-- Richard Damon
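The surrogate-pair arithmetic behind that ceiling, as a short sketch:

    #include <stdint.h>

    /* Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate
       pair; 20 payload bits (2 x 10) is exactly why U+10FFFF is the
       highest code point UTF-16 can reach. */
    static void
    to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
    {
        cp -= 0x10000;                           /* 20-bit payload */
        *hi = (uint16_t)(0xD800 + (cp >> 10));   /* lead surrogate */
        *lo = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* trail surrogate */
    }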
participants (10)

- Barry Scott
- Emily Bowman
- Glenn Linderman
- Inada Naoki
- M.-A. Lemburg
- Paul Moore
- Petr Viktorin
- Rhodri James
- Richard Damon
- Victor Stinner