[capi-sig]PyUnicode C API

On 07.09.2018 10:22, Victor Stinner wrote:
Le ven. 7 sept. 2018 à 10:33, M.-A. Lemburg <mal@egenix.com> a écrit :
The PyUnicode C API is not only an issue for PyPy, it's also an issue for CPython. When PEP 393 was implemented, suddenly most of the PyUnicode API was effectively deprecated: all functions using the now-legacy Py_UNICODE* type...
Python 3.7 still has to support both the legacy Py_UNICODE* API and the new "compact string" API. It makes the CPython code base way more complex than it should be: any function accepting a string is supposed to call PyUnicode_Ready() and handle errors properly. I would prefer to be able to remove the legacy PyUnicodeObject type, to only use compact strings everywhere.
Let me elaborate what are good and bad functions for PyUnicode.
Example of bad APIs:
- PyUnicode_IS_COMPACT(): this API really relies on the *current* implementation
- PyUnicode_2BYTE_DATA(): should only be used internally, there is no need to export it
- PyUnicode_READ()
- Py_UNICODE_strcmp(): uses Py_UNICODE, which is an implementation detail
Good API:
- PyUnicode_Concat(): C API for str + str
- PyUnicode_Split()
- PyUnicode_FindChar()
Border line:
- PyUnicode_IS_ASCII(op): it's an O(1) operation on CPython, but it can be O(n) on other implementations (like PyPy, which uses UTF-8). But we also added str.isascii() in Python 3.7...
- PyUnicode_READ_CHAR()
- PyUnicode_CompareWithASCIIString(): the function name announces ASCII but decodes the byte string from Latin1 :-)
Victor

Victor Stinner schrieb am 07.09.2018 um 11:57:
Well, it's not like anyone asked them to do that. ;)
Since we control all paths that can lead to the creation of a Unicode object, why not make sure we always call the internal equivalent to PyType_Ready() before even returning it to the user, and replace the public PyType_Ready() function by a no-op macro?
It would probably slow things down on Windows (where Py_UNICODE is still a thing) in certain scenarios, specifically, when passing Unicode strings in and out of the Windows API, possibly including to print them. But then, how performance critical are these scenarios, really? Large data sets would almost always come from some kind of byte encoded device, be it a local file or a network connection, or leave the system as a byte encoded sequence.
So, I think this is a point where we can reconsider the original design.
We could also keep the current API on Windows and only use a constant PyType_Ready() on unixy systems, although that would obviously mean that people still have to call it (in order to support Windows) and we can't just remove it at some point. And it would be more difficult to detect bugs when users forget to call it.
Is that function really used?
As a quick and incomplete indication, github seems to only find forks of CPython and cpyext that contain it, at least from a quick jump through the tons of identical results.
https://github.com/search?q=pyunicode_is_compact&type=Code
That should be easy to emulate with a different internal representation, though. Just use an arbitrary value for "kind" and ignore it on the way in.
That reminds me that the current C-API lacks a way to calculate the size of the character buffer in bytes. You have to use "length * kind", which really relies on internals then. There should be a macro for this.
But given that "Py_UNICODE" is already deprecated … :)
It's not like anyone asked PyPy to use UTF-8 internally. ;) But even then, keeping a set of string property flags around really can't be that expensive, and might easily pay off in encoder implementations, e.g. by knowing the target buffer size upfront. ASCII-only strings are still excessively common.
I was going to write "is anyone actually using that, given how expensive it is?", but then realised that we're using it in Cython to implement Unicode string indexing. :) That's probably also its only use case.
Which makes total sense, because, why wouldn't it? But the name is obviously … underadapted.
Stefan

Le ven. 7 sept. 2018 à 14:25, Stefan Behnel <python_capi@behnel.de> a écrit :
PyPy3 uses UTF-8 internally
Well, it's not like anyone asked them to do that. ;)
Please don't be so focused on PyPy. There are other implementations of Python like MicroPython, IronPython, RustPython, etc. Why should everybody mimic the *exact* implementation of CPython? What's the point of having a different implementation of Python if you *must* adopt exactly the same design choices as CPython?
I repeat, the C API is already a threat to *CPython*. (See what I wrote about all Py_UNICODE* APIs.)
It's PyUnicode_Ready(), not PyType_Ready() :-)
No, we don't control anything: PyUnicode_FromStringAndSize(NULL, size) remains accessible in the C API and I'm sure that *many* C extensions use it. This function creates a string using Py_UNICODE* internally. There is also PyUnicode_FromUnicode() and many other functions. We only started to deprecate them in the documentation recently. They are only deprecated in the documentation, Py_DEPRECATED() is not used in C headers yet:
PyAPI_FUNC(PyObject*) PyUnicode_FromUnicode(
    const Py_UNICODE *u,     /* Unicode buffer */
    Py_ssize_t size          /* size of buffer */
    ) /* Py_DEPRECATED(3.3) */;
That's why we must keep PyUnicode_Ready() in almost all functions accepting strings...
Used or not, it's currently part of the C API. I propose to kick it out of the public exported API, and really make it private. For example, just add a _ prefix: _PyUnicode_IS_COMPACT().
This function is heavily used internally in unicodeobject.c. I'm not sure why anyone would use it outside Python :-)
It's just one example of public API which should not be public.
This function doesn't make sense on Python 3.2 and older, on PyPy, etc. It should just be made private.
I don't think that it's a good idea to provide such a function, since the size of one character depends on the size of other characters in a string when using compact strings...
Well, "deprecated"...
Victor

Le ven. 7 sept. 2018 à 18:31, Antoine Pitrou <antoine@python.org> a écrit :
Why is PyUnicode_FromStringAndSize a problem?
Well, it's not a problem: it works and is supported :-) But this function creates a string in the legacy Py_UNICODE* format. Later, PyUnicode_Ready() must be called to convert it to a compact string.
It's just an example of technical debt and legacy API.
Victor

Le 07/09/2018 à 19:05, Victor Stinner a écrit :
Why can't we simply change the implementation?
It's just an example of technical debt and legacy API.
The documentation for PyUnicode_FromStringAndSize() says:
""" Create a Unicode object from the char buffer u. The bytes will be interpreted as being UTF-8 encoded. The buffer is copied into the new object. If the buffer is not NULL, the return value might be a shared object, i.e. modification of the data is not allowed.
If u is NULL, this function behaves like PyUnicode_FromUnicode() with the buffer set to NULL. This usage is deprecated in favor of PyUnicode_New(). """
So there is a deprecated usage which can just be turned into an error. Otherwise I don't see the problem. PyUnicode_FromStringAndSize() exposes a reasonable feature and is necessary for many applications.
Regards
Antoine.

participants (3)
- Antoine Pitrou
- Stefan Behnel
- Victor Stinner