[capi-sig] PyUnicode C API
On 07.09.2018 10:22, Victor Stinner wrote:
I'm in discussion with PyPy developers, and they reported different APIs which cause them troubles: (...)
- almost all PyUnicode API functions have to go according to them. PyPy3 uses UTF-8 internally, CPython uses "compact string" (array of Py_UCS1, Py_UCS2 or Py_UCS4 depending on the string content). https://pythoncapi.readthedocs.io/bad_api.html#pypy-requests
On Fri, Sep 7, 2018 at 10:33, M.-A. Lemburg <mal@egenix.com> wrote:
I'm -1 on removing the PyUnicode APIs. We deliberately created a useful and very complete C API for Unicode.
The fact that PyPy chose to use a different internal representation is not a good reason to remove APIs and have CPython extensions take the hit as a result. It would be better for PyPy to rethink the internal representation, or to create a shim API which translates between the two worlds.
Note that UTF-8 is not a good internal representation for Unicode if you want fast indexing and slicing. This is why we are using fixed code units to represent the Unicode strings.
The PyUnicode C API is not only an issue for PyPy, it's also an issue for CPython. When PEP 393 was implemented, suddenly most of the PyUnicode API became deprecated: all functions using the now-legacy Py_UNICODE* type...
Python 3.7 still has to support both the legacy Py_UNICODE* API and the new "compact string" API. It makes the CPython code base way more complex than it should be: any function accepting a string is supposed to call PyUnicode_Ready() and handle errors properly. I would prefer to be able to remove the legacy PyUnicodeObject type, and only use compact strings everywhere.
Let me elaborate what are good and bad functions for PyUnicode.
Example of bad APIs:
- PyUnicode_IS_COMPACT(): this API really relies on the *current* implementation
- PyUnicode_2BYTE_DATA(): should only be used internally, there is no need to export it
- PyUnicode_READ()
- Py_UNICODE_strcmp(): uses Py_UNICODE, which is an implementation detail
Good API:
- PyUnicode_Concat(): C API for str + str
- PyUnicode_Split()
- PyUnicode_FindChar()
Border line:
- PyUnicode_IS_ASCII(op): it's an O(1) operation on CPython, but it can be O(n) on other implementations (like PyPy, which uses UTF-8). But we also added str.isascii() in Python 3.7...
- PyUnicode_READ_CHAR()
- PyUnicode_CompareWithASCIIString(): the function name announces ASCII but decodes the byte string from Latin1 :-)
Victor
On 07.09.2018 11:57, Victor Stinner wrote:
On 07.09.2018 10:22, Victor Stinner wrote:
I'm in discussion with PyPy developers, and they reported different APIs which cause them troubles: (...)
- almost all PyUnicode API functions have to go according to them. PyPy3 uses UTF-8 internally
Well, it's not like anyone asked them to do that. ;)
Python 3.7 still has to support both the legacy Py_UNICODE* API and the new "compact string" API. It makes the CPython code base way more complex than it should be: any function accepting a string is supposed to call PyUnicode_Ready() and handle errors properly. I would prefer to be able to remove the legacy PyUnicodeObject type, and only use compact strings everywhere.
Since we control all paths that can lead to the creation of a Unicode object, why not make sure we always call the internal equivalent to PyType_Ready() before even returning it to the user, and replace the public PyType_Ready() function by a no-op macro?
It would probably slow things down on Windows (where Py_UNICODE is still a thing) in certain scenarios, specifically when passing Unicode strings in and out of the Windows API, possibly including printing them. But then, how performance critical are these scenarios, really? Large data sets would almost always come from some kind of byte encoded device, be it a local file or a network connection, or leave the system as a byte encoded sequence.
So, I think this is a point where we can reconsider the original design.
We could also keep the current API on Windows and only use a constant PyType_Ready() on unixy systems, although that would obviously mean that people still have to call it (in order to support Windows) and we can't just remove it at some point. And it would be more difficult to detect bugs when users forget to call it.
Let me elaborate what are good and bad functions for PyUnicode.
Example of bad APIs:
- PyUnicode_IS_COMPACT(): this API really relies on the *current* implementation
Is that function really used?
As a quick and incomplete indication, github seems to only find forks of CPython and cpyext that contain it, at least from a quick jump through the tons of identical results.
https://github.com/search?q=pyunicode_is_compact&type=Code
- PyUnicode_READ()
That should be easy to emulate with a different internal representation, though. Just use an arbitrary value for "kind" and ignore it on the way in.
That reminds me that the current C-API lacks a way to calculate the size of the character buffer in bytes. You have to use "length * kind", which really relies on internals then. There should be a macro for this.
- Py_UNICODE_strcmp(): uses Py_UNICODE, which is an implementation detail
But given that "Py_UNICODE" is already deprecated … :)
Good API:
- PyUnicode_Concat(): C API for str + str
- PyUnicode_Split()
- PyUnicode_FindChar()
Border line:
- PyUnicode_IS_ASCII(op): it's an O(1) operation on CPython, but it can be O(n) on other implementations (like PyPy, which uses UTF-8). But we also added str.isascii() in Python 3.7...
It's not like anyone asked PyPy to use UTF-8 internally. ;) But even then, keeping a set of string property flags around really can't be that expensive, and might easily pay off in encoder implementations, e.g. by knowing the target buffer size upfront. ASCII-only strings are still excessively common.
- PyUnicode_READ_CHAR()
I was going to write "is anyone actually using that, given how expensive it is?", but then realised that we're using it in Cython to implement Unicode string indexing. :) That's probably also its only use case.
- PyUnicode_CompareWithASCIIString(): the function name announces ASCII but decodes the byte string from Latin1 :-)
Which makes total sense, because, why wouldn't it? But the name is obviously … underadapted.
Stefan
On Fri, Sep 7, 2018 at 14:25, Stefan Behnel <python_capi@behnel.de> wrote:
PyPy3 uses UTF-8 internally
Well, it's not like anyone asked them to do that. ;)
Please don't be so focused on PyPy. There are other implementations of Python like MicroPython, IronPython, RustPython, etc. Why should everybody mimic the *exact* implementation of CPython? What's the point of having a different implementation of Python if you *must* adopt exactly the same design choices as CPython?
I repeat, the C API is already a threat to *CPython*. (See what I wrote about all the Py_UNICODE* APIs.)
Python 3.7 still has to support both the legacy Py_UNICODE* API and the new "compact string" API. It makes the CPython code base way more complex than it should be: any function accepting a string is supposed to call PyUnicode_Ready() and handle errors properly. I would prefer to be able to remove the legacy PyUnicodeObject type, and only use compact strings everywhere.
Since we control all paths that can lead to the creation of a Unicode object, why not make sure we always call the internal equivalent to PyType_Ready() before even returning it to the user, and replace the public PyType_Ready() function by a no-op macro?
It's PyUnicode_Ready(), not PyType_Ready() :-)
No, we don't control anything: PyUnicode_FromStringAndSize(NULL, size) remains accessible in the C API and I'm sure that *many* C extensions use it. This function creates a string using Py_UNICODE* internally. There is also PyUnicode_FromUnicode() and many other functions. We only started to deprecate them in the documentation recently. They are only deprecated in the documentation; Py_DEPRECATED() is not used in the C headers yet:
PyAPI_FUNC(PyObject*) PyUnicode_FromUnicode(
    const Py_UNICODE *u,   /* Unicode buffer */
    Py_ssize_t size        /* size of buffer */
    ) /* Py_DEPRECATED(3.3) */;
That's why we must keep PyUnicode_Ready() in almost all functions accepting strings...
- PyUnicode_IS_COMPACT(): this API really relies on the *current* implementation
Is that function really used?
Used or not, it's currently part of the C API. I propose to kick it out of the public exported API, and really make it private. For example, just add a _ prefix: _PyUnicode_IS_COMPACT().
This function is heavily used internally in unicodeobject.c. I'm not sure why anyone would use it outside Python :-)
It's just one example of public API which should not be public.
- PyUnicode_READ()
That should be easy to emulate with a different internal representation, though. Just use an arbitrary value for "kind" and ignore it on the way in.
This function doesn't make sense on Python 3.2 and older, on PyPy, etc. It should just be made private.
That reminds me that the current C-API lacks a way to calculate the size of the character buffer in bytes. You have to use "length * kind", which really relies on internals then. There should be a macro for this.
I don't think that it's a good idea to provide such a function, since the size of one character depends on the size of the other characters in a string when using compact strings...
- Py_UNICODE_strcmp(): uses Py_UNICODE, which is an implementation detail
But given that "Py_UNICODE" is already deprecated … :)
Well, "deprecated"...
Victor
On 07.09.2018 14:39, Victor Stinner wrote:
On Fri, Sep 7, 2018 at 14:25, Stefan Behnel wrote:
That reminds me that the current C-API lacks a way to calculate the size of the character buffer in bytes. You have to use "length * kind", which really relies on internals then. There should be a macro for this.
I don't think that it's a good idea to provide such a function, since the size of one character depends on the size of the other characters in a string when using compact strings...
Yes, exactly. Users shouldn't have to compute the data size themselves. There should be a macro that returns it for a given Unicode string.
Stefan
On 07/09/2018 14:39, Victor Stinner wrote:
No, we don't control anything: PyUnicode_FromStringAndSize(NULL, size) remains accessible in the C API and I'm sure that *many* C extensions use it.
Why is PyUnicode_FromStringAndSize a problem?
Regards
Antoine.
On Fri, Sep 7, 2018 at 18:31, Antoine Pitrou <antoine@python.org> wrote:
Why is PyUnicode_FromStringAndSize a problem?
Well, it's not a problem: it works and is supported :-) But this function creates a string in the legacy Py_UNICODE* format. Later, PyUnicode_Ready() must be called to convert it to a compact string.
It's just an example of technical debt and legacy API.
Victor
On 07/09/2018 19:05, Victor Stinner wrote:
On Fri, Sep 7, 2018 at 18:31, Antoine Pitrou <antoine@python.org> wrote:
Why is PyUnicode_FromStringAndSize a problem?
Well, it's not a problem: it works and is supported :-) But this function creates a string in the legacy Py_UNICODE* format.
Why can't we simply change the implementation?
It's just an example of technical debt and legacy API.
The documentation for PyUnicode_FromStringAndSize() says:
""" Create a Unicode object from the char buffer u. The bytes will be interpreted as being UTF-8 encoded. The buffer is copied into the new object. If the buffer is not NULL, the return value might be a shared object, i.e. modification of the data is not allowed.
If u is NULL, this function behaves like PyUnicode_FromUnicode() with the buffer set to NULL. This usage is deprecated in favor of PyUnicode_New(). """
So there is a deprecated usage which can just be turned into an error. Otherwise I don't see the problem. PyUnicode_FromStringAndSize() exposes a reasonable feature and is necessary for many applications.
Regards
Antoine.
I was talking about the deprecated usage, with u=NULL ;-)
Sure, we can raise an error later.
Victor
On Fri, Sep 7, 2018 at 19:09, Antoine Pitrou <antoine@python.org> wrote:
On 07/09/2018 19:05, Victor Stinner wrote:
On Fri, Sep 7, 2018 at 18:31, Antoine Pitrou <antoine@python.org> wrote:
Why is PyUnicode_FromStringAndSize a problem?
Well, it's not a problem: it works and is supported :-) But this function creates a string in the legacy Py_UNICODE* format.
Why can't we simply change the implementation?
It's just an example of technical debt and legacy API.
The documentation for PyUnicode_FromStringAndSize() says:
""" Create a Unicode object from the char buffer u. The bytes will be interpreted as being UTF-8 encoded. The buffer is copied into the new object. If the buffer is not NULL, the return value might be a shared object, i.e. modification of the data is not allowed.
If u is NULL, this function behaves like PyUnicode_FromUnicode() with the buffer set to NULL. This usage is deprecated in favor of PyUnicode_New(). """
So there is a deprecated usage which can just be turned into an error. Otherwise I don't see the problem. PyUnicode_FromStringAndSize() exposes a reasonable feature and is necessary for many applications.
Regards
Antoine.
I'm all for forbidding all those deprecated PyUnicode APIs.
On 07/09/2018 19:11, Victor Stinner wrote:
I was talking about the deprecated usage, with u=NULL ;-)
Sure, we can raise an error later.
Victor
participants (3)
- Antoine Pitrou
- Stefan Behnel
- Victor Stinner