Victor Stinner wrote on 07.09.2018 at 11:57:
> On 07.09.2018 10:22, Victor Stinner wrote:
>> I'm in discussion with PyPy developers, and they reported different APIs which cause them trouble: (...)
>> - almost all PyUnicode API functions have to go, according to them.
>> PyPy3 uses UTF-8 internally.
Well, it's not like anyone asked them to do that. ;)
> Python 3.7 still has to support both the legacy Py_UNICODE* API and the new "compact string" API. It makes the CPython code base way more complex than it should be: any function accepting a string is supposed to call PyUnicode_READY() and handle errors properly. I would prefer to be able to remove the legacy PyUnicodeObject type, to only use compact strings everywhere.
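For context, the pattern in question looks roughly like this in a C extension (a simplified sketch, with error handling kept to the minimum):

    /* Any function that accepts a str must "ready" it first, because the
       object may still carry only legacy Py_UNICODE (wchar_t) data. */
    static Py_ssize_t
    get_length(PyObject *obj)
    {
        if (PyUnicode_READY(obj) < 0)       /* may allocate and can fail */
            return -1;
        return PyUnicode_GET_LENGTH(obj);   /* only valid once ready */
    }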
Since we control all paths that can lead to the creation of a Unicode object, why not make sure we always call the internal equivalent of PyUnicode_READY() before even returning the object to the user, and replace the public PyUnicode_READY() macro with a no-op?
It would probably slow things down on Windows (where Py_UNICODE is still a thing) in certain scenarios, specifically when passing Unicode strings into and out of the Windows API, possibly including when printing them. But then, how performance-critical are these scenarios, really? Large data sets would almost always come from some kind of byte-encoded device, be it a local file or a network connection, or leave the system as a byte-encoded sequence.
So, I think this is a point where we can reconsider the original design.
We could also keep the current API on Windows and only make PyUnicode_READY() a constant no-op on unixy systems, although that would obviously mean that people still have to call it (in order to support Windows) and we can't just remove it at some point. And it would be more difficult to detect bugs when users forget to call it.
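Concretely, that split could look something like this (an illustrative sketch reusing the existing CPython names, not a worked-out proposal):

    #ifdef MS_WINDOWS
    /* keep the current behaviour: convert legacy data on demand */
    #define PyUnicode_READY(op) \
        (PyUnicode_IS_READY(op) ? 0 : _PyUnicode_Ready((PyObject *)(op)))
    #else
    /* strings are always created ready, so there is nothing to do */
    #define PyUnicode_READY(op)  (0)
    #endif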
> Let me elaborate on which PyUnicode functions are good and which are bad.
> Examples of bad APIs:
> - PyUnicode_IS_COMPACT(): this API really relies on the *current* CPython implementation.
Is that function really used?
As a quick and incomplete indication, GitHub seems to find only forks of CPython and cpyext that contain it, at least from a quick skim through the tons of identical results.
That should be easy to emulate with a different internal representation, though. Just use an arbitrary value for "kind" and ignore it on the way in.
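Something like this, say (hypothetical stubs, assuming an implementation with a single internal representation):

    /* Report every string as "compact" and return a fixed kind;
       callers that pass the kind back in simply get it ignored. */
    #define PyUnicode_IS_COMPACT(op)  1
    #define PyUnicode_KIND(op)        PyUnicode_1BYTE_KIND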
That reminds me that the current C-API lacks a way to calculate the size of the character buffer in bytes. You have to use "length * kind", which really relies on internals then. There should be a macro for this.
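Something along these lines, say (the name is made up, such a macro does not exist today):

    /* Hypothetical: size in bytes of a ready string's character buffer,
       relying only on the documented meaning of PyUnicode_KIND()
       (the per-character width in bytes). */
    #define PyUnicode_BUFFER_SIZE(op) \
        ((size_t)PyUnicode_GET_LENGTH(op) * (size_t)PyUnicode_KIND(op))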
> - Py_UNICODE_strcmp(): uses Py_UNICODE, which is an implementation detail
But given that "Py_UNICODE" is already deprecated … :)
> - PyUnicode_Concat(): C API for str + str
> - PyUnicode_IS_ASCII(op): it's an O(1) operation on CPython, but it can be O(n) on other implementations (like PyPy, which uses UTF-8). But we also added str.isascii() in Python 3.7 ...
It's not like anyone asked PyPy to use UTF-8 internally. ;) But even then, keeping a set of string property flags around really can't be that expensive, and might easily pay off in encoder implementations, e.g. by knowing the target buffer size upfront. ASCII-only strings are still exceedingly common.
I was going to write "is anyone actually using that, given how expensive it is?", but then realised that we're using it in Cython to implement Unicode string indexing. :) That's probably also its only use case.
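For the curious, that use boils down to something like this (a stripped-down sketch, not the actual generated code):

    /* The ASCII check enables a direct-access fast path for indexing;
       bounds checks and error handling omitted. */
    static Py_UCS4
    char_at(PyObject *u, Py_ssize_t i)
    {
        if (PyUnicode_IS_ASCII(u))   /* ASCII implies 1 byte per char */
            return PyUnicode_1BYTE_DATA(u)[i];
        return PyUnicode_READ(PyUnicode_KIND(u), PyUnicode_DATA(u), i);
    }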
> - PyUnicode_CompareWithASCIIString(): the function name announces ASCII but decodes the byte string from Latin-1 :-)
Which makes total sense, because, why wouldn't it? But the name is obviously … ill-chosen.
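To illustrate the mismatch (a sketch based on the behaviour described above; error handling omitted):

    /* Despite the "ASCII" in the name, a non-ASCII byte in the char*
       argument is interpreted as a Latin-1 code point, so this
       compares equal: */
    PyObject *u = PyUnicode_FromOrdinal(0xE9);   /* U+00E9, 'é' */
    int equal = (PyUnicode_CompareWithASCIIString(u, "\xe9") == 0);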