[capi-sig]Re: PyUnicode C API

7 Sep 2018 · *current*

      Victor Stinner wrote on 07.09.2018 at 11:57:
...
On 07.09.2018 10:22, Victor Stinner wrote:
...
...
I'm in discussion with PyPy developers, and they reported different
APIs which cause them trouble:
(...)

According to them, almost all PyUnicode API functions have to go.
PyPy3 uses UTF-8 internally.

Well, it's not like anyone asked them to do that. ;)
...
Python 3.7 still has to support both the legacy Py_UNICODE* API and
the new "compact string" API. It makes the CPython code base way more
complex than it should be: any function accepting a string is supposed
to call PyUnicode_Ready() and handle errors properly. I would prefer to
be able to remove the legacy PyUnicodeObject type, to only use compact
strings everywhere.
Since we control all paths that can lead to the creation of a Unicode
object, why not make sure we always call the internal equivalent to
PyUnicode_Ready() before even returning it to the user, and replace the
public PyUnicode_Ready() function by a no-op macro?
It would probably slow things down on Windows (where Py_UNICODE is still a
thing) in certain scenarios, specifically, when passing Unicode strings in
and out of the Windows API, possibly including to print them. But then, how
performance critical are these scenarios, really? Large data sets would
almost always come from some kind of byte encoded device, be it a local
file or a network connection, or leave the system as a byte encoded sequence.
So, I think this is a point where we can reconsider the original design.
We could also keep the current API on Windows and only use a constant
PyUnicode_Ready() on unixy systems, although that would obviously mean that
people still have to call it (in order to support Windows) and we can't
just remove it at some point. And it would be more difficult to detect bugs
when users forget to call it.
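A minimal sketch of the two options above, written without Python.h so it stands alone. Everything prefixed demo_ or DEMO_ is a hypothetical stand-in and not part of the CPython API; the struct is not CPython's real string layout. With DEMO_LEGACY_STRINGS undefined the macro is constant success; with it defined the macro still has to check, which is why callers could never drop the call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a string object; not CPython's real layout. */
typedef struct {
    int ready;   /* 1 once the canonical representation exists */
} demo_str;

/* Hypothetical slow path that converts a legacy string in place. */
static int demo_ready_slow(demo_str *s)
{
    assert(s != NULL);
    s->ready = 1;
    return 0;    /* 0 on success, mirroring PyUnicode_Ready() */
}

/* Creation path: the string is readied before the caller ever sees it. */
static demo_str demo_new(void)
{
    demo_str s = { 0 };
    demo_ready_slow(&s);
    return s;
}

#ifdef DEMO_LEGACY_STRINGS
/* Legacy strings can still exist (the Windows case): a real check. */
#define DEMO_UNICODE_READY(s) ((s)->ready ? 0 : demo_ready_slow(s))
#else
/* Every creation path returns ready strings: constant success. */
#define DEMO_UNICODE_READY(s) ((void)(s), 0)
#endif
```

Code written against the macro compiles unchanged in both configurations, but a forgotten call only misbehaves in the legacy one, which is the bug-detection problem mentioned above.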
...
Let me elaborate on which functions are good and bad for PyUnicode.
Example of bad APIs:
...

PyUnicode_IS_COMPACT(): this API really relies on the *current*
implementation

Is that function really used?
As a quick and incomplete indication, GitHub seems to only find forks of
CPython and cpyext that contain it, at least from a quick skim through the
tons of identical results.
https://github.com/search?q=pyunicode_is_compact&type=Code
...

PyUnicode_READ()

That should be easy to emulate with a different internal representation,
though. Just use an arbitrary value for "kind" and ignore it on the way in.
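A rough sketch of that emulation for a UTF-8 backed string, written without Python.h. DEMO_READ, DEMO_ANY_KIND and demo_utf8_read are hypothetical names, not real API, and the input is assumed to be valid UTF-8 with the index in range. The "kind" argument is accepted and ignored; the cost is a scan from the start of the buffer, so it is O(index) rather than O(1).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the code point at character position `index` of valid UTF-8 data.
   Assumes `index` is in range. Cost is O(index), not O(1). */
static uint32_t demo_utf8_read(const unsigned char *data, size_t index)
{
    const unsigned char *p = data;
    uint32_t c;
    int extra;

    assert(data != NULL);
    /* Skip `index` code points: one lead byte plus its continuation bytes. */
    while (index > 0) {
        p++;
        while ((*p & 0xC0) == 0x80)
            p++;
        index--;
    }
    c = *p;
    if (c < 0x80)
        return c;                                   /* 1-byte sequence */
    if ((c & 0xE0) == 0xC0)      { c &= 0x1F; extra = 1; }
    else if ((c & 0xF0) == 0xE0) { c &= 0x0F; extra = 2; }
    else                         { c &= 0x07; extra = 3; }
    while (extra-- > 0) {
        p++;
        c = (c << 6) | (uint32_t)(*p & 0x3F);       /* append 6 payload bits */
    }
    return c;
}

/* Arbitrary "kind" value handed out to callers and ignored on the way in. */
#define DEMO_ANY_KIND 0
#define DEMO_READ(kind, data, index) \
    ((void)(kind), demo_utf8_read((const unsigned char *)(data), (index)))
```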
That reminds me that the current C-API lacks a way to calculate the size of
the character buffer in bytes. You have to use "length * kind", which
really relies on internals then. There should be a macro for this.
...

Py_UNICODE_strcmp(): uses Py_UNICODE which is an implementation detail

But given that "Py_UNICODE" is already deprecated … :)
...
Good API:

PyUnicode_Concat(): C API for str + str
PyUnicode_Split()
PyUnicode_FindChar()

Border line:

PyUnicode_IS_ASCII(op): it's an O(1) operation on CPython, but it can be
O(n) on other implementations (like PyPy which uses UTF-8). But we
also added str.isascii() in Python 3.7....

It's not like anyone asked PyPy to use UTF-8 internally. ;)
But even then, keeping a set of string property flags around really can't
be that expensive, and might easily pay off in encoder implementations,
e.g. by knowing the target buffer size upfront. ASCII-only strings are
still excessively common.
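As a sketch of how such a flag pays off in an encoder: with an is_ascii flag, the UTF-8 output size is known in O(1); without it, a counting pass is needed first. The demo_str struct and demo_utf8_size are hypothetical and not taken from CPython or PyPy; the string is assumed to hold valid Unicode scalar values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical string object that caches one property flag. */
typedef struct {
    const uint32_t *chars;   /* code points */
    size_t length;           /* number of code points */
    int is_ascii;            /* 1 if every code point is below 0x80 */
} demo_str;

/* Number of bytes needed to encode the string as UTF-8. */
static size_t demo_utf8_size(const demo_str *s)
{
    size_t i, n = 0;

    assert(s != NULL);
    if (s->is_ascii)
        return s->length;            /* O(1): one byte per character */
    for (i = 0; i < s->length; i++) {
        uint32_t c = s->chars[i];
        n += c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
    }
    return n;                        /* O(n) counting pass */
}
```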
...

PyUnicode_READ_CHAR()

I was going to write "is anyone actually using that, given how expensive it
is?", but then realised that we're using it in Cython to implement Unicode
string indexing. :) That's probably also its only use case.
...

PyUnicode_CompareWithASCIIString(): the function name announces
ASCII but decodes the byte string from Latin1 :-)

Which makes total sense, because, why wouldn't it? But the name is
obviously … underadapted.
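To spell out why Latin-1 is the natural choice here: decoding Latin-1 maps each byte to the code point with the same value, so a comparison can read the bytes directly with no decode step. The sketch below is a hypothetical illustration (demo_compare_latin1 is not CPython's implementation) and returns -1, 0 or 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compare `length` code points against a NUL-terminated byte string,
   treating every byte as the Latin-1 code point of the same value.
   Returns -1, 0 or 1 for less than, equal, greater than. */
static int demo_compare_latin1(const uint32_t *chars, size_t length,
                               const char *bytes)
{
    size_t i;

    assert(chars != NULL && bytes != NULL);
    for (i = 0; i < length; i++) {
        uint32_t b = (unsigned char)bytes[i];   /* identity "decoding" */
        if (b == 0)
            return 1;                /* byte string ended first */
        if (chars[i] != b)
            return chars[i] < b ? -1 : 1;
    }
    return bytes[i] != '\0' ? -1 : 0;   /* byte string is longer, or equal */
}
```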
Stefan

Stefan Behnel