Hi Stefan,
Thanks for your email; you asked many good questions :-) It seems like my documentation is incomplete, especially the rationale part. It's fine, I can complete it later. In the meantime, here are my answers inline.
2018-07-29 23:40 GMT+02:00 Stefan Behnel <python_capi@behnel.de>:
From Cython's POV, exposing internals is a good thing that helps making extension modules faster.
I'm fine with Cython wanting to bear the burden of following C API changes to get the best performance. But I would only allow Cython (and cffi) to use it, not all C extensions ;-)
Technically, I plan to keep the full API, with access to C structures and all the low-level stuff, available for specific use cases like Cython and debug tools. But at the end of my roadmap, it will be opt-in rather than the default.
Hopefully Cython exists to hide the ugly C API ;-)
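To give a concrete idea (nothing is decided yet, and the macro name below is purely hypothetical), the opt-in could be a define that a tool like Cython sets before including Python.h:

    /* Hypothetical opt-in: by default, only the "clean" future-proof API
     * would be visible; defining this macro would re-expose C structures
     * and other low-level details for Cython and debug tools. */
    #define Py_EXPOSE_INTERNALS 1   /* hypothetical macro name */
    #include <Python.h>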
*Hiding* internals would already break code, so I don't see the advantage over just *changing* the internals instead, but continuing to expose those new internals.
My main motivation to change the C API is to make it possible to change CPython itself. For example, I would like to experiment with a specialized implementation of the list type which would store small integers as a C array of int8_t, int16_t or int32_t to be more space efficient (I'm not sure that it would be faster, and I'm not really interested in having SIMD-like operations in the stdlib). Currently, PySequence_Fast_ITEMS() and the exposed PyListObject structure prevent experimenting with such an optimization, because PyListObject.ob_item leaks PyObject**. To be honest, I'm not sure that this specific optimization is worth it, but I like to give this example since it's easy to explain.
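To make the constraint concrete, here is a rough sketch (purely hypothetical, not an actual CPython patch) of such a specialized list, next to the kind of user code that today's C API makes perfectly legal and that such a layout would break:

    /* Hypothetical compact layout: there is no PyObject** array at all,
     * small integers are stored directly as int8_t/int16_t/int32_t. */
    typedef struct {
        PyObject_VAR_HEAD
        int item_size;   /* 1, 2 or 4 bytes per item */
        void *data;      /* int8_t*, int16_t* or int32_t* array */
    } PyCompactIntListObject;   /* hypothetical type */

    /* Perfectly valid code with today's C API: it assumes that a list
     * stores its items as a contiguous PyObject* array, which the layout
     * above simply does not have ('list' is a PyObject* known to be a list). */
    PyObject **items = PySequence_Fast_ITEMS(list);
    PyObject *first = items[0];   /* borrowed reference into the internal array */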
The problem we have is the heap of C extensions that are no longer (actively) maintained, not those that are maintained but use internals.
This is why my project has a "backward compatibility" page: https://pythoncapi.readthedocs.io/backward_compatibility.html
I would like to *remove* PyDict_GetItem(), but maybe we can provide a 3rd party C library which would reimplement PyDict_GetItem() on top of the new PyDict_GetItemRef() function which returns a strong reference.
Currently, the page only explains the other direction: modifying C extensions to use the new name (with a macro or something else falling back on PyDict_GetItem() on Python 3.7 and older).
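As a sketch of both directions (the exact signature of PyDict_GetItemRef() is still an open question; I assume here that it returns a new reference, or NULL on a missing key or error):

    /* Direction 1 (already on the page): C extensions switch to the new
     * name, and an inline function falls back to PyDict_GetItem() on
     * Python 3.7 and older. */
    #if PY_VERSION_HEX < 0x03080000
    static inline PyObject *
    PyDict_GetItemRef(PyObject *dict, PyObject *key)
    {
        PyObject *value = PyDict_GetItem(dict, key);   /* borrowed reference */
        Py_XINCREF(value);                             /* upgrade to a strong reference */
        return value;
    }
    #endif

    /* Direction 2 (the 3rd party compatibility library): reimplement the
     * old borrowed-reference PyDict_GetItem() on top of the new function. */
    static PyObject *
    compat_PyDict_GetItem(PyObject *dict, PyObject *key)
    {
        PyObject *value = PyDict_GetItemRef(dict, key);   /* strong reference or NULL */
        if (value == NULL) {
            PyErr_Clear();   /* the old function swallows errors */
            return NULL;
        }
        /* The dict still owns a reference, so dropping ours hands the caller
         * a borrowed reference, matching the old contract. This only works
         * with a runtime where the dict really keeps the value alive, which
         * is exactly why borrowed references block other designs. */
        Py_DECREF(value);
        return value;
    }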
Basically, PyPy shows that, given enough developer time, the C-API can be emulated even based on a very different runtime design, potentially with a handicap in terms of performance. If some parts of the C-API are to be replaced, that might be a way to go.
"potentially with a handicap in terms of performance"
Multiple PyPy developers told me that cpyext remains one of the most important blockers for moving an application away from CPython. I wouldn't say that it's a solved issue.
Moreover, I'm not sure that optimizing cpyext is the favorite task of PyPy developers. There are likely other parts of PyPy which deserve more love than cpyext, no? :-)
But I'm just guessing here, I would prefer to hear directly from PyPy developers ;-)
I also don't buy the argument that binary modules built for, say, Py3.6 must continue to import on Py3.9, for example. Supporting the last couple of supported releases with binary wheels has proven good enough IMHO, and rebuilding for a new CPython release seems acceptable, given that this also enables the use of new features. (Would be something to ask distributors, though.)
I created the pythoncapi project between two flights, so sorry, my rationale may still be incomplete :-)
From the point of view of Red Hat, a Linux vendor, having to support multiple Python versions is a pain, especially for QA testing. Currently, the compromise is to only provide one Python version per OS release. For example, Fedora 28 only supports Python 3.6, even though Python 3.7 was released during Fedora 28's lifetime. For Fedora, in practice, it's fine, since there is a release every 6 months. Ubuntu LTS is supported for 5 years, so having an old Python version can be more annoying. And then there is RHEL, which is supported for 10 years (up to 15 years for extended support). On that scale, the Python release schedule doesn't fit well with the RHEL support period.
By "supported Python version", I not only mean the /usr/bin/pythonX.Y binary, but also packages for dozens of Python modules. Fedora 28 provides Python binaries for various Python versions (2.7, 2.7, 3.4, 3.5, 3.6, 3.7 if I recall correctly), but it has only python3-* modules for Python 3.6.
Supporting 2 Python versions, like 3.6 and 3.7, means doubling the size of the repository, but also doubling the work for the QA team (each time a new package version is released, usually for bugfixes). What if you want to support 3 Python versions in parallel, if not more?
... in the meanwhile, macOS is stuck at Python 2.7 :-) macOS users: how much do you like Python 2.7 in 2018?
This is one issue.
Another issue is the Python binary compiled in debug mode, known as python-dbg (or python-debug or python-debuginfo). Right now, it's mostly useless since Linux distributions don't provide two flavors of Python modules (release and debug modes): you have to manually recompile, in debug mode, all of the C extensions used by your application. Good luck with installing build dependencies and handling compilation errors. Because of that, nobody uses the debug build, even though it's super useful for debugging C extensions. As a consequence, we (Python upstream, but also Linux vendors) get bug reports where a C extension crashed and we are unable to debug it (oh, gc.collect() crashed on an invalid object, deal with that!).
Moreover, right now, it's unclear whether the C API is designed for CPython internals or for third-party use, and whether it should check all arguments or not. Some functions check a few arguments, some others don't. For the functions which do check arguments, you get a slowdown even if your whole application uses the C API properly. It's like running a kind of debug build in production. Would you deploy a C program compiled with assertions in production once you have checked that your application is bug-free? Why should we have to pay the price of this "debug mode" in a Python compiled in "release mode"?
I would like to be able to remove most debug checks from a *release* build, but also be able to run C extensions with a *different runtime* which would be Python compiled in debug mode.
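As a sketch of what I mean (hypothetical code, not how CPython is written today), an argument check inside CPython could be compiled only into the debug runtime, so that "python3-dbg" catches the misuse while "python3" stays fast:

    /* Hypothetical: the check only exists in the debug runtime, so the same
     * C extension would run fast on "python3" and be checked on "python3-dbg". */
    PyObject *
    PyExample_GetAttr(PyObject *obj, PyObject *name)   /* hypothetical function */
    {
    #ifdef Py_DEBUG
        if (obj == NULL || name == NULL || !PyUnicode_Check(name)) {
            PyErr_BadInternalCall();
            return NULL;
        }
    #endif
        return PyObject_GetAttr(obj, name);   /* no checks in release mode */
    }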
So, from my POV, I'd vote for
- allowing C-API changes in each X.Y release
Which kind of changes do you want to do?
- requiring a new binary wheel (or rebuild) for each X.Y release
It doesn't solve the issue of being stuck with one Python version per OS release.
- providing a compatibility layer for "removed" C-API functionality
Above, I proposed to require a *library* for that. But you would only be able to use such a library with a Python runtime which remains fully compatible with Python 3.7. No specialized list for you in that case! That's the price of backward compatibility.
This is also where I would like to allow multiple Python "runtimes" per Python version:
- CPython compiled in release mode with backward compatibility: "python3"
- CPython compiled in debug mode: "python3-dbg"
- experimental CPython, maybe faster: "experimental_python3", for example with the specialized list, and therefore incompatible with PyDict_GetItem() and borrowed references
Technically, in CPython, these can be 3 different compilation modes of the same code base.
But I would also like to let people run their own experiments with their own CPython forks, again without losing support for C extensions!
- exposing any internals that may help extension modules
In my current roadmap, there is: "Step 4: if step 3 went fine and most people are still ok to continue, make the new C API the default in CPython and add an option to opt out."
The "opt-out" option is the existing API which leaks all implementation details.
- maybe add a warning to the docs of exposed internals that these are more likely to change than other parts of the C-API
Yes, we have to work on the C API documentation of CPython. Right now, I'm still at the first step of my roadmap:
"Step 1: Identify Bad C API and list functions that should be modified or even removed"
A next step would be to start documenting which APIs are "bad" in the CPython documentation. Maybe start by adding something like a "provisional deprecation" warning, but only in the documentation. Or a real deprecation, but only in the docs, if we manage to agree on APIs that should go away.
I'd also suggest to make Cython, pybind11 and cffi (maybe a few more) the preferred and official ways to extend and integrate with CPython, to keep those three up to date with all C-API changes, and to make it as easy as possible for users to build their code with them against new CPython releases.
I'm not really worried about new C extensions which already use "modern" solutions like Cython and cffi.
My concern is the very long tail of C extensions which call the C API directly. I'm sure that we can enhance the C API somehow without breaking this long tail.
If you want a more radical proposal, I'd deprecate the C-API documentation, push people into not caring about the C-API themselves, and then concentrate on keeping the major code integration tools out there compatible and fast with whatever CPython can provide as "exposed internals".
Honestly, at this point, I'm open to any idea! But I'm not ok with "breaking the world": that plan is not going to work. Even though PyPy has been promoting cffi for years, the C API remains very popular and commonly used.
I'm not sure that deprecating the C API or its documentation would help. In 2018, ten years after Python 3.0 was released, we are still discussing how to migrate old code bases away from Python 2, even though there are many tools which do "most" of the migration. I'm not even aware of tools to rewrite a C extension using Cython or cffi. Even if such a tool existed, why would anyone take the risk of a regression when their C extensions currently work perfectly on CPython?
My problem is to find a way to change the C API without forcing C extension authors to change their code "too much", maybe using new compatibility layers.
Victor