[Python-ideas] PEP: Hide implementation details in the C API

Nick Coghlan ncoghlan at gmail.com
Tue Jul 11 23:30:19 EDT 2017


Commenting more on specific technical details rather than just tone this time :)

On 11 July 2017 at 20:19, Victor Stinner <victor.stinner at gmail.com> wrote:
> PEP: xxx
> Title: Hide implementation details in the C API
> Version: $Revision$
> Last-Modified: $Date$
> Author: Victor Stinner <victor.stinner at gmail.com>,
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 31-May-2017
>
>
> Abstract
> ========
>
> Modify the C API to remove implementation details. Add an opt-in option
> to compile C extensions to get the old full API with implementation
> details.
>
> The modified C API makes it easier to experiment with new optimizations:
>
> * Indirect Reference Counting
> * Remove Reference Counting, New Garbage Collector
> * Remove the GIL
> * Tagged pointers
>
> Reference counting may be emulated in a future implementation for
> backward compatibility.

I don't believe this is the best rationale to use for the PEP, as we
(or at least I) have emphatically promised *not* to do another Python
3 style compatibility break, and we know from PyPy's decade of
challenges that a lot of Python's users care even more about CPython C
API/ABI compatibility than they do the core data model.

It also has the downside of not really being true, since *other
implementations* are happily experimenting with alternative
approaches, and projects like PyMetabiosis attempt to use CPython
itself as an adapter between other runtimes and the full C API for
those extension modules that need it.

What is unequivocally true though is that in the current C API:

1. We're not sure which APIs other projects (including extension
module generators and helper libraries like Cython, Boost, PyCXX,
SWIG, cffi, etc) are *actually* relying on.
2. It's easy for us to accidentally expand the public C API without
thinking about it, since Py_BUILD_CORE guards are opt-in and
Py_LIMITED_API guards are opt-out.
3. We haven't structured our header files in a way that makes it
obvious at a glance which API we're modifying (internal API, public
API, stable ABI).
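To make the opt-in/opt-out asymmetry in point 2 concrete, here's a
self-contained C sketch using placeholder functions (internal_detail
and cpython_specific are illustrative names, not real CPython symbols):

```c
/* Sketch of the guard asymmetry, with placeholder functions rather
 * than real CPython declarations. */

/* Opt-in guard: the internal API is invisible unless the build
 * explicitly defines Py_BUILD_CORE.  Forgetting this guard on a new
 * internal symbol silently makes the symbol public. */
#ifdef Py_BUILD_CORE
static int internal_detail(void) { return 1; }
#endif

/* Opt-out guard: a symbol is part of the stable ABI unless someone
 * remembers to fence it off.  Forgetting this guard on a new
 * CPython-specific function silently expands the stable ABI. */
#ifndef Py_LIMITED_API
static int cpython_specific(void) { return 2; }
#endif

/* Count which placeholder symbols are visible in this build. */
static int visible_api_sum(void)
{
    int sum = 0;
#ifdef Py_BUILD_CORE
    sum += internal_detail();
#endif
#ifndef Py_LIMITED_API
    sum += cpython_specific();
#endif
    return sum;  /* 2 in a default build: only the opt-out symbol shows */
}
```

Both guards do their job when present; the asymmetry is in what
happens when one of them is forgotten.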

> Rationale
> =========
>
> History of CPython forks
> ------------------------
>
> Over the last 10 years, CPython has been forked multiple times to
> attempt different enhancements:
>
> * Unladen Swallow: add a JIT compiler based on LLVM
> * Pyston: add a JIT compiler based on LLVM (CPython 2.7 fork)
> * Pyjion: add a JIT compiler based on Microsoft CLR
> * Gilectomy: remove the Global Interpreter Lock nicknamed "GIL"
> * etc.
>
> Sadly, none of these projects has been merged back into CPython.
> Unladen Swallow lost its funding from Google, Pyston lost its funding
> from Dropbox, and Pyjion is developed in the limited spare time of two
> Microsoft employees.
>
> One hard technical issue which blocked these projects from really
> unleashing their power is the C API of CPython.

This is a somewhat misleadingly one-sided presentation of Python's
history, as the broad access to CPython internals offered by the C API
is precisely what *enabled* the scientific Python stack (including
NumPy, SciPy, Pandas, scikit-learn, Cython, Numba, PyCUDA, etc) to
develop largely independently of CPython itself.

So for folks that are willing to embrace the use of Cython (and
extension modules in general), many of CPython's runtime limitations
(like the GIL and the overheads of working with boxed values) can
already be avoided by pushing particular sections of code closer to C
semantics than they are to traditional Python semantics.

We've also been working to bring the runtime semantics of extension
modules ever closer to those of pure Python modules, to the point
where Python 3.7 is likely to be able to run an extension module as
__main__ (see https://www.python.org/dev/peps/pep-0547/ for details).

> Many old technical choices
> of CPython are hardcoded in this API:
>
> * reference counting
> * garbage collector
> * C structures like PyObject which contain headers for reference
>   counting and the garbage collector
> * specific memory allocators
> * etc.
>
> PyPy
> ----
>
> PyPy uses more efficient structures and a more efficient garbage
> collector without reference counting. Thanks to that (but also many
> other optimizations), PyPy manages to run Python code up to 5x faster
> than CPython.

This framing makes it look a bit like you're saying "It's hard for
PyPy to correctly emulate these aspects of CPython, so we should
eliminate them as a barrier to adoption for PyPy by breaking them for
currently happy CPython users as well". I don't think that's really
a framing you want to run with in the near term, as it's going to
start a needless fight, when there's plenty of unambiguously
beneficial work that could be done before anyone starts contemplating
any kind of API compatibility break :)

In particular, better segmenting our APIs into "solely for CPython's
internal use", "ABI is specific to a CPython version", "API is
portable across Python implementations", "ABI is portable across
CPython versions (and maybe even Python implementations)" allows
tooling developers and extension module authors to make more informed
decisions about how closely they want to couple their work to CPython
specifically.

And then *after* we've done that API clarification work, *then* we can
ask the question about what the default behaviour of "#include
<Python.h>" should be, and perhaps introduce an opt-in Py_CPYTHON_API
flag to request access to the full traditional C API for extension
modules and embedding applications that actually need it. (While
that's still a compatibility break, it's one that can be trivially
resolved by putting an unconditional "#define Py_CPYTHON_API" before
the Python header inclusion for projects that find they were actually
relying on CPython specifics)

> Plan made of multiple small steps
> =================================
>
> Step 1: split Include/ into subdirectories
> ------------------------------------------
>
> Split the ``Include/`` directory of CPython:
>
> * ``python`` API: ``Include/Python.h`` remains the default C API
> * ``core`` API: ``Include/core/Python.h`` is a new C API designed for
>   building Python
> * ``stable`` API: ``Include/stable/Python.h`` is the stable ABI
>
> Expect declarations to be duplicated on purpose: ``#include`` should
> not be used to include files from a different API, to prevent
> mistakes. In the past, too many functions were exposed *by mistake*;
> in particular, several symbols were unintentionally exported to the
> stable ABI.
>
> At this point, ``Include/Python.h`` is not changed at all: zero risk of
> backward incompatibility.
>
> The ``core`` API is the most complete API, exposing *all*
> implementation details and using macros for best performance.

This part I like, although as Eric noted, we can avoid making
wholesale changes to the headers of our implementation files by
putting a Py_BUILD_CORE guard around the inclusion of an
"Include/core/_CPython.h" header from "Include/Python.h".
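Concretely, that might look something like the following near the top
of "Include/Python.h" (a sketch only; "core/_CPython.h" is a
hypothetical header name, not an existing file):

```c
/* Hypothetical fragment of Include/Python.h: pull in the internal
 * headers only when building the interpreter itself. */
#ifdef Py_BUILD_CORE
#  include "core/_CPython.h"   /* full internal API */
#endif
/* ...the existing public header inclusions continue as today... */
```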

> XXX should we abandon the stable ABI? Never really used by anyone.

It's also not available in Python 2.7, so anyone straddling the 2/3
boundary isn't currently able to rely on it.

As folks become more willing to drop Python 2.7 support, then
expending the effort to start targeting the stable ABI instead becomes
more attractive (especially for extension module creation tools like
Cython, cffi, and SWIG), since the stable ABI usage can *replace* the
code that uses the traditional CPython API.
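For comparison, this is all it takes to target the stable ABI today;
the version value uses the usual PY_VERSION_HEX encoding, with
0x03030000 pinning the 3.3 ABI:

```c
/* In an extension module's source, before any Python header: */
#define Py_LIMITED_API 0x03030000
#include <Python.h>
/* Only stable-ABI declarations are visible from here on, and the
 * resulting binary should load on any CPython >= 3.3. */
```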

> Step 2: Add an opt-in API option to tools building packages
> -----------------------------------------------------------
>
> Modify Python packaging tools (distutils, setuptools, flit, pip, etc.)
> to add an opt-in option to choose the API: ``python``, ``core`` or
> ``stable``.
>
> For example, debuggers like ``vmprof`` need the ``core`` API to get
> full access to implementation details.
>
> XXX handle backward compatibility for packaging tools.

For handcoded extensions, defining which API to use would be part of
the C/C++ code. For generated extensions, it would be an option passed
to Cython, cffi, etc.

Packaging frontends shouldn't need to explicitly support it any more
than they explicitly support the stable ABI today.

> Step 3: first pass of implementation detail removal
> ---------------------------------------------------
>
> Modify the ``python`` API:
>
> * Add a new ``API`` subdirectory in the Python source code which will
>   "implement" the Python C API
> * Replace macros with functions. The implementation of new functions
>   will be written in the ``API/`` directory. For example, Py_INCREF()
>   becomes the function ``void Py_INCREF(PyObject *op)`` and its
>   implementation will be written in the ``API`` directory.
> * Slowly remove more and more implementation details from this API.
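As a concrete illustration of the macro-to-function change Victor
describes (simplified; the real Py_INCREF also updates debug-build
counters):

```c
/* Today (simplified): a macro that bakes the PyObject layout into
 * every extension binary that uses it. */
#define Py_INCREF(op) (((PyObject *)(op))->ob_refcnt++)

/* Under the proposal: an opaque function implemented in API/, so
 * the reference-count layout stays an implementation detail. */
void Py_INCREF(PyObject *op);
```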

I'd suggest doing this slightly differently by ensuring that the APIs
are defined as strict supersets of each other as follows:

1. CPython internal APIs (Py_BUILD_CORE)
2. CPython C API (status quo, currently no qualifier)
3. Portable Python API (new, starts as equivalent to stable ABI)
4. Stable Python ABI (Py_LIMITED_API)

The two new qualifiers would then be:

    #define Py_CPYTHON_API
    #define Py_PORTABLE_API

And Include/Python.h would end up looking something like this:

    [Common configuration includes would still go here]

    #ifdef Py_BUILD_CORE
      #include "core/_CPython.h"
    #else
      #ifdef Py_LIMITED_API
        #include "stable/Python.h"
      #else
        #ifdef Py_PORTABLE_API
          #include "portable/Python.h"
        #else
          #define Py_CPYTHON_API
          #include "cpython/Python.h"
        #endif
      #endif
    #endif

At some future date, the default could then potentially switch to
being the portable API for the current Python version, with folks
having to opt-in to using either the full CPython API or the portable
API for an older version.

To avoid having to duplicate prototype definitions, and to ensure that
C compilers complain when we inadvertently redefine a symbol
differently from the way a more restricted API defines it, each API
superset would start by including the next narrower API.

So we'd have this:

Include/stable/Python.h:

    [No special preamble, as it's the lowest common denominator API]

Include/portable/Python.h:

    #define Py_LIMITED_API Py_PORTABLE_API
    #include "../stable/Python.h"
    #undef Py_LIMITED_API
    [Any desired API additions and overrides]

Include/cpython/Python.h:

    #include "../patchlevel.h"
    #define Py_PORTABLE_API PY_VERSION_HEX
    #include "../portable/Python.h"
    #undef Py_PORTABLE_API
    [Include the rest of the current public C API]

Include/core/_CPython.h:

    #ifndef Py_BUILD_CORE
    #error "Internal headers are only available when building CPython"
    #endif
    #include "../cpython/Python.h"
    [Include the rest of the internal C API]

And at least initially, the subdirectories would be mostly empty -
instead, we'd have the following setup:

1. Unported headers would remain directly in "Include/" and be
included from "Include/Python.h"
2. Ported headers would have their contents split between core,
cpython, and stable based on their #ifdef chains
3. When porting, the more expansive APIs would use "#undef" as needed
when overriding a symbol deliberately

And then, once all the APIs had been clearly categorised in a way that
C compilers can better help us manage, the folks that were interested
in this could start building key extension modules (such as NumPy and
lxml) using "Py_PORTABLE_API=0x03070000", and *adding* to the portable
API on an explicitly needs-driven basis.
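For an extension like NumPy, that opt-in might eventually look like
this (hypothetical: Py_PORTABLE_API is only proposed above and doesn't
exist today):

```c
/* Before any Python header in the extension's source: */
#define Py_PORTABLE_API 0x03070000
#include <Python.h>
/* Any API the module uses that isn't in the portable set now fails
 * to compile, producing a concrete, needs-driven list of candidates
 * to add to the portable API. */
```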

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

