[Numpy-discussion] draft NEP for breaking ufunc ABI in a controlled way

Jaime Fernández del Río jaime.frio at gmail.com
Mon Sep 21 10:29:08 EDT 2015


We already have the PyArrayObject vs. PyArrayObject_fields definitions in
ndarraytypes.h, which are used to enforce access to the members through inline
functions rather than directly. That seems to me like the right way to go
here too: leave no stone unturned, hide everything, and provide PyUFunc_NIN,
PyUFunc_NOUT and friends to handle those members as well.
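
As a rough sketch of the hide-everything-behind-accessors pattern (all the Demo* names are made up for illustration and are not NumPy API; the real PyArrayObject split differs in detail):

```c
#include <assert.h>
#include <stddef.h>

/* All Demo* names are hypothetical; none of them are NumPy API. */

/* Internal layout: only the library itself is meant to name these members. */
typedef struct {
    int nin;
    int nout;
    int nargs;
    void *internal_state;   /* free to change between releases */
} DemoUFunc_fields;

/* Public handle: an incomplete type, so user code cannot touch members. */
typedef struct DemoUFunc_tag DemoUFunc;

/* Accessors are the only supported way to read the hidden members. */
static inline int DemoUFunc_NIN(const DemoUFunc *uf)
{
    assert(uf != NULL);
    return ((const DemoUFunc_fields *)uf)->nin;
}

static inline int DemoUFunc_NOUT(const DemoUFunc *uf)
{
    assert(uf != NULL);
    return ((const DemoUFunc_fields *)uf)->nout;
}

static inline int DemoUFunc_NARGS(const DemoUFunc *uf)
{
    assert(uf != NULL);
    return ((const DemoUFunc_fields *)uf)->nargs;
}
```

Note that static inline accessors still compile the field offsets into the extension, so this gives source-level hiding only; keeping already-built binaries working across a layout change would need out-of-line functions.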

On Sun, Sep 20, 2015 at 9:13 PM, Nathaniel Smith <njs at pobox.com> wrote:

> Hi all,
>
> Here's a first draft NEP for comments.
>
> --
>
> Synopsis
> ========
>
> Improving numpy's dtype system requires that ufunc loops start having
> access to details of the specific dtype instance they are acting on:
> e.g. an implementation of np.equal for strings needs access to the
> dtype object in order to know what "n" to pass to strncmp. Similar
> issues arise with variable length strings, missing values, categorical
> data, unit support, datetime with timezone support, etc. -- this is a
> major blocker for improving numpy.
>
> Unfortunately, the current ufunc inner loop function signature makes
> it very difficult to provide this information. We might be able to
> wedge it in there, but it'd be ugly.
>
> The other option would be to change the signature. What would happen
> if we did this? For most common uses of the C API/ABI, we could do
> this easily while maintaining backwards compatibility. But there are
> also some rarely-used parts of the API/ABI that would be
> prohibitively difficult to preserve.
>
> In addition, there are other potential changes to ufuncs on the
> horizon (e.g. extensions of gufuncs to allow them to be used more
> generally), and the current API exposure is so massive that any such
> changes will be difficult to make in a fully compatible way. This NEP
> thus considers the possibility of closing down the ufunc API to a
> minimal, maintainable subset of the current API.
>
> To better understand the consequences of this potential change, I
> performed an exhaustive analysis of all the code on Github, Bitbucket,
> and Fedora, among others. The results make me highly confident that of
> all the publicly available projects in the world, the only ones
> which touch the problematic parts of the ufunc API are: Numba,
> dynd-python, and `gulinalg <https://github.com/ContinuumIO/gulinalg>`_
> (with the latter's exposure being trivial).
>
> Given this, I propose that for 1.11 we:
> 1) go ahead and hide/disable the problematic parts of the ABI/API,
> 2) coordinate with the known affected projects to minimize disruption
> to their users (which is made easier since they are all projects that
> are almost exclusively distributed via conda, which enforces strict
> NumPy ABI versioning),
> 3) publicize these changes widely so as to give any private code that
> might be affected a chance to speak up or adapt, and
> 4) leave the "ABI version tag" as it is, so as not to force rebuilds
> of the vast majority of projects that will be unaffected by these
> changes.
>
> This NEP defers the question of exactly what the improved API should
> be, since there's no point in trying to nail down the details until
> we've decided whether it's even possible to change.
>
>
> Details
> =======
>
> The problem
> -----------
>
> Currently, a ufunc inner loop implementation is called via the
> following function prototype::
>
>     typedef void (*PyUFuncGenericFunction)
>                 (char **args,
>                  npy_intp *dimensions,
>                  npy_intp *strides,
>                  void *innerloopdata);
>
> Here ``args`` is an array of pointers to 1-d buffers of input/output
> data, ``dimensions`` is a pointer to the number of entries in these
> buffers, ``strides`` is an array of integers giving the strides for
> each input/output array, and ``innerloopdata`` is an arbitrary void*
> supplied by whoever registered the ufunc loop. (For gufuncs, extra
> shape and stride information about the core dimensions also gets
> packed into the ends of these arrays in a somewhat complicated way.)
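
To make that calling convention concrete, here is a minimal sketch of a loop with the legacy signature that adds two arrays of doubles. The `npy_intp` typedef is a stand-in so the sketch compiles without NumPy headers, and `double_add_loop` is a hypothetical name.

```c
#include <stddef.h>

/* Stand-in for NumPy's npy_intp so the sketch builds without NumPy headers. */
typedef ptrdiff_t npy_intp;

/* Legacy-style inner loop: out[i] = a[i] + b[i] for doubles.
 * args[0], args[1] are the inputs and args[2] is the output;
 * dimensions[0] is the number of elements; strides are in bytes. */
static void double_add_loop(char **args, npy_intp *dimensions,
                            npy_intp *strides, void *innerloopdata)
{
    char *in1 = args[0];
    char *in2 = args[1];
    char *out = args[2];
    npy_intp n = dimensions[0];
    npy_intp i;

    (void)innerloopdata;    /* unused by this loop */

    for (i = 0; i < n; i++) {
        *(double *)out = *(const double *)in1 + *(const double *)in2;
        in1 += strides[0];
        in2 += strides[1];
        out += strides[2];
    }
}
```

Nothing in the argument list says what the bytes mean: the loop simply has to be registered for the right type, which is the limitation discussed next.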
>
> There are 4 key items that define a NumPy array: data, shape, strides,
> dtype. Notice that this function only gets access to 3 of them. Our
> goal is to fix that. For example, a better signature would be::
>
>     typedef void (*PyUFuncGenericFunction_NEW)
>                 (char **data,
>                  npy_intp *shapes,
>                  npy_intp *strides,
>                  PyArray_Descr *dtypes,   /* NEW */
>                  void *innerloopdata);
>
> (In practice I suspect we might want to make some more changes as
> well, like upgrading gufunc core shape/strides to proper arguments
> instead of tacking it onto the existing arrays, and adding an "escape
> valve" void* reserved for future extensions. But working out such
> details is outside the scope of this NEP; the above will do for
> illustration.)
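
For illustration only, here is how the string-equality case from the synopsis might look with a dtype-aware signature. This is a sketch under assumptions: `DemoDescr` is a hypothetical stand-in for a dtype descriptor, `dtypes` is taken to be an array with one descriptor pointer per operand, and both inputs are assumed to have the same fixed width.

```c
#include <stddef.h>
#include <string.h>

typedef ptrdiff_t npy_intp;                 /* stand-in for NumPy's npy_intp */
typedef struct { int elsize; } DemoDescr;   /* hypothetical dtype stand-in */

/* Element-wise equality of fixed-width byte strings.
 * The width (the "n" for strncmp) comes from the dtype, which the
 * legacy signature has no way to pass in. */
static void string_equal_loop(char **data, npy_intp *shapes,
                              npy_intp *strides, DemoDescr **dtypes,
                              void *innerloopdata)
{
    char *in1 = data[0];
    char *in2 = data[1];
    char *out = data[2];
    size_t width = (size_t)dtypes[0]->elsize;  /* assumes equal input widths */
    npy_intp i;

    (void)innerloopdata;    /* still free for genuine per-loop data */

    for (i = 0; i < shapes[0]; i++) {
        *(unsigned char *)out = (unsigned char)(strncmp(in1, in2, width) == 0);
        in1 += strides[0];
        in2 += strides[1];
        out += strides[2];
    }
}
```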
>
> The goal of this NEP is to clear the ground so that we can start
> supporting ufunc inner loops that take dtype arguments, and make other
> enhancements to ufunc functionality going forward.
>
>
> Proposal
> --------
>
> Currently, the public API/ABI for ufuncs consists of the functions::
>
>     PyUFunc_GenericFunction
>
>     PyUFunc_FromFuncAndData
>     PyUFunc_FromFuncAndDataAndSignature
>     PyUFunc_RegisterLoopForDescr
>     PyUFunc_RegisterLoopForType
>
>     PyUFunc_ReplaceLoopBySignature
>     PyUFunc_SetUsesArraysAsData
>
> together with direct access to PyUFuncObject's internal fields::
>
>     typedef struct {
>         PyObject_HEAD
>         int nin, nout, nargs;
>         int identity;
>         PyUFuncGenericFunction *functions;
>         void **data;
>         int ntypes;
>         int check_return;
>         const char *name;
>         char *types;
>         const char *doc;
>         void *ptr;
>         PyObject *obj;
>         PyObject *userloops;
>         int core_enabled;
>         int core_num_dim_ix;
>         int *core_num_dims;
>         int *core_dim_ixs;
>         int *core_offsets;
>         char *core_signature;
>         PyUFunc_TypeResolutionFunc *type_resolver;
>         PyUFunc_LegacyInnerLoopSelectionFunc *legacy_inner_loop_selector;
>         PyUFunc_InnerLoopSelectionFunc *inner_loop_selector;
>         PyUFunc_MaskedInnerLoopSelectionFunc *masked_inner_loop_selector;
>         npy_uint32 *op_flags;
>         npy_uint32 iter_flags;
>     } PyUFuncObject;
>
> Obviously almost any future changes to how ufuncs work internally will
> involve touching some part of this public API/ABI.
>
> Concretely, the proposal here is that we avoid this by disabling the
> following functions (i.e., any attempt to call them should simply
> raise a ``NotImplementedError``)::
>
>     PyUFunc_ReplaceLoopBySignature
>     PyUFunc_SetUsesArraysAsData
>
> and that we reduce the publicly visible portion of PyUFuncObject down to::
>
>     typedef struct {
>         PyObject_HEAD
>         int nin, nout, nargs;
>     } PyUFuncObject;
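
One way such a truncated public struct could coexist with a richer internal one is to embed the public part as the first member of a private struct. The sketch below uses hypothetical Demo* names and a stand-in for `PyObject_HEAD` so it compiles without Python.h; it is not a statement of how NumPy would implement this.

```c
#include <stddef.h>

/* Hypothetical stand-in for PyObject_HEAD. */
typedef struct {
    long refcnt;
    void *type;
} DemoObjectHead;

/* What extension code would be allowed to see. */
typedef struct {
    DemoObjectHead head;
    int nin, nout, nargs;
} DemoUFuncPublic;

/* What the library keeps to itself. The public part must come first so
 * that a pointer to the private struct is also a valid pointer to it. */
typedef struct {
    DemoUFuncPublic pub;
    void **loops;       /* internal; free to change between releases */
    int nloops;
} DemoUFuncPrivate;

/* Hand a ufunc out to user code. */
static DemoUFuncPublic *demo_as_public(DemoUFuncPrivate *uf)
{
    return &uf->pub;
}

/* Recover the full object inside the library. */
static DemoUFuncPrivate *demo_as_private(DemoUFuncPublic *uf)
{
    return (DemoUFuncPrivate *)uf;
}
```

Everything after the public prefix can then be reordered or replaced without changing the offsets that existing extensions were compiled against.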
>
>
> Data on current API/ABI usage
> -----------------------------
>
> In order to assess how much code would be affected by this proposal, I
> used a combination of Github search and Searchcode.com to trawl
> through the majority of all publicly available open source code.
> Neither search tool provides a fine-grained enough query language to
> directly tell us what we want to know, so I instead followed the
> strategy of first, casting a wide net: picking a set of search terms
> that are likely to catch all possibly-broken code (together with many
> false positives), and second, using automated tools to sift out the
> false positives and see what remained. Altogether, I reviewed 4464
> search results.
>
> The tool I wrote to do this is `available on github
> <https://github.com/njsmith/codetrawl>`_, and so is `the analysis code
> itself <https://github.com/njsmith/ufunc-abi-analysis>`_.
>
>
> Uses of PyUFuncObject internals
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> There are no functions in the public API which return
> ``PyUFuncObject*`` values directly, so any code that accesses
> PyUFuncObject fields will have to mention that token in the course of
> defining a variable, performing a cast, setting up a typedef, etc.
>
> Therefore, I searched Github for all files written in C, C++,
> Objective C, Python, or Cython, which mentioned either "PyUFuncObject
> AND types" or "PyUFuncObject AND NOT types". (This is to work around
> limitations on how many results Github search is willing to return to
> a single query.) In addition, I searched for ``PyUFuncObject`` on
> searchcode.com.
>
> The full report on these searches is available here:
>
> https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/pyufuncobject-report.html
>
> The following were screened out as non-problems:
>
> - Copies of NumPy itself (an astonishing number of people have checked
> in copies of it to their own source tree)
> - NumPy forks / precursors / etc. (e.g. Numeric also had a type called
> PyUFuncObject, the "bohrium" project has a fork of numpy 1.6, etc.)
> - Cython-generated boilerplate used to generate the "object has
> changed size" warning (which we `unconditionally filter out anyway
> <https://github.com/numpy/numpy/blob/master/numpy/__init__.py#L226>`_)
> - Lots of calls to ``PyUFunc_RegisterLoopForType`` and friends, which
> require casting the first argument to ``PyUFuncObject*``
> - Misc. other unproblematic stuff (like Cython header declarations
> that never get used)
>
> There were also several cases that actually referenced PyUFuncObject
> internal fields:
>
> - The "rational" dtype from numpy-dtypes, which is used in a few
> projects, accesses ``ufunc->nargs`` as a safety check, but does not
> touch any other fields (`see here
> <https://github.com/numpy/numpy-dtypes/blob/c0175a6b1c5aa89b4520b29487f06d0e200e2a03/npytypes/rational/rational.c#L1140-L1151>`_).
>
> - Numba: does some rather elaborate things to support the definition
> of on-the-fly JITted ufuncs. These seem to be clear deficiencies in
> the ufunc API (e.g., there's no good way to control the lifespan of
> the array of function pointers passed to ``PyUFunc_FromFuncAndData``),
> so we should work with them to provide the API they need to do this in
> a maintainable way. Some of the relevant code:
>
>   https://github.com/numba/numba/tree/master/numba/npyufunc
>
> https://github.com/numba/numba/blob/98752647a95ac6c9d480e81ca5c8afcfa3ddfd18/numba/npyufunc/_internal.c
>
> - dynd-python: Contains some code that attempts to extract the inner
> loops from a numpy ufunc object and wrap them into the dynd 'ufunc'
> equivalent:
>
> https://github.com/libdynd/dynd-python/blob/c06f8fc4e72257abac589faf76f10df8c045159b/dynd/src/numpy_ufunc_kernel.cpp
>
> - gulinalg: I'm not sure if anyone is still using this code since most
> of it was merged into numpy itself, but it's not a big deal in any
> case: all it contains is a `debugging function
> <https://github.com/ContinuumIO/gulinalg/blob/2ef365c48427c026dab4f45dc6f8b1b9af184460/gulinalg/src/gulinalg.c.src#L527-L550>`_
> that dumps some internal fields from the PyUFuncObject. If you look,
> though, all calls to this function are already commented out :-).
>
> In the course of this analysis, it was also noted that the standard
> Cython pxd files contain a wrapper for ufunc objects::
>
>     cdef class ufunc [object PyUFuncObject]:
>         ...
>
> which means that Cython code can access internal struct fields via an
> object of type ``ufunc``, and thus escape our string-based search
> above. Therefore I also examined all Cython files on Github or
> searchcode.com that matched the query ``ufunc``, and searched for any
> lines matching any of the following regular expressions::
>
>     cdef\W+ufunc
>        catches: 'cdef ufunc fn'
>     cdef\W+.*\.\W*ufunc
>        catches: 'cdef np.ufunc fn'
>     <.*ufunc\W*>
>        catches: '(<ufunc> fn).nargs', '(< np.ufunc > fn).nargs'
>     cdef.*\(.*ufunc
>        catches: 'cdef doit(np.ufunc fn, ...):'
>
> (I considered parsing the actual source and analysing it that way, but
> decided I was too lazy. This could be done if anyone is worried that
> the above regexes might miss things though.)
>
> There were zero files that contained matches for any of the above regexes:
>
> https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/ufunc-cython-report.html
>
>
> Uses of PyUFunc_ReplaceLoopBySignature
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Applying the same screening as above, the only code that was found
> that used this function is also in Numba:
>
> https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/PyUFunc_ReplaceLoopBySignature-report.html
>
>
> Uses of PyUFunc_SetUsesArraysAsData
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Aside from being semi-broken since 1.7 (it never got implemented for
> "masked" ufunc loops, i.e. those that use where=), there appear to be
> zero uses of this functionality either inside or outside NumPy:
>
> https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/PyUFunc_SetUsesArraysAsData-report.html
>
>
> Rationale
> ---------
>
> **Rationale for preserving the remaining API functions**::
>
>     PyUFunc_GenericFunction
>
>     PyUFunc_FromFuncAndData
>     PyUFunc_FromFuncAndDataAndSignature
>     PyUFunc_RegisterLoopForDescr
>     PyUFunc_RegisterLoopForType
>
> In addition to being widely used, these functions can easily be
> preserved even if we change how ufuncs work internally, because they
> only ingest loop function pointers, they never return them. So they
> can easily be modified to wrap whatever loop function(s) they receive
> inside an adapter function that calls them at the appropriate time,
> and then register that adapter function using whatever API we add in
> the future.
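
A minimal sketch of that adapter idea, assuming a new-style signature like the illustrative one earlier. `DemoDescr`, `LegacyAdapterData`, `legacy_adapter` and `double_negate_loop` are hypothetical names, and `npy_intp` is a stand-in typedef.

```c
#include <stddef.h>

typedef ptrdiff_t npy_intp;                 /* stand-in for NumPy's npy_intp */
typedef struct { int elsize; } DemoDescr;   /* hypothetical dtype stand-in */

/* Legacy loop type and an assumed new-style, dtype-aware loop type. */
typedef void (*LegacyLoop)(char **, npy_intp *, npy_intp *, void *);
typedef void (*NewLoop)(char **, npy_intp *, npy_intp *, DemoDescr **, void *);

/* Record a registration shim would store for each wrapped legacy loop. */
typedef struct {
    LegacyLoop func;    /* the loop the user registered */
    void *data;         /* the user's original innerloopdata */
} LegacyAdapterData;

/* New-style loop that drops the dtypes and forwards to the legacy loop. */
static void legacy_adapter(char **data, npy_intp *shapes, npy_intp *strides,
                           DemoDescr **dtypes, void *innerloopdata)
{
    LegacyAdapterData *rec = (LegacyAdapterData *)innerloopdata;
    (void)dtypes;       /* legacy loops never see the dtypes */
    rec->func(data, shapes, strides, rec->data);
}

/* Example legacy loop to wrap: out[i] = -in[i] for doubles. */
static void double_negate_loop(char **args, npy_intp *dimensions,
                               npy_intp *strides, void *innerloopdata)
{
    char *in = args[0];
    char *out = args[1];
    npy_intp i;

    (void)innerloopdata;
    for (i = 0; i < dimensions[0]; i++) {
        *(double *)out = -*(const double *)in;
        in += strides[0];
        out += strides[1];
    }
}
```

Because the legacy function pointer only ever flows into the library, the shim can hide it inside the adapter record; a function that has to hand the pointer back out gets no such freedom.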
>
> **Rationale for preserving the particular fields that are preserved**:
> Preserving ``nargs`` lets us avoid a little bit of breakage with the
> rational dtype, and it doesn't seem like preserving ``nin``, ``nout``,
> ``nargs`` fields will produce any undue burden on future changes to
> ufunc internals; even if we were to introduce variadic ufuncs we could
> always just stick a -1 in the appropriate fields or whatever.
>
> **Rationale for removing PyUFunc_ReplaceLoopBySignature**: this
> function *returns* the PyUFunc_GenericFunction that was replaced; if
> we stop representing all loops using the legacy
> PyUFunc_GenericFunction type, then this will not be possible to do in
> the future.
>
> **Rationale for removing PyUFunc_SetUsesArraysAsData**: If set as the
> ``innerloopdata`` on a ufunc loop, then this function acts as a
> sentinel value, and causes the ``innerloopdata`` to instead be set to
> a pointer to the passed-in PyArrayObjects.  In principle we could
> preserve this function, but it has a number of deficiencies:
> - No-one appears to use it.
> - It's been buggy for several releases and no-one noticed.
> - AFAIK the only reason it was included in the first place is that it
> provides a backdoor for ufunc loops to get access to the dtypes -- but
> we are planning to fix this in a better way.
> - It can't be shimmed as easily as the loop registration functions,
> because we don't anticipate that the new-and-improved ufunc loop
> functions will *get* access to the array objects, only to the dtypes;
> so this would have to remain cluttering up the core dispatch path
> indefinitely.
> - We have good reason for *not* wanting to let ufunc loops get access
> to the actual array objects, because one of the goals on our roadmap
> is exactly to enable the use of ufuncs on non-ndarray objects. Giving
> ufuncs access to dtypes alone creates a clean boundary here: it
> guarantees that ufunc loops can work equally on all duck-array objects
> (so long as they have a dtype), and enforces the invariant that
> anything which affects the interpretation of data values should be
> attached to the dtype, not to the array object.
>
>
> Rejected alternatives
> ---------------------
>
> **Do nothing**: there's no way we'll ever be able to touch ufuncs at
> all if we don't hide those fields sooner or later. While any amount of
> breakage is regrettable, the costs of cleaning things up now are less
> than the costs of never improving numpy's APIs.
>
> **Somehow sneak the dtype information in via ``void
> *innerloopdata``**: This might let us preserve the signature of
> PyUFunc_GenericFunction, and thus preserve
> PyUFunc_ReplaceLoopBySignature. But we'd still have the problem of
> leaving way too much internal state exposed, and it's not even clear
> how this would work, given that we actually do want to preserve the
> use of ``innerloopdata`` for actual per-loop data. (This is where the
> PyUFunc_SetUsesArraysAsData hack fails.)
>
>
> --
> Nathaniel J. Smith -- http://vorpus.org



-- 
(\__/)
( O.o)
( > <) This is Conejo. Copy Conejo into your signature and help him with his
plans for world domination.