[Numpy-discussion] draft NEP for breaking ufunc ABI in a controlled way

Bryan Van de Ven bryanv at continuum.io
Tue Sep 22 00:47:45 EDT 2015


> 
> until then our only real options are either hard breaks or nothing, so
> unless we want to do a hard break there's not much point talking about
> it.

I think this is the most important sentence from this thread. Thank you Nathaniel for you extremely thorough analysis of the impact on real-world projects. 

Bryan 


> On Sep 22, 2015, at 12:23 AM, Nathaniel Smith <njs at pobox.com> wrote:
> 
> On Mon, Sep 21, 2015 at 7:29 AM, Jaime Fernández del Río
> <jaime.frio at gmail.com> wrote:
>> We have the PyArrayObject vs PyArrayObject_fields definition in
>> ndarraytypes.h that is used to enforce access to the members through inline
>> functions rather than directly,  which seems to me like the right way to go:
>> don't leave stones unturned, hide everything and provide PyUFunc_NIN,
>> PyUFunc_NOUT and friends to handle those too.
> 
> The PyArrayObject vs PyArrayObject_fields distinction is only enabled
> if a downstream library explicitly requests it with #define
> NPY_NO_DEPRECATED_API, though -- the idea is that the changes in this
> NEP would be enabled unconditionally, even for old code. So the reason
> nin/nout/nargs are left exposed in this proposal is that there's some
> existing code out there that would break (until updated) if we hid
> them, and not much benefit to breaking it.
> 
> If we're fine with breaking that code then we could just hide them
> unconditionally too. The only code I found in the wild that would be
> affected is the "rational" user-defined dtype, which would be
> trivially fixable since the only thing it does with ufunc->nargs is a
> quick consistency check:
> 
>  https://github.com/numpy/numpy-dtypes/blob/c0175a6b1c5aa89b4520b29487f06d0e200e2a03/npytypes/rational/rational.c#L1140-L1151
> 
> Also it's not 100% clear right now whether we even want to keep
> supporting the old user-defined dtype API that this particular code is
> based around. But if this code uses ufunc->nargs then perhaps other
> code does too? I'm open to opinions -- I doubt it matters that much
> either way. I just want to make sure that we can hide the other stuff
> :-).
> 
> When it comes to evolving these APIs in general: one unfortunate thing
> about the PyArrayObject changes in 1.7 is that because they were
> implemented using *inline* functions (/macros) they haven't affected
> the a*B*i exposure at all, even in code that has upgraded to the new
> calling conventions. While user code no longer *names* the internal
> fields directly, we still have to implement exactly the same fields
> and put them in exactly the same place in memory or else break ABI.
> And the other unfortunate thing is that we don't really have a
> mechanism for saying "okay, we're dropping support for the old way of
> doing things in 1.xx" -- in particular the current
> NPY_NO_DEPRECATED_API mechanism doesn't give us any way to detect and
> error out if someone tries to use an old version of the APIs, so ABI
> breaks still mean segfaults. I'm thinking that if/when we figure out
> how to implement the "sliding window" API/ABI idea that we talked
> about at SciPy, then that will give us a strategy for cleanly
> transitioning to a world with a maintainable API+ABI and it becomes
> worth sitting down and making up a set of setters/getters for the
> attributes that we want to make public in a maintainable way. But
> until then our only real options are either hard breaks or nothing, so
> unless we want to do a hard break there's not much point talking about
> it.
> 
> -n
> 
>> On Sun, Sep 20, 2015 at 9:13 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>> 
>>> Hi all,
>>> 
>>> Here's a first draft NEP for comments.
>>> 
>>> --
>>> 
>>> Synopsis
>>> ========
>>> 
>>> Improving numpy's dtype system requires that ufunc loops start having
>>> access to details of the specific dtype instance they are acting on:
>>> e.g. an implementation of np.equal for strings needs access to the
>>> dtype object in order to know what "n" to pass to strncmp. Similar
>>> issues arise with variable length strings, missing values, categorical
>>> data, unit support, datetime with timezone support, etc. -- this is a
>>> major blocker for improving numpy.
>>> 
>>> Unfortunately, the current ufunc inner loop function signature makes
>>> it very difficult to provide this information. We might be able to
>>> wedge it in there, but it'd be ugly.
>>> 
>>> The other option would be to change the signature. What would happen
>>> if we did this? For most common uses of the C API/ABI, we could do
>>> this easily while maintaining backwards compatibility. But there are
>>> also some rarely-used parts  of the API/ABI that would be
>>> prohibitively difficult to preserve.
>>> 
>>> In addition, there are other potential changes to ufuncs on the
>>> horizon (e.g. extensions of gufuncs to allow them to be used more
>>> generally), and the current API exposure is so massive that any such
>>> changes will be difficult to make in a fully compatible way. This NEP
>>> thus considers the possibility of closing down the ufunc API to a
>>> minimal, maintainable subset of the current API.
>>> 
>>> To better understand the consequences of this potential change, I
>>> performed an exhaustive analysis of all the code on Github, Bitbucket,
>>> and Fedora, among others. The results make me highly confident that of
>>> all the publically available projects in the world, the only ones
>>> which touch the problematic parts of the ufunc API are: Numba,
>>> dynd-python, and `gulinalg <https://github.com/ContinuumIO/gulinalg>`_
>>> (with the latter's exposure being trivial).
>>> 
>>> Given this, I propose that for 1.11 we:
>>> 1) go ahead and hide/disable the problematic parts of the ABI/API,
>>> 2) coordinate with the known affected projects to minimize disruption
>>> to their users (which is made easier since they are all projects that
>>> are almost exclusively distributed via conda, which enforces strict
>>> NumPy ABI versioning),
>>> 3) publicize these changes widely so as to give any private code that
>>> might be affected a chance to speak up or adapt, and
>>> 4) leave the "ABI version tag" as it is, so as not to force rebuilds
>>> of the vast majority of projects that will be unaffected by these
>>> changes.
>>> 
>>> This NEP defers the question of exactly what the improved API should
>>> be, since there's no point in trying to nail down the details until
>>> we've decided whether it's even possible to change.
>>> 
>>> 
>>> Details
>>> =======
>>> 
>>> The problem
>>> -----------
>>> 
>>> Currently, a ufunc inner loop implementation is called via the
>>> following function prototype::
>>> 
>>>    typedef void (*PyUFuncGenericFunction)
>>>                (char **args,
>>>                 npy_intp *dimensions,
>>>                 npy_intp *strides,
>>>                 void *innerloopdata);
>>> 
>>> Here ``args`` is an array of pointers to 1-d buffers of input/output
>>> data, ``dimensions`` is a pointer to the number of entries in these
>>> buffers, ``strides`` is an array of integers giving the strides for
>>> each input/output array, and ``innerloopdata`` is an arbitrary void*
>>> supplied by whoever registered the ufunc loop. (For gufuncs, extra
>>> shape and stride information about the core dimensions also gets
>>> packed into the ends of these arrays in a somewhat complicated way.)
>>> 
>>> There are 4 key items that define a NumPy array: data, shape, strides,
>>> dtype. Notice that this function only gets access to 3 of them. Our
>>> goal is to fix that. For example, a better signature would be::
>>> 
>>>    typedef void (*PyUFuncGenericFunction_NEW)
>>>                (char **data,
>>>                 npy_intp *shapes,
>>>                 npy_intp *strides,
>>>                 PyArray_Descr *dtypes,   /* NEW */
>>>                 void *innerloopdata);
>>> 
>>> (In practice I suspect we might want to make some more changes as
>>> well, like upgrading gufunc core shape/strides to proper arguments
>>> instead of tacking it onto the existing arrays, and adding an "escape
>>> valve" void* reserved for future extensions. But working out such
>>> details is outside the scope of this NEP; the above will do for
>>> illustration.)
>>> 
>>> The goal of this NEP is to clear the ground so that we can start
>>> supporting ufunc inner loops that take dtype arguments, and make other
>>> enhancements to ufunc functionality going forward.
>>> 
>>> 
>>> Proposal
>>> --------
>>> 
>>> Currently, the public API/ABI for ufuncs consists of the functions::
>>> 
>>>    PyUFunc_GenericFunction
>>> 
>>>    PyUFunc_FromFuncAndData
>>>    PyUFunc_FromFuncAndDataAndSignature
>>>    PyUFunc_RegisterLoopForDescr
>>>    PyUFunc_RegisterLoopForType
>>> 
>>>    PyUFunc_ReplaceLoopBySignature
>>>    PyUFunc_SetUsesArraysAsData
>>> 
>>> together with direct access to PyUFuncObject's internal fields::
>>> 
>>>    typedef struct {
>>>        PyObject_HEAD
>>>        int nin, nout, nargs;
>>>        int identity;
>>>        PyUFuncGenericFunction *functions;
>>>        void **data;
>>>        int ntypes;
>>>        int check_return;
>>>        const char *name;
>>>        char *types;
>>>        const char *doc;
>>>        void *ptr;
>>>        PyObject *obj;
>>>        PyObject *userloops;
>>>        int core_enabled;
>>>        int core_num_dim_ix;
>>>        int *core_num_dims;
>>>        int *core_dim_ixs;
>>>        int *core_offsets;
>>>        char *core_signature;
>>>        PyUFunc_TypeResolutionFunc *type_resolver;
>>>        PyUFunc_LegacyInnerLoopSelectionFunc *legacy_inner_loop_selector;
>>>        PyUFunc_InnerLoopSelectionFunc *inner_loop_selector;
>>>        PyUFunc_MaskedInnerLoopSelectionFunc *masked_inner_loop_selector;
>>>        npy_uint32 *op_flags;
>>>        npy_uint32 iter_flags;
>>>    } PyUFuncObject;
>>> 
>>> Obviously almost any future changes to how ufuncs work internally will
>>> involve touching some part of this public API/ABI.
>>> 
>>> Concretely, the proposal here is that we avoid this by disabling the
>>> following functions (i.e., any attempt to call them should simply
>>> raise a ``NotImplementedError``)::
>>> 
>>>    PyUFunc_ReplaceLoopBySignature
>>>    PyUFunc_SetUsesArraysAsData
>>> 
>>> and that we reduce the publicly visible portion of PyUFuncObject down to::
>>> 
>>>    typedef struct {
>>>        PyObject_HEAD
>>>        int nin, nout, nargs;
>>>    } PyUFuncObject;
>>> 
>>> 
>>> Data on current API/ABI usage
>>> -----------------------------
>>> 
>>> In order to assess how much code would be affected by this proposal, I
>>> used a combination of Github search and Searchcode.com to trawl
>>> through the majority of all publicly available open source code.
>>> Neither search tool provides a fine-grained enough query language to
>>> directly tell us what we want to know, so I instead followed the
>>> strategy of first, casting a wide net: picking a set of search terms
>>> that are likely to catch all possibly-broken code (together with many
>>> false positives), and second, using automated tools to sift out the
>>> false positives and see what remained. Altogether, I reviewed 4464
>>> search results.
>>> 
>>> The tool I wrote to do this is `available on github
>>> <https://github.com/njsmith/codetrawl>`_, and so is `the analysis code
>>> itself <https://github.com/njsmith/ufunc-abi-analysis>`_.
>>> 
>>> 
>>> Uses of PyUFuncObject internals
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> 
>>> There are no functions in the public API which return
>>> ``PyUFuncObject*`` values directly, so any code that access
>>> PyUFuncObject fields will have to mention that token in the course of
>>> defining a variable, performing a cast, setting up a typedef, etc.
>>> 
>>> Therefore, I searched Github for all files written in C, C++,
>>> Objective C, Python, or Cython, which mentioned either "PyUFuncObject
>>> AND types" or "PyUFuncObject AND NOT types". (This is to work around
>>> limitations on how many results Github search is willing to return to
>>> a single query.) In addition, I searched for ``PyUFuncObject`` on
>>> searchcode.com.
>>> 
>>> The full report on these searches is available here:
>>> 
>>> https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/pyufuncobject-report.html
>>> 
>>> The following were screened out as non-problems:
>>> 
>>> - Copies of NumPy itself (an astonishing number of people have checked
>>> in copies of it to their own source tree)
>>> - NumPy forks / precursors / etc. (e.g. Numeric also had a type called
>>> PyUFuncObject, the "bohrium" project has a fork of numpy 1.6, etc.)
>>> - Cython-generated boilerplate used to generate the "object has
>>> changed size" warning (which we `unconditionally filter out anyway
>>> <https://github.com/numpy/numpy/blob/master/numpy/__init__.py#L226>`_)
>>> - Lots of calls to ``PyUFunc_RegisterLoopForType`` and friends, which
>>> require casting the first argument to ``PyUFuncObject*``
>>> - Misc. other unproblematic stuff (like Cython header declarations
>>> that never get used)
>>> 
>>> There were also several cases that actually referenced PyUFuncObject
>>> internal fields:
>>> 
>>> - The "rational" dtype from numpy-dtypes, which is used in a few
>>> projects, accesses ``ufunc->nargs`` as a safety check, but does not
>>> touch any other fields (`see here
>>> 
>>> <https://github.com/numpy/numpy-dtypes/blob/c0175a6b1c5aa89b4520b29487f06d0e200e2a03/npytypes/rational/rational.c#L1140-L1151>`_).
>>> 
>>> - Numba: does some rather elaborate things to support the definition
>>> of on-the-fly JITted ufuncs. These seem to be clear deficiencies in
>>> the ufunc API (e.g., there's no good way to control the lifespan of
>>> the array of function pointers passed to ``PyUFunc_FromFuncAndData``),
>>> so we should work with them to provide the API they need to do this in
>>> a maintainable way. Some of the relevant code:
>>> 
>>>  https://github.com/numba/numba/tree/master/numba/npyufunc
>>> 
>>> https://github.com/numba/numba/blob/98752647a95ac6c9d480e81ca5c8afcfa3ddfd18/numba/npyufunc/_internal.c
>>> 
>>> - dynd-python: Contains some code that attempts to extract the inner
>>> loops from a numpy ufunc object and wrap them into the dynd 'ufunc'
>>> equivalent:
>>> 
>>> https://github.com/libdynd/dynd-python/blob/c06f8fc4e72257abac589faf76f10df8c045159b/dynd/src/numpy_ufunc_kernel.cpp
>>> 
>>> - gulinalg: I'm not sure if anyone is still using this code since most
>>> of it was merged into numpy itself, but it's not a big deal in any
>>> case: all it contains is a `debugging function
>>> 
>>> <https://github.com/ContinuumIO/gulinalg/blob/2ef365c48427c026dab4f45dc6f8b1b9af184460/gulinalg/src/gulinalg.c.src#L527-L550>`_
>>> that dumps some internal fields from the PyUFuncObject. If you look,
>>> though, all calls to this function are already commented out :-).
>>> 
>>> The full report is available here:
>>> 
>>> https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/pyufuncobject-report.html
>>> 
>>> In the course of this analysis, it was also noted that the standard
>>> Cython pxd files contain a wrapper for ufunc objects::
>>> 
>>>    cdef class ufunc [object PyUFuncObject]:
>>>        ...
>>> 
>>> which means that Cython code can access internal struct fields via an
>>> object of type ``ufunc``, and thus escape our string-based search
>>> above. Therefore I also examined all Cython files on Github or
>>> searchcode.com that matched the query ``ufunc``, and searched for any
>>> lines matching any of the following regular expressions::
>>> 
>>>    cdef\W+ufunc
>>>       catches: 'cdef ufunc fn'
>>>    cdef\W+.*\.\W*ufunc
>>>       catches: 'cdef np.ufunc fn'
>>>    <.*ufunc\W*>
>>>       catches: '(<ufunc> fn).nargs', '(< np.ufunc > fn).nargs'
>>>    cdef.*\(.*ufunc
>>>       catches: 'cdef doit(np.ufunc fn, ...):'
>>> 
>>> (I considered parsing the actual source and analysing it that way, but
>>> decided I was too lazy. This could be done if anyone is worried that
>>> the above regexes might miss things though.)
>>> 
>>> There were zero files that contained matches for any of the above regexes:
>>> 
>>> https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/ufunc-cython-report.html
>>> 
>>> 
>>> Uses of PyUFunc_ReplaceLoopBySignature
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> 
>>> Applying the same screening as above, the only code that was found
>>> that used this function is also in Numba:
>>> 
>>> https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/PyUFunc_ReplaceLoopBySignature-report.html
>>> 
>>> 
>>> Uses of PyUFunc_SetUsesArraysAsData
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> 
>>> Aside from being semi-broken since 1.7 (it never got implemented for
>>> "masked" ufunc loops, i.e. those that use where=), there appear to be
>>> zero uses of this functionality either inside or outside NumPy:
>>> 
>>> https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/PyUFunc_SetUsesArraysAsData-report.html
>>> 
>>> 
>>> Rationale
>>> ---------
>>> 
>>> **Rationale for preserving the remaining API functions**::
>>> 
>>>    PyUFunc_GenericFunction
>>> 
>>>    PyUFunc_FromFuncAndData
>>>    PyUFunc_FromFuncAndDataAndSignature
>>>    PyUFunc_RegisterLoopForDescr
>>>    PyUFunc_RegisterLoopForType
>>> 
>>> In addition to being widely used, these functions can easily be
>>> preserved even if we change how ufuncs work internally, because they
>>> only ingest loop function pointers, they never return them. So they
>>> can easily be modified to wrap whatever loop function(s) they receive
>>> inside an adapter function that calls them at the appropriate time,
>>> and then register that adapter function using whatever API we add in
>>> the future.
>>> 
>>> **Rationale for preserving the particular fields that are preserved**:
>>> Preserving ``nargs`` lets us avoid a little bit of breakage with the
>>> random dtype, and it doesn't seem like preserving ``nin``, ``nout``,
>>> ``nargs`` fields will produce any undue burden on future changes to
>>> ufunc internals; even if we were to introduce variadic ufuncs we could
>>> always just stick a -1 in the appropriate fields or whatever.
>>> 
>>> **Rationale for removing PyUFunc_ReplaceLoopBySignature**: this
>>> function *returns* the PyUFunc_GenericFunction that was replaced; if
>>> we stop representing all loops using the legacy
>>> PyUFunc_GenericFunction type, then this will not be possible to do in
>>> the future.
>>> 
>>> **Rationale for removing PyUFunc_SetUsesArraysAsData**: If set as the
>>> ``innerloopdata`` on a ufunc loop, then this function acts as a
>>> sentinel value, and causes the ``innerloopdata`` to instead be set to
>>> a pointer to the passed-in PyArrayObjects.  In principle we could
>>> preserve this function, but it has a number of deficiencies:
>>> - No-one appears to use it.
>>> - It's been buggy for several releases and no-one noticed.
>>> - AFAIK the only reason it was included in the first place is that it
>>> provides a backdoor for ufunc loops to get access to the dtypes -- but
>>> we are planning to fix this in a better way.
>>> - It can't be shimmed as easily as the loop registration functions,
>>> because we don't anticipate that the new-and-improved ufunc loop
>>> functions will *get* access to the array objects, only to the dtypes;
>>> so this would have to remain cluttering up the core dispatch path
>>> indefinitely.
>>> - We have good reason for *not* wanting to get ufunc loops get access
>>> to the actual array objects, because one of the goals on our roadmap
>>> is exactly to enable the use of ufuncs on non-ndarray objects. Giving
>>> ufuncs access to dtypes alone creates a clean boundary here: it
>>> guarantees that ufunc loops can work equally on all duck-array objects
>>> (so long as they have a dtype), and enforces the invariant that
>>> anything which affects the interpretation of data values should be
>>> attached to the dtype, not to the array object.
>>> 
>>> 
>>> Rejected alternatives
>>> ---------------------
>>> 
>>> **Do nothing**: there's no way we'll ever be able to touch ufuncs at
>>> all if we don't hide those fields sooner or later. While any amount of
>>> breakage is regrettable, the costs of cleaning things up now are less
>>> than the costs of never improving numpy's APIs.
>>> 
>>> **Somehow sneak the dtype information in via ``void
>>> *innerloopdata``**: This might let us preserve the signature of
>>> PyUFunc_GenericFunction, and thus preserve
>>> PyUFunc_ReplaceLoopBySignature. But we'd still have the problem of
>>> leaving way too much internal state exposed, and it's not even clear
>>> how this would work, given that we actually do want to preserve the
>>> use of ``innerloopdata`` for actual per-loop data. (This is where the
>>> PyUFunc_SetUsesArraysAsData hack fails.)
>>> 
>>> 
>>> --
>>> Nathaniel J. Smith -- http://vorpus.org
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion at scipy.org
>>> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>> 
>> 
>> 
>> 
>> --
>> (\__/)
>> ( O.o)
>> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de
>> dominación mundial.
>> 
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>> 
> 
> 
> 
> -- 
> Nathaniel J. Smith -- http://vorpus.org
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion




More information about the NumPy-Discussion mailing list