NEP 37: A dispatch protocol for NumPy-like modules
I am pleased to present a new NumPy Enhancement Proposal for discussion: "NEP-37: A dispatch protocol for NumPy-like modules." Feedback would be very welcome!

The full text follows. The rendered proposal can also be found online at https://numpy.org/neps/nep-0037-array-module.html

Best,
Stephan Hoyer

===================================================
NEP 37 — A dispatch protocol for NumPy-like modules
===================================================

:Author: Stephan Hoyer <shoyer@google.com>
:Author: Hameer Abbasi
:Author: Sebastian Berg
:Status: Draft
:Type: Standards Track
:Created: 2019-12-29

Abstract
--------

NEP-18's ``__array_function__`` has been a mixed success. Some projects (e.g.,
dask, CuPy, xarray, sparse, Pint) have enthusiastically adopted it. Others
(e.g., PyTorch, JAX, SciPy) have been more reluctant. Here we propose a new
protocol, ``__array_module__``, that we expect could eventually subsume most
use-cases for ``__array_function__``. The protocol requires explicit adoption
by both users and library authors, which ensures backwards compatibility, and
is also significantly simpler than ``__array_function__``, both of which we
expect will make it easier to adopt.

Why ``__array_function__`` hasn't been enough
---------------------------------------------

There are two broad ways in which NEP-18 has fallen short of its goals:

1. **Maintainability concerns**. ``__array_function__`` has significant
   implications for libraries that use it:

   - Projects like `PyTorch <https://github.com/pytorch/pytorch/issues/22402>`_,
     `JAX <https://github.com/google/jax/issues/1565>`_ and even `scipy.sparse
     <https://github.com/scipy/scipy/issues/10362>`_ have been reluctant to
     implement ``__array_function__`` in part because they are concerned about
     **breaking existing code**: users expect NumPy functions like
     ``np.concatenate`` to return NumPy arrays. This is a fundamental
     limitation of the ``__array_function__`` design, which we chose in order
     to allow overriding the existing ``numpy`` namespace.
   - ``__array_function__`` currently requires an "all or nothing" approach to
     implementing NumPy's API. There is no good pathway for **incremental
     adoption**, which is particularly problematic for established projects
     for which adopting ``__array_function__`` would result in breaking
     changes.
   - It is no longer possible to use **aliases to NumPy functions** within
     modules that support overrides. For example, both CuPy and JAX set
     ``result_type = np.result_type``.
   - Implementing **fall-back mechanisms** for unimplemented NumPy functions
     by using NumPy's implementation is hard to get right (but see the
     `version from dask <https://github.com/dask/dask/pull/5043>`_), because
     ``__array_function__`` does not present a consistent interface.
     Converting all arguments of array type requires recursing into generic
     arguments of the form ``*args, **kwargs``.

2. **Limitations on what can be overridden.** ``__array_function__`` has some
   important gaps, most notably array creation and coercion functions:

   - **Array creation** routines (e.g., ``np.arange`` and those in
     ``np.random``) need some other mechanism for indicating what type of
     arrays to create. `NEP 36 <https://github.com/numpy/numpy/pull/14715>`_
     proposed adding optional ``like=`` arguments to functions without
     existing array arguments. However, we still lack any mechanism to
     override methods on objects, such as those needed by
     ``np.random.RandomState``.
   - **Array conversion** can't reuse the existing coercion functions like
     ``np.asarray``, because ``np.asarray`` sometimes means "convert to an
     exact ``np.ndarray``" and other times means "convert to something *like*
     a NumPy array." This led to the `NEP 30
     <https://numpy.org/neps/nep-0030-duck-array-protocol.html>`_ proposal for
     a separate ``np.duckarray`` function, but this still does not resolve how
     to cast one duck array into a type matching another duck array.

``get_array_module`` and the ``__array_module__`` protocol
----------------------------------------------------------

We propose a new user-facing mechanism for dispatching to a duck-array
implementation, ``numpy.get_array_module``. ``get_array_module`` performs the
same type resolution as ``__array_function__`` and returns a module with an
API promised to match the standard interface of ``numpy`` that can implement
operations on all provided array types.

The protocol itself is both simpler and more powerful than
``__array_function__``, because it doesn't need to worry about actually
implementing functions. We believe it resolves most of the maintainability and
functionality limitations of ``__array_function__``.

The new protocol is opt-in, explicit and with local control; see
:ref:`appendix-design-choices` for discussion on the importance of these
design features.

The array module contract
=========================

Modules returned by ``get_array_module``/``__array_module__`` should make a
best effort to implement NumPy's core functionality on new array type(s).
Unimplemented functionality should simply be omitted (e.g., accessing an
unimplemented function should raise ``AttributeError``). In the future, we
anticipate codifying a protocol for requesting restricted subsets of
``numpy``; see :ref:`requesting-restricted-subsets` for more details.

How to use ``get_array_module``
===============================

Code that wants to support generic duck arrays should explicitly call
``get_array_module`` to determine an appropriate array module from which to
call functions, rather than using the ``numpy`` namespace directly. For
example:

.. code:: python

    # calls the appropriate version of np.something for x and y
    module = np.get_array_module(x, y)
    module.something(x, y)

Both array creation and array conversion are supported, because dispatching is
handled by ``get_array_module`` rather than via the types of function
arguments. For example, to use random number generation functions or methods,
we can simply pull out the appropriate submodule:

.. code:: python

    def duckarray_add_random(array):
        module = np.get_array_module(array)
        noise = module.random.randn(*array.shape)
        return array + noise

We can also write the duck-array ``stack`` function from `NEP 30
<https://numpy.org/neps/nep-0030-duck-array-protocol.html>`_, without the need
for a new ``np.duckarray`` function:

.. code:: python

    def duckarray_stack(arrays):
        module = np.get_array_module(*arrays)
        arrays = [module.asarray(arr) for arr in arrays]
        shapes = {arr.shape for arr in arrays}
        if len(shapes) != 1:
            raise ValueError('all input arrays must have the same shape')
        expanded_arrays = [arr[module.newaxis, ...] for arr in arrays]
        return module.concatenate(expanded_arrays, axis=0)

By default, ``get_array_module`` will return the ``numpy`` module if no
arguments are arrays. This fall-back can be explicitly controlled by providing
the ``default`` keyword-only argument. It is also possible to indicate that an
exception should be raised instead of returning a default array module by
setting ``default=None``.
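As a usage sketch (``get_array_module`` is the API proposed here, not yet part
of NumPy):

.. code:: python

    module = np.get_array_module(x, y)                # numpy, if neither x nor
                                                      # y implements the protocol
    module = np.get_array_module(x, y, default=None)  # raise TypeError instead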
How to implement ``__array_module__``
=====================================

Libraries implementing a duck array type that want to support
``get_array_module`` need to implement the corresponding protocol,
``__array_module__``. This new protocol is based on Python's dispatch protocol
for arithmetic, and is essentially a simpler version of
``__array_function__``.

Only one argument is passed into ``__array_module__``, a Python collection of
unique array types passed into ``get_array_module``, i.e., all arguments with
an ``__array_module__`` attribute.

The special method should either return a namespace with an API matching
``numpy``, or ``NotImplemented``, indicating that it does not know how to
handle the operation:

.. code:: python

    class MyArray:
        def __array_module__(self, types):
            if not all(issubclass(t, MyArray) for t in types):
                return NotImplemented
            return my_array_module

Returning custom objects from ``__array_module__``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``my_array_module`` will typically, but need not always, be a Python module.
Returning custom objects (e.g., with functions implemented via
``__getattr__``) may be useful for some advanced use cases.

For example, custom objects could allow for partial implementations of duck
array modules that fall back to NumPy (although this is not recommended in
general because such fall-back behavior can be error prone):

.. code:: python

    class MyArray:
        def __array_module__(self, types):
            if all(issubclass(t, MyArray) for t in types):
                return ArrayModule()
            else:
                return NotImplemented

    class ArrayModule:
        def __getattr__(self, name):
            import base_module
            return getattr(base_module, name, getattr(numpy, name))

Subclassing from ``numpy.ndarray``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All of the same guidance about well-defined type casting hierarchies from
NEP-18 still applies. ``numpy.ndarray`` itself contains a matching
implementation of ``__array_module__``, which is convenient for subclasses:

.. code:: python

    class ndarray:
        def __array_module__(self, types):
            if all(issubclass(t, ndarray) for t in types):
                return numpy
            else:
                return NotImplemented
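As a consequence, plain NumPy arrays (and well-behaved subclasses) would
resolve to ``numpy`` itself, as in this usage sketch of the proposed API:

.. code:: python

    import numpy as np

    x = np.arange(3)
    # ndarray.__array_module__ returns the numpy module
    assert np.get_array_module(x) is np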
NumPy's internal machinery
==========================

The type resolution rules of ``get_array_module`` follow the same model as
Python and NumPy's existing dispatch protocols: subclasses are called before
super-classes, and otherwise left to right. ``__array_module__`` is guaranteed
to be called only a single time on each unique type.

The actual implementation of ``get_array_module`` will be in C, but should be
equivalent to this Python code:

.. code:: python

    def get_array_module(*arrays, default=numpy):
        implementing_arrays, types = _implementing_arrays_and_types(arrays)
        if not implementing_arrays and default is not None:
            return default
        for array in implementing_arrays:
            module = array.__array_module__(types)
            if module is not NotImplemented:
                return module
        raise TypeError("no common array module found")

    def _implementing_arrays_and_types(relevant_arrays):
        types = []
        implementing_arrays = []
        for array in relevant_arrays:
            t = type(array)
            if t not in types and hasattr(t, '__array_module__'):
                types.append(t)
                # Subclasses before superclasses, otherwise left to right
                index = len(implementing_arrays)
                for i, old_array in enumerate(implementing_arrays):
                    if issubclass(t, type(old_array)):
                        index = i
                        break
                implementing_arrays.insert(index, array)
        return implementing_arrays, types

Relationship with ``__array_ufunc__`` and ``__array_function__``
----------------------------------------------------------------

These older protocols have distinct use-cases and should remain
===============================================================

``__array_module__`` is intended to resolve limitations of
``__array_function__``, so it is natural to consider whether it could entirely
replace ``__array_function__``. This would offer dual benefits: (1)
simplifying the user-story about how to override NumPy and (2) removing the
slowdown associated with checking for dispatch when calling every NumPy
function.

However, ``__array_module__`` and ``__array_function__`` are pretty different
from a user perspective: ``__array_module__`` requires explicit calls to
``get_array_module``, rather than simply reusing original ``numpy`` functions.
This is probably fine for *libraries* that rely on duck-arrays, but may be
frustratingly verbose for interactive use.

Some of the dispatching use-cases for ``__array_ufunc__`` are also solved by
``__array_module__``, but not all of them. For example, it is still useful to
be able to define non-NumPy ufuncs (e.g., from Numba or SciPy) in a generic
way on non-NumPy arrays (e.g., with dask.array).
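To make that remaining ``__array_ufunc__`` use case concrete, here is a small
example (it assumes dask and scipy are installed; ``scipy.special.erf`` is a
ufunc defined outside NumPy):

.. code:: python

    import dask.array as da
    import scipy.special

    x = da.random.random((1000,), chunks=(100,))
    # dispatches through dask's __array_ufunc__, not __array_module__;
    # the result is still a lazy dask array
    y = scipy.special.erf(x)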
Given their existing adoption and distinct use cases, we don't think it makes
sense to remove or deprecate ``__array_function__`` and ``__array_ufunc__`` at
this time.

Mixin classes to implement ``__array_function__`` and ``__array_ufunc__``
=========================================================================

Despite the user-facing differences, ``__array_module__`` and a module
implementing NumPy's API still contain sufficient functionality needed to
implement dispatching with the existing duck array protocols.

For example, the following mixin classes would provide sensible defaults for
these special methods in terms of ``get_array_module`` and
``__array_module__``:

.. code:: python

    class ArrayUfuncFromModuleMixin:

        def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
            arrays = inputs + kwargs.get('out', ())
            try:
                array_module = np.get_array_module(*arrays)
            except TypeError:
                return NotImplemented

            try:
                # Note this may have false positive matches, if ufunc.__name__
                # matches the name of a ufunc defined by NumPy. Unfortunately
                # there is no way to determine in which module a ufunc was
                # defined.
                new_ufunc = getattr(array_module, ufunc.__name__)
            except AttributeError:
                return NotImplemented

            try:
                callable = getattr(new_ufunc, method)
            except AttributeError:
                return NotImplemented

            return callable(*inputs, **kwargs)

    class ArrayFunctionFromModuleMixin:

        def __array_function__(self, func, types, args, kwargs):
            array_module = self.__array_module__(types)
            if array_module is NotImplemented:
                return NotImplemented

            # Traverse submodules to find the appropriate function
            modules = func.__module__.split('.')
            assert modules[0] == 'numpy'
            module = array_module
            for submodule in modules[1:]:
                module = getattr(module, submodule, None)
            new_func = getattr(module, func.__name__, None)
            if new_func is None:
                return NotImplemented

            return new_func(*args, **kwargs)

To make it easier to write duck arrays, we could also add these mixin classes
into ``numpy.lib.mixins`` (but the examples above may suffice).

Alternatives considered
-----------------------

Naming
======

We like the name ``__array_module__`` because it mirrors the existing
``__array_function__`` and ``__array_ufunc__`` protocols. Another reasonable
choice could be ``__array_namespace__``.

It is less clear what the NumPy function that calls this protocol should be
called (``get_array_module`` in this proposal). Some possible alternatives:
``array_module``, ``common_array_module``, ``resolve_array_module``,
``get_namespace``, ``get_numpy``, ``get_numpylike_module``,
``get_duck_array_module``.

.. _requesting-restricted-subsets:

Requesting restricted subsets of NumPy's API
============================================

Over time, NumPy has accumulated a very large API surface, with over 600
attributes in the top level ``numpy`` module alone. It is unlikely that any
duck array library could or would want to implement all of these functions and
classes, because the frequently used subset of NumPy is much smaller.

We think it would be a useful exercise to define "minimal" subset(s) of
NumPy's API, omitting rarely used or non-recommended functionality. For
example, minimal NumPy might include ``stack``, but not the other stacking
functions ``column_stack``, ``dstack``, ``hstack`` and ``vstack``. This could
clearly indicate to duck array authors and users what functionality is core
and what functionality they can skip.

Support for requesting a restricted subset of NumPy's API would be a natural
feature to include in ``get_array_module`` and ``__array_module__``, e.g.,

.. code:: python

    # array_module is only guaranteed to contain "minimal" NumPy
    array_module = np.get_array_module(*arrays, request='minimal')

To facilitate testing with NumPy and use with any valid duck array library,
NumPy itself would return restricted versions of the ``numpy`` module when
``get_array_module`` is called only on NumPy arrays. Omitted functions would
simply not exist.

Unfortunately, we have not yet figured out what these restricted subsets
should be, so it doesn't make sense to do this yet. When/if we do, we could
either add new keyword arguments to ``get_array_module`` or add new top level
functions, e.g., ``get_minimal_array_module``. We would also need to add
either a new protocol patterned off of ``__array_module__`` (e.g.,
``__array_module_minimal__``), or could add an optional second argument to
``__array_module__`` (catching errors with ``try``/``except``).
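That last option could look something like the following sketch; the
``request='minimal'`` argument is purely illustrative, not part of the
concrete proposal:

.. code:: python

    def request_minimal_module(array, types):
        try:
            # implementations that accept the optional second argument
            return array.__array_module__(types, request='minimal')
        except TypeError:
            # older implementations with the one-argument signature
            return array.__array_module__(types)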
A new namespace for implicit dispatch
=====================================

Instead of supporting overrides in the main ``numpy`` namespace with
``__array_function__``, we could create a new opt-in namespace, e.g.,
``numpy.api``, with versions of NumPy functions that support dispatching.
These overrides would need new opt-in protocols, e.g.,
``__array_function_api__`` patterned off of ``__array_function__``.

This would resolve the biggest limitations of ``__array_function__`` by being
opt-in and would also allow for unambiguously overriding functions like
``asarray``, because ``np.api.asarray`` would always mean "convert an
array-like object." But it wouldn't solve all the dispatching needs met by
``__array_module__``, and would leave us with supporting a considerably more
complex protocol both for array users and implementors.

We could potentially implement such a new namespace *via* the
``__array_module__`` protocol. Certainly some users would find this
convenient, because it is slightly less boilerplate. But this would leave
users with a confusing choice: when should they use ``get_array_module`` vs.
``np.api.something``? Also, we would have to add and maintain a whole new
module, which is considerably more expensive than merely adding a function.

Dispatching on both types and arrays instead of only types
==========================================================

Instead of supporting dispatch only via unique array types, we could also
support dispatch via array objects, e.g., by passing an ``arrays`` argument as
part of the ``__array_module__`` protocol. This could potentially be useful
for dispatch for arrays with metadata, such as those provided by Dask and
Pint, but would impose costs in terms of type safety and complexity.

For example, a library that supports arrays on both CPUs and GPUs might decide
on which device to create new arrays in functions like ``ones`` based on input
arguments:

.. code:: python

    class Array:
        def __array_module__(self, types, arrays):
            useful_arrays = tuple(a for a in arrays if isinstance(a, Array))
            if not useful_arrays:
                return NotImplemented
            prefer_gpu = any(a.prefer_gpu for a in useful_arrays)
            return ArrayModule(prefer_gpu)

    class ArrayModule:
        def __init__(self, prefer_gpu):
            self.prefer_gpu = prefer_gpu

        def __getattr__(self, name):
            import base_module
            base_func = getattr(base_module, name)
            return functools.partial(base_func, prefer_gpu=self.prefer_gpu)

This might be useful, but it's not clear if we really need it. Pint seems to
get along OK without any explicit array creation routines (favoring
multiplication by units, e.g., ``np.ones(5) * ureg.m``), and for the most part
Dask is also OK with existing ``__array_function__`` style overrides (e.g.,
favoring ``np.ones_like`` over ``np.ones``). Choosing whether to place an
array on the CPU or GPU could be solved by `making array creation lazy
<https://github.com/google/jax/pull/1668>`_.

.. _appendix-design-choices:

Appendix: design choices for API overrides
------------------------------------------

There is a large range of possible design choices for overriding NumPy's API.
Here we discuss three major axes of the design decision that guided our design
for ``__array_module__``.

Opt-in vs. opt-out for users
============================

The ``__array_ufunc__`` and ``__array_function__`` protocols provide a
mechanism for overriding NumPy functions *within NumPy's existing namespace*.
This means that users need to explicitly opt-out if they do not want any
overridden behavior, e.g., by casting arrays with ``np.asarray()``.
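For example, a library function written in today's defensive style effectively
opts out of duck array support:

.. code:: python

    def library_function(x):
        # coerces any duck array to a plain np.ndarray, opting out of
        # __array_function__ overrides for everything that follows
        x = np.asarray(x)
        return x + 1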
In theory, this approach lowers the barrier for adopting these protocols in
user code and libraries, because code that uses the standard NumPy namespace
is automatically compatible. But in practice, this hasn't worked out. For
example, most well-maintained libraries that use NumPy follow the best
practice of casting all inputs with ``np.asarray()``, which they would have to
explicitly relax to use ``__array_function__``. Our experience has been that
making a library compatible with a new duck array type typically requires at
least a small amount of work to accommodate differences in the data model and
operations that can be implemented efficiently.

These opt-out approaches also considerably complicate backwards compatibility
for libraries that adopt these protocols, because by opting in as a library
they also opt-in their users, whether they expect it or not. For winning over
libraries that have been unable to adopt ``__array_function__``, an opt-in
approach seems like a must.

Explicit vs. implicit choice of implementation
==============================================

Both ``__array_ufunc__`` and ``__array_function__`` have implicit control over
dispatching: the dispatched functions are determined via the appropriate
protocols in every function call. This generalizes well to handling many
different types of objects, as evidenced by its use for implementing
arithmetic operators in Python, but it has two downsides:

1. *Speed*: it imposes additional overhead in every function call, because
   each function call needs to inspect each of its arguments for overrides.
   This is why arithmetic on builtin Python numbers is slow.
2. *Readability*: it is no longer immediately evident to readers of code what
   happens when a function is called, because the function's implementation
   could be overridden by any of its arguments.

In contrast, importing a new library (e.g., ``import dask.array as da``) with
an API matching NumPy is entirely explicit. There is no overhead from dispatch
or ambiguity about which implementation is being used.

Explicit and implicit choice of implementations are not mutually exclusive
options. Indeed, most implementations of NumPy API overrides via
``__array_function__`` that we are familiar with (namely, dask, CuPy and
sparse, but not Pint) also include an explicit way to use their version of
NumPy's API by importing a module directly (``dask.array``, ``cupy`` or
``sparse``, respectively).

Local vs. non-local vs. global control
======================================

The final design axis is how users control the choice of API:

- **Local control**, as exemplified by multiple dispatch and Python protocols
  for arithmetic, determines which implementation to use either by checking
  types or calling methods on the direct arguments of a function.
- **Non-local control** such as `np.errstate
  <https://docs.scipy.org/doc/numpy/reference/generated/numpy.errstate.html>`_
  overrides behavior with global-state via function decorators or
  context-managers. Control is determined hierarchically, via the inner-most
  context (see the example below).
- **Global control** provides a mechanism for users to set default behavior,
  either via function calls or configuration files. For example, matplotlib
  allows setting a global choice of plotting backend.

Local control is generally considered a best practice for API design, because
control flow is entirely explicit, which makes it the easiest to understand.
Non-local and global control are occasionally used, but generally either due
to ignorance or a lack of better alternatives.
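As a concrete illustration of non-local control, ``np.errstate`` changes
floating-point error handling for everything evaluated inside its context:

.. code:: python

    import numpy as np

    with np.errstate(divide='ignore'):
        # no divide-by-zero warning inside this context, regardless of
        # which function or library performs the division
        result = np.array([1.0, 2.0]) / 0.0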
In the case of duck typing for NumPy's public API, we think non-local or
global control would be mistakes, mostly because they **don't compose well**.
If one library sets/needs one set of overrides and then internally calls a
routine that expects another set of overrides, the resulting behavior may be
very surprising. Higher order functions are especially problematic, because
the context in which functions are evaluated may not be the context in which
they are defined.

One class of override use cases where we think non-local and global control
are appropriate is for choosing a backend system that is guaranteed to have an
entirely consistent interface, such as a faster alternative implementation of
``numpy.fft`` on NumPy arrays. However, these are out of scope for the current
proposal, which is focused on duck arrays.
Thanks, maybe to start discussion floating the actual usage here:

```
def add_noise(array_like):
    module = np.get_array_module(array_like)
    noise = module.random.randn(*array_like.shape)
    return array_like + noise
```

The above function could also include `module.asarray(array_like)` to support
non-array inputs. Importantly, the random functions, and especially array
creation functions such as `empty` and `ones`, can work.

To summarize, I think there are two main things that this NEP can address:

1. Some libraries are reluctant to adopt `__array_function__`, but they could
   adopt this NEP.
2. Libraries written for numpy (scipy, sklearn, etc.) often use `np.asarray`,
   and `__array_function__` does not help them easily. This NEP hopefully
   gives them a way forward.

We may need to prototype some examples, but right now it feels like this
should be a step forward, especially for libraries. Of course there are other
similar design options, so discussions (or criticism of this idea) are
welcome.

I believe this can help libraries, e.g., if skimage only feels confident that
they support Dask, they can still do:

```
module = np.get_array_module(*input_arrays)
if module not in {np, dask.numpy_api}:
    raise TypeError("This function only supports numpy and Dask.")
```

(`dask.numpy_api` here is a placeholder for whatever module Dask would return
from `__array_module__`.) I do not think this is as cleanly possible with
`__array_function__`.

Best,
Sebastian
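A sketch of the `module.asarray` variant mentioned above (still assuming the
proposed `np.get_array_module`):

```
def add_noise(array_like):
    module = np.get_array_module(array_like)
    array = module.asarray(array_like)  # also accepts lists, scalars, etc.
    noise = module.random.randn(*array.shape)
    return array + noise
```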
A bit late to the NEP 37 party. I just wanted to say that at least from my
perspective it seems a great solution that will help sklearn move towards more
flexible compute engines. I think one of the biggest issues is array creation
(including random arrays), and that's handled quite nicely with NEP 37.

There's some discussion on the scikit-learn side here:
https://github.com/scikit-learn/scikit-learn/pull/14963
https://github.com/scikit-learn/scikit-learn/issues/11447

Two different groups of people tried to use __array_function__ to delegate to
MxNet and CuPy respectively in scikit-learn, and ran into the same issues.

There are some remaining issues in sklearn that will not be handled by NEP 37,
but they go beyond NumPy in some sense. Just to briefly bring them up:

- We use scipy.linalg in many places, and we would need to do a separate
  dispatch to check whether we can use module.linalg instead (that might be an
  issue for many libraries, but I'm not sure; see the sketch after this
  message).
- Some models have several possible optimization algorithms, some of which are
  pure numpy and some of which are Cython. If someone provides a different
  array module, we might want to choose an algorithm that is actually
  supported by that module. While this exact issue is maybe sklearn specific,
  a similar issue could appear for most downstream libs that use Cython in
  some places. Many Cython algorithms could be implemented in pure numpy with
  a potential slowdown, but once we have NEP 37 there might be a benefit to
  having a pure NumPy implementation as an alternative code path.

Anyway, NEP 37 seems a great step in the right direction and would enable
sklearn to actually dispatch in some places. Dispatching just based on
__array_function__ seems not really feasible so far.

Best,
Andreas Mueller
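A minimal sketch of that separate scipy.linalg dispatch, assuming NEP 37's
`get_array_module`; the `module.linalg` probing here is illustrative, not
settled API:

```
import numpy as np
import scipy.linalg

def solve_system(a, b):
    module = np.get_array_module(a, b)
    linalg = getattr(module, 'linalg', None)
    if linalg is not None and hasattr(linalg, 'solve'):
        return linalg.solve(a, b)
    # fall back to scipy.linalg on plain NumPy arrays
    return scipy.linalg.solve(np.asarray(a), np.asarray(b))
```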
On Wed, Feb 5, 2020 at 10:01 AM Andreas Mueller <t3kcit@gmail.com> wrote:

> A bit late to the NEP 37 party. I just wanted to say that at least from my
> perspective it seems a great solution that will help sklearn move towards
> more flexible compute engines.
>
> [...]
>
> - We use scipy.linalg in many places, and we would need to do a separate
> dispatching to check whether we can use module.linalg instead
> (that might be an issue for many libraries but I'm not sure).

That is an issue, and goes in the opposite direction we need - scipy.linalg
is a superset of numpy.linalg, so we'd like to encourage using scipy. This is
something we may want to consider fixing by making the dispatch decorator
public in numpy and adopting it in scipy.

Cheers,
Ralf
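A hedged sketch of what making that decorator public could look like, based on
the decorator that currently lives in the private `numpy.core.overrides`
module; the public name and the SciPy adoption are assumptions, not decided
API:

```
from numpy.core.overrides import array_function_dispatch  # private today

def _solve_dispatcher(a, b):
    # tells the decorator which arguments to inspect for overrides
    return (a, b)

@array_function_dispatch(_solve_dispatcher)
def solve(a, b):
    ...  # the scipy.linalg.solve implementation would go here
```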
> scipy.linalg is a superset of numpy.linalg

This isn't completely accurate - numpy.linalg supports almost all operations*
over stacks of matrices via gufuncs, but scipy.linalg does not appear to.

Eric

*: not lstsq, due to an ungeneralizable public API

On Wed, 5 Feb 2020 at 17:38, Ralf Gommers <ralf.gommers@gmail.com> wrote:

> That is an issue, and goes in the opposite direction we need -
> scipy.linalg is a superset of numpy.linalg, so we'd like to encourage
> using scipy. This is something we may want to consider fixing by making
> the dispatch decorator public in numpy and adopting it in scipy.
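For example, with plain NumPy (runnable today), `numpy.linalg` functions
broadcast over leading "stack" dimensions, while `scipy.linalg.inv` only
accepts a single 2-D matrix:

```
import numpy as np

a = np.random.rand(10, 3, 3)   # a stack of ten 3x3 matrices
inv = np.linalg.inv(a)         # inverts each matrix in the stack
assert inv.shape == (10, 3, 3)
```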
![](https://secure.gravatar.com/avatar/5f88830d19f9c83e2ddfd913496c5025.jpg?s=120&d=mm&r=g)
On Wed, Feb 5, 2020 at 12:14 PM Eric Wieser <wieser.eric+numpy@gmail.com> wrote:

> > scipy.linalg is a superset of numpy.linalg
>
> This isn't completely accurate - numpy.linalg supports almost all
> operations* over stacks of matrices via gufuncs, but scipy.linalg does
> not appear to.
>
> *: not lstsq due to an ungeneralizable public API

That's true for `qr` as well, I believe. Indeed some functions have
diverged slightly, but that's not on purpose - more a lack of time to
coordinate. We would like to fix that so everything is in sync and fully
API-compatible again.

Ralf
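One concrete instance of the divergence Ralf mentions (an illustrative
snippet, not from the thread; the exact defaults may have shifted since):
the two ``qr`` functions use different mode names and defaults, so the
same call returns differently shaped factors.

.. code:: python

    import numpy as np
    import scipy.linalg

    rng = np.random.default_rng(0)
    a = rng.standard_normal((5, 3))

    # numpy.linalg.qr defaults to mode='reduced': Q is (5, 3), R is (3, 3).
    q_np, r_np = np.linalg.qr(a)

    # scipy.linalg.qr defaults to mode='full', and its rough equivalent
    # of NumPy's 'reduced' is spelled 'economic': Q is (5, 5), R is (5, 3).
    q_sp, r_sp = scipy.linalg.qr(a)

    print(q_np.shape, q_sp.shape)  # -> (5, 3) (5, 5)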
![](https://secure.gravatar.com/avatar/93a76a800ef6c5919baa8ba91120ee98.jpg?s=120&d=mm&r=g)
On Wed, Feb 5, 2020 at 8:02 AM Andreas Mueller <t3kcit@gmail.com> wrote:

> A bit late to the NEP 37 party.
> I just wanted to say that at least from my perspective it seems a great
> solution that will help sklearn move towards more flexible compute
> engines. I think one of the biggest issues is array creation (including
> random arrays), and that's handled quite nicely with NEP 37.

Andreas, thanks for sharing your feedback here! Your perspective is really
appreciated.

> - We use scipy.linalg in many places, and we would need to do a separate
> dispatching to check whether we can use module.linalg instead
> (that might be an issue for many libraries but I'm not sure).

This brings up a good question -- obviously the final decision here is up
to the SciPy maintainers, but how should we encourage SciPy to support
dispatching?

We could pretty easily make __array_function__ cover SciPy by exposing
NumPy's internal utilities. SciPy could simply use the
np.array_function_dispatch decorator internally, and that would be enough.

It is less clear how this could work for __array_module__, because
__array_module__ and get_array_module() are not generic -- they refer
explicitly to a NumPy-like module. If we want to extend them to SciPy (for
which I agree there are good use-cases), what should that look like?

The obvious choices would be either to add a new protocol, e.g.,
__scipy_module__ (but then NumPy needs to know about SciPy), or to add
some sort of "module request" parameter to np.get_array_module() to
indicate the requested API, e.g.,
np.get_array_module(*arrays, matching='scipy'). This is pretty similar to
the "default" argument, but it would need to get passed into the
__array_module__ protocol, too.
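To sketch the first option: the decorator in question currently lives at
``numpy.core.overrides.array_function_dispatch`` (private today). Adoption
in scipy.linalg might look roughly like this, using ``lu`` as an arbitrary
example -- a hedged sketch, not an actual SciPy patch:

.. code:: python

    # Only the import location is real today; everything else is
    # hypothetical SciPy adoption.
    from numpy.core.overrides import array_function_dispatch

    def _lu_dispatcher(a, permute_l=None, overwrite_a=None,
                       check_finite=None):
        # Tells NumPy which arguments are relevant for
        # __array_function__ dispatch -- here, only the input array.
        return (a,)

    @array_function_dispatch(_lu_dispatcher, module='scipy.linalg')
    def lu(a, permute_l=False, overwrite_a=False, check_finite=True):
        # ... the existing NumPy-array implementation, unchanged ...
        raise NotImplementedError("stub for illustration")

With that in place, calling ``scipy.linalg.lu`` on a duck array would
route through that array's ``__array_function__``, exactly as NumPy's own
functions do; whether a given array library chooses to implement the
function is then up to it.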
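And to sketch the second option (entirely hypothetical -- neither the
``matching`` argument nor this two-argument form of ``__array_module__``
exists): a duck array could hand back a different namespace depending on
which API is requested:

.. code:: python

    import numpy
    import scipy.linalg  # makes the scipy.linalg namespace available

    class MyArray:
        def __array_module__(self, types, matching='numpy'):
            if not all(issubclass(t, MyArray) for t in types):
                return NotImplemented
            # For illustration this returns the real modules; a duck
            # array library would return its own numpy-like and
            # scipy-like namespaces instead.
            if matching == 'numpy':
                return numpy
            if matching == 'scipy':
                return scipy
            return NotImplemented

    # Hypothetical usage in library code:
    #   module = np.get_array_module(x, y, matching='scipy')
    #   p, l, u = module.linalg.lu(x)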
On Thu, 2020-02-06 at 09:35 -0800, Stephan Hoyer wrote:
Hmmm, in NumPy we can easily force basically 100% of (desired) coverage, i.e. JAX can return a namespace that implements everything. With SciPy that is already much less feasible, and as you go to domain-specific tools it seems implausible. `get_array_module` solves the issue for a library that wants to support all array-likes, as long as:

* most functions rely only on the NumPy API, and
* the domain-specific library is expected to implement support for specific array objects if necessary. E.g. sklearn can include special code for Dask support; Dask does not replace sklearn code.
I suppose the question here is: where should the code reside? For SciPy, I agree there is a good reason why you may want to "reverse" the implementation: the code to support JAX arrays should live inside JAX.

One, probably silly, option is to return a "global" namespace, so that:

    np = get_array_module(*arrays).numpy

We have two distinct issues: Where should e.g. SciPy put a generic implementation (assuming they want to provide implementations that only require NumPy-API support, so as not to require overriding)? And, if a library provides generic support, should we define a standard for how the context/namespace may be passed in/provided? sklearn's main namespace is expected to support many array objects/types, but it could be nice to pass in an already-known context/namespace (say scikit-image already found it, and then calls scikit-learn internally). A "generic" namespace may even require this to infer the correct output array object.

Another thing about backward compatibility: what is our vision there, actually? This NEP will *not* give the *end user* the option to opt in! Here, opt-in is really reserved to the *library user* (e.g. sklearn). (I did not realize this clearly before.)

Thinking about that for a bit now, that seems like the right choice. But it also means that the library requires an easy way of giving a FutureWarning, to notify the end user of the upcoming change. The end user will easily be able to convert to a NumPy array to keep the old behaviour.

Once this warning is given (maybe during `get_array_module()`), the array module object/context would preferably be passed around, hopefully even between libraries. That provides a reasonable way to opt in to the new behaviour without a warning (mainly for library users; end users can silence the warning if they wish).

- Sebastian
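As an illustration of passing an already-known namespace between libraries, the hand-off might look like the sketch below. The `xp=None` keyword is purely illustrative, not an agreed-upon convention, and `get_array_module` is the function proposed by this NEP.

```
import numpy as np

def library_function(array, xp=None):
    # Resolve the namespace only if the caller (e.g. another library
    # higher up the stack) did not already do so.
    if xp is None:
        xp = np.get_array_module(array)
    return xp.tanh(array)
```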
On Thu, Feb 6, 2020 at 12:20 PM Sebastian Berg <sebastian@sipsolutions.net> wrote:
My main concern with a "global namespace" is that it adds boilerplate to the typical usage of fetching a duck-array version of NumPy. I think the simplest proposal is to add a "module" argument to both get_array_module and __array_module__, with a default value of "numpy". This adds flexibility with minimal additional complexity. The main question is what the type of arguments for "module" should be: 1. Modules could be specified as strings, e.g., "numpy" 2. Module could be specified as actual namespace, e.g., numpy from import numpy. The advantage of (1) is that in theory you could write np.get_array_module(*arrays, module='scipy.linalg') without the overhead of actually importing scipy.linalg or without even needing scipy to be installed, if all the arrays use a different scipy.linalg implementation. But in practice, this seems a little far-fetched. All alternative implementations of scipy that I know of (e.g., in JAX or conceivably in Dask) import the original library. The main downside of (1) is that it would would mean that NumPy's ndarray.__array_module__ would need to use importlib.import_module() to dynamically import modules. It also adds a potentially awkward asymmetry between the "module" and "default" arguments, unless we also switched default to specify modules with strings. Either way, the "default" argument will probably need to be adjusted so that by default it matches whatever value is passed into "module", instead of always defaulting to "numpy". Any thoughts on which of these options makes most sense? We could also put off making any changes to the protocol now, but this change seems pretty safe and appear to have real use-cases (e.g., for sklearn) so I am inclined to go ahead with it now before finalizing the NEP.
I don't think NumPy needs to do anything about warnings. It is straightforward for libraries that want to use get_array_module() to issue their own warnings before calling get_array_module(), if desired. Or alternatively, if a library is about to add a new __array_module__ method, it is straightforward to issue a warning inside the new __array_module__ method before returning the NumPy functions.
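A sketch of that second suggestion, for a library newly adding the protocol; the class name and warning message are illustrative only.

```
import warnings
import numpy

class NewDuckArray:
    def __array_module__(self, types):
        if not all(issubclass(t, NewDuckArray) for t in types):
            return NotImplemented
        # Warn before handing back plain NumPy, so end users can prepare
        # for the namespace changing in a future release.
        warnings.warn(
            "in the future this will return NewDuckArray's own namespace "
            "instead of plain numpy",
            FutureWarning,
        )
        return numpy
```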
On Sun, Feb 23, 2020 at 3:31 PM Stephan Hoyer <shoyer@gmail.com> wrote:
I don't think this is quite enough. Sebastian points out a fairly important issue. One of the main rationales for the whole NEP, and the argument in multiple places (https://numpy.org/neps/nep-0037-array-module.html#opt-in-vs-opt-out-for-user...) is that it's now opt-in while __array_function__ was opt-out. This isn't really true - the problem is simply *moved*, from the duck-array libraries to the array-consuming libraries. The end user will still see the backwards-incompatible change, with no way to turn it off. It will be easier with __array_module__ to warn users, but this should be expanded on in the NEP.

Also, I'm still not sure I agree with the tone of the discussion on this topic. It's very heavily inspired by what the JAX devs are telling you (the NEP still says PyTorch and scipy.sparse as well, but that's not true in both cases). If you ask Dask and CuPy, for example, they're quite happy with __array_function__ and there haven't been many complaints about backwards-compat breakage.

Cheers,
Ralf
On Sun, Feb 23, 2020 at 3:59 PM Ralf Gommers <ralf.gommers@gmail.com> wrote:
Ralf, thanks for sharing your thoughts.

I'm not quite sure I understand the concerns about backwards incompatibility:

1. The intention is that implementing an __array_module__ method should be backwards compatible with all current uses of NumPy. This satisfies backwards-compatibility concerns for an array-implementing library like JAX.
2. In contrast, calling get_array_module() offers no guarantees about backwards compatibility. This seems nearly impossible, because the entire point of the protocol is to make it possible to opt in to new behavior. So backwards compatibility isn't solved for Scikit-Learn switching to use get_array_module(), and after Scikit-Learn does so, adding __array_module__ to new types of arrays could potentially have backwards-incompatible consequences for Scikit-Learn (unless sklearn uses default=None).

Are you suggesting just adding something like what I'm writing here into the NEP? Perhaps along with advice to consider issuing warnings inside __array_module__ and falling back to legacy behavior when first implementing it on a new type?

We could also potentially make a few changes to make backwards compatibility even easier, by making the protocol less aggressive about assuming that NumPy is a safe fallback. Some non-exclusive options:

a. We could switch the default value of "default" on get_array_module() to None, so an exception is raised if nothing implements __array_module__.
b. We could include *all* argument types in "types", not just types that implement __array_module__. NumPy's ndarray.__array_module__ could then recognize and refuse to return an implementation if there are other arguments that might implement __array_module__ in the future (e.g., anything outside the standard library?).

The downside of making either of these choices is that it would potentially make get_array_module() a bit less usable, because it is more likely to fail, e.g., if called on a float, or some custom type that should be treated as a scalar.
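Option (a) in use would look something like the sketch below, assuming `default=None` became the default (it is not, in the current draft); `get_array_module` is the function proposed by the NEP.

```
import numpy as np

def strict_concatenate(arrays):
    # With default=None, get_array_module raises TypeError when no
    # argument implements __array_module__ (e.g. plain lists or floats),
    # instead of silently falling back to NumPy.
    module = np.get_array_module(*arrays, default=None)
    return module.concatenate(arrays)
```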
> Also, I'm still not sure I agree with the tone of the discussion on this

I'm linking to comments you wrote in reference to PyTorch and scipy.sparse in the current draft of the NEP, so I certainly want to make sure that you agree with my characterization :). Would it be fair to say:

- JAX is reluctant to implement __array_function__ because of concerns about breaking existing code. JAX developers think that when users use NumPy functions on JAX arrays, they are explicitly choosing to convert from JAX to NumPy. This model is fundamentally incompatible with __array_function__, which we chose to override the existing numpy namespace.
- PyTorch and scipy.sparse are not yet in a position to implement __array_function__ (due to the lack of a direct implementation of NumPy's API), but these projects take backwards compatibility seriously.

Does "take backwards compatibility seriously" sound about right to you? I'm very open to specific suggestions here. (TensorFlow could probably also be safely added to this second list.)

Best,
Stephan
On Sun, 2020-02-23 at 22:44 -0800, Stephan Hoyer wrote:
Just to be clear, the way scikit-learn would probably be handling backward-compatibility concerns is by adding it to their configuration context manager, see:

https://github.com/scikit-learn/scikit-learn/pull/16574

So the backward compat is in a sense solved (but there are project-specific context managers involved - which is not perfect maybe, but OK).

I am willing to consider pushing this off into its own namespace (and package, preferably in the NumPy org though) if necessary, the idea being that we keep it super minimal and expand it as we go to keep up with scikit-learn needs. Possibly even with a function-registration approach, so that you could have import-time checks on function availability, and catching signature mismatches would be easier.

I still do not like the idea of context managers much though; I think I prefer the returned (bound) namespace a lot. Also, I think we should *not* do implicit dispatching. Consider this case:

    def numpy_only(x):
        x = np.asarray(x)
        return x + _helper(len(x))

    def generic(x):
        module = np.get_array_module(x)
        x = module.asarray(x)
        return x + _helper(len(x))

    def _helper(n, module=np):
        return module.random.uniform(size=n)

If you try to make the above work with context managers, you _still_ need to pass in the module to _helper [1], because otherwise you would have to change the `numpy_only` function to ensure an outside context does not change its behaviour.

- Sebastian

[1] If "module" had a `module.set_backend()` and was a global instead, `_helper` using the global module would do the wrong thing for `numpy_only`. This is of course also a bit of an issue with the sklearn context manager, but it seems to me _much_ less so, and probably not if most libraries slowly switch over and currently use `np.asarray`.
On Sun, 2020-02-23 at 22:44 -0800, Stephan Hoyer wrote:
I think that should be sufficient, personally. We could mention that scikit-learn will likely use a context manager to do this. We can also think about providing a global default (which sklearn can use as its own default if they wish, but that is reserved to the end user). That would be a small amendment, and I think we could add it even after accepting the NEP as it is.
I am not sure that switching the default to None makes much of a difference, to be honest - unless we use it to signal a super-strict mode similar to b. below.
That is a good point; anything that is not recognized by NumPy could simply be rejected. It does mean that you have to call `module.asarray()` manually more often, though. For `list`, it could also make sense to just add np.ndarray to types. If we want to be conservative, maybe we could also just error out before calling `__array_module__`: whenever there is something that we do not know how to interpret, force the user to clarify?
Right, although we could relax it later if it seems overly annoying.
This will need input from Ralf; my personal main concern is backward compatibility in libraries: I am pretty sure sklearn would only use a potential `np.asduckarray` when the user opted in. But in that case my personal feeling is that the `get_array_module` solution is cleaner and makes it easier to expand functionality slowly (for libraries).

Two other points:

First, I am wondering if we should add something like a `__qualname__` to the contract. I.e. a returned module must have a well-defined `module.__name__` (that is usually already correct), so that sklearn could do:

    module = np.get_array_module(*arrays)
    if module.__name__ not in ("numpy", "sparse"):
        raise TypeError("Currently only numpy and sparse are supported")

if they wish (that is trivial, but if you return a class acting as a module it may be important).

Second, we have to make progress on whether or not the "restricted" namespace idea should have priority. My personal opinion is tending strongly towards no. The NumPy version should normally be older than other libraries, and if NumPy updates the API, so do the downstream implementers. E.g. dask may have to provide multiple versions of the same function depending on the installed NumPy version, but that seems OK to me? It is just as downstream libraries currently have to support multiple NumPy versions. We could add a contract that the first time `get_array_module` is used to e.g. get the dask namespace and the NumPy version is too new, a warning should be given.

The practical thing seems to me that we ignore this for the moment (as something we can do later on)? If there is missing API, in most cases an AttributeError will be raised, which could provide some additional information to the user. The only alternative seems to be the complete opposite: create a new module, and make even NumPy only one of the implementers of that new (restricted) module. That may be cleaner, but I fear that it is impractical, to be honest.

I will put this on the agenda for tomorrow, even if we discuss it only very briefly. My feeling (and hope) is that we are nearing a point where we can make a final decision.

Best,
Sebastian
On Wed, Mar 4, 2020 at 1:22 AM Sebastian Berg <sebastian@sipsolutions.net> wrote:
Sorry, this never made it back to the top of my todo list.
Indeed, it is nearly impossible - except if there's a context manager or some other control mechanism exposed to the end user. Hence that should be part of the design, I think. Otherwise you're just solving something for the JAX devs, but not for the scikit-learn/scipy/etc. devs, who will then each have to reinvent the wheel for backwards compat.

> So backwards compatibility isn't solved for Scikit-Learn
+1

> That would be a small amendment, and I think we could add it even after
I agree, that doesn't make a difference.
Interesting point. Not accepting sequences could be considered here. It may help a lot with robustness and typing to only accept ndarray, other objects with __array__, and scalars.
agreed
True. I would say, though, that scipy.sparse will never implement either __array_function__ or __array_module__ due to semantic incompatibilities (it acts like np.matrix). So it's kind of irrelevant. And if PyTorch gets around to adding a numpy-compatible API, they're fine with __array_function__.
I think it's quite important, and __array_module__ gives us a chance to introduce it. However, it's not ready - so I'd say that if the __array_module__ implementation is ready and there's no well-defined restricted-API proposal (I expect to have that in August), then we can move ahead without it.

> The NumPy version should normally be older than other libraries, and if
That seems unworkable, and I don't think any libraries do this. Coupling the semantics of a single Dask function to the installed numpy version is odd.

> It is just as downstream libraries currently have to support multiple
I think we can't solve this until we have a well-defined API, which is the restricted API plus API versioning. Until then it just remains the current status: compatibility is implementation-defined.

Cheers,
Ralf
On Thu, 2020-04-09 at 13:52 +0200, Ralf Gommers wrote:
Is it all that odd? Libraries (not array providers) already need to test for the NumPy version occasionally due to API changes, so they also have two versions of the same thing around (e.g. a fallback). This simply would move the burden to the array-object implementer to some degree.

Assume that we have a versioned API in some form or another; it seems to me we either require:

    module = np.get_array_module(..., api_version=2)

or define `module.__api_version__`. The latter means that sklearn/SciPy may have to check `__api_version__` on every function call, while currently such checks usually happen at import time. On the other hand, the former means that sklearn/scipy can only opt in to a new API easily after 3+ years?

Saying that the NumPy version is what pins the api-version is not much more than assuming/requiring that NumPy will be the least up-to-date package?

Of course it is unworkable to get 100% right in practice, but are you saying that because it seems like an impractical approach, or because the API surface is currently so large that we will of course never get it 100% right (but that is generally true; nobody will be able to implement NumPy 100% compatibly)?

`__array_function__` has the same issue? If we change our API, Dask has to catch up. If SciPy expects it to be the old version though (based on the NumPy import), it will incorrectly assume the old API will be used.

- Sebastian
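Side by side, the two spellings being weighed look like the sketch below. Neither `api_version` nor `__api_version__` exists anywhere; both are hypothetical options from the discussion above.

```
import numpy as np

def resolve_v2(*arrays):
    # Former option: request the API version at resolution time.
    # module = np.get_array_module(*arrays, api_version=2)

    # Latter option: check a version attribute on every call.
    module = np.get_array_module(*arrays)
    if getattr(module, '__api_version__', 1) < 2:
        raise TypeError("this code path requires API version 2")
    return module
```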
On Thu, Apr 9, 2020 at 6:54 PM Sebastian Berg <sebastian@sipsolutions.net> wrote:
That's completely different: it's internal to a library, and not visible to end users via different signatures/behavior.

> This simply would move the burden to the array-object implementer to
Yes, this is the version I was thinking about.
That's the case anyway; it has very little to do with API versioning, I think - it's simply determined by the minimum supported NumPy version.
Yes, this - impractical and undesired.

> or because
That's true too, we *don't want* anyone to start adding compat features for outdated or "wish we could deprecate" NumPy features.
> `__array_function__` has the same issue? If we change our API, Dask has to catch up.

Yes, that's true. The restricted API should be more stable than the whole NumPy API, otherwise no one will be able to be fully compatible.

> If SciPy expects it to be the old version though (based on
> the NumPy import) it will incorrectly assume the old API will be used.

That's not incorrect unless it's a backwards-incompatible change, which should be rare.

Cheers,
Ralf
On 2/23/20 6:59 PM, Ralf Gommers wrote:
Might it be possible to flip this NEP back to opt-out while keeping the nice simplifications and configurable array-creation routines, relative to __array_function__?

That is, what if we define two modules, "numpy" and "numpy_strict". "numpy_strict" would raise an exception on duck arrays defining __array_module__ (as numpy currently does). "numpy" would be a wrapper around "numpy_strict" that decorates all numpy methods with a call to "get_array_module(inputs).func(inputs)".

Then end-user code that did "import numpy as np" would accept ducktypes by default, while library developers who want to signal they don't support ducktypes can opt out by doing "import numpy_strict as np". Issues with `np.asarray` seem mitigated compared to __array_function__, since that method would now be ducktype-aware.

Cheers,
-Allan
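A rough sketch of how such a wrapper could be built, with `np` standing in for the hypothetical "numpy_strict" module and `get_array_module` being the function proposed by NEP 37:

```
import functools
import numpy as np

def dispatching(strict_func):
    @functools.wraps(strict_func)
    def wrapper(*args, **kwargs):
        # Resolve the namespace from the arguments, then call its
        # version of the same-named function. A real implementation
        # would also need to look inside sequence arguments.
        module = np.get_array_module(*args)
        return getattr(module, strict_func.__name__)(*args, **kwargs)
    return wrapper

# The dispatching "numpy" namespace would be built from such wrappers:
concatenate = dispatching(np.concatenate)
```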
On Fri, 2020-02-28 at 11:28 -0500, Allan Haldane wrote:
This would be possible, but I think we strongly leaned against the idea. Basically, if you have to opt out, from a library perspective there may be `np.asarray` calls which, for example, later call into C and expect arrays. So I have large doubts that an opt-out solution works easily for library authors. __array_function__ is opt-out, but effectively most clean library code has already opted out... We had previously discussed the opposite - having a namespace of implicit dispatching based on get_array_module - but if we keep __array_function__ around, I am not sure there is much reason for it.
My tendency is that if we want to go there, we would need to push ahead with the `np.duckarray()` idea instead. To be clear: I currently very much prefer the get_array_module() idea. It just seems much cleaner for library authors, and they are the primary issue at the moment in my opinion.

- Sebastian
On Wed, 2020-04-08 at 17:04 -0400, Andreas Mueller wrote:
Hey,

thanks for the ping. Things are a bit stuck right now. I think what we need is some clarity on the implications and alternatives. I was thinking about organizing a small conference call with the main people interested in the next few weeks. There are also still some alternatives to this NEP in the race, and we may need to clarify which ones actually remain in the race...

Maybe to see some of the possible sticking points:

1. What do we do about SciPy - have it under this umbrella? And how would we want to design that?
2. Context managers have some composition issues, maybe less so if they are in the downstream package. Or should we have global defaults as well?
3. How do we ensure safe transitions for users as much as possible?
   * If you use this, can functions suddenly return a different type in the future?
   * Should we force you to cast to NumPy arrays in a transition period, or force you to somehow silence a transition warning?
4. Is there a serious push to have a "reduced" API or even a versioned API?

But I am probably forgetting some other things.

In my personal opinion, I think NEP 37 with minor modifications is still the best duck in the race. I feel we should be able to find a reasonable solution for SciPy. Point 2 about context managers may be true, but this is much smaller in scope than what uarray proposed IIRC, and I could not figure out major scoping issues with it yet (in the sklearn draft).

About the safe transition: that may be the stickiest point. But e.g. if you enable `get_array_module`, sklearn could limit a certain function to error out if it finds something other than NumPy? The main problem is how to opt in to future behaviour. A context manager can do that, although the danger is that someone just uses it everywhere...

On the reduced/versioned API front, I would hope that we can defer that as a semi-orthogonal issue, basically saying that for now you have to provide a NumPy API that faithfully reproduces whatever NumPy version is installed on the system.

Cheers,
Sebastian
On Thu, Apr 9, 2020 at 12:02 AM Sebastian Berg <sebastian@sipsolutions.net> wrote:
Current feeling: best to ignore it for now. It's quite a bit of work to fix API incompatibilities for linalg that no one currently seems interested in tackling. We can revisit once that's done.
+1 for adding this right next to get_array_module().
There is, it'll take a few months.
I think it would be nice to have a separate NEP 37 implementation outside of NumPy to play with. Unlike __array_function__, I don't think it has to go into NumPy immediately. This avoids the whole "experimental API" issue, and it would be quite useful to test this with, e.g., CuPy + scikit-learn without being stuck with any decisions in a released NumPy version. It also makes switching on/off very easy for users: just (don't) `pip install numpy-array-module`.

Cheers,
Ralf
On Thu, 2020-04-09 at 13:52 +0200, Ralf Gommers wrote:
<snip>
Fair enough, I have created a hopefully working start here:

https://github.com/seberg/numpy_dispatch

(This is not tested much at all yet, so it could be very buggy.)

There are a couple of additional features that I added:

1. A global opt-in (it is impossible to opt out once opted in!)
2. A local opt-in (to guarantee opt-in if the global flag is not set)
3. I added features to allow transitioning:

       get_array_module(*arrays, modules="numpy",
                        future_modules=("dask.array", "cupy"),
                        fallback="warn")

   will give a FutureWarning/DeprecationWarning where necessary. In the above, "numpy" is supported, while dask and cupy are supported but not enabled by default. `None` works to say "all modules". Once the transition is done, just move dask and cupy into `modules` and remove `fallback=None`.
4. If there are FutureWarnings/DeprecationWarnings, the user needs to be able to opt in to the future behaviour. Opting out can be done by casting inputs. Opting in is done using:

       with future_dispatch_behavior():
           call_library_function()

Obviously, we may not want these features, but I was curious how we could provide the tools to allow clean transitions.

Both context managers should be thread-safe, but I did not test that.

The best try would probably be cupy and sklearn again, so I will give a ping on the sklearn PR. To make that easier, I tried to hack a bit of a "util" to allow testing (please scroll down in the readme on github).

Best,
Sebastian
On Thu, 2020-04-09 at 22:11 -0500, Sebastian Berg wrote:
There is no immediate need to put modules, future_modules, and fallback in there. The main convenience it gives is that we can more easily provide the user with an opt-in context manager for the new behaviour. Without that, libraries will have to do these checks themselves; that is not difficult. But if we wish to provide a context manager to opt all of that in, the library will need additional API to query our context-manager state. Or every library needs their own solution, which does not seem desirable (although it means you cannot accidentally opt internal functions in to newer behaviour).

- Sebastian
On Fri, Apr 10, 2020 at 5:17 AM Sebastian Berg <sebastian@sipsolutions.net> wrote:
Thanks!
So future_modules explicitly excludes compatible libraries that are not listed. Why would you want anyone to do that? I don't understand "supported but not enabled", and it looks undesirable to me to special-case any library in this mechanism.

Cheers,
Ralf

> 4. If there are FutureWarnings/DeprecationWarnings, the user needs to be
On Fri, 2020-04-10 at 12:27 +0200, Ralf Gommers wrote:
We have two (or three) types of modules (either could be "all"):

1. Supported modules that we dispatch to.
2. Modules that are supported but will be dispatched to by default only in the future. So if the user got a future module, they will get a FutureWarning. They have to choose to cast the inputs or opt in to the future behaviour.
3. Unsupported modules: if this is resolved, it is an error. I currently assume that this does not need to be a negative list.

You need to distinguish those somehow, since you need a way to transition. Even if you expect that modules would always be *all* modules, `numpy` is still the only accepted module originally.

So, as I said, `future_modules` is only about transitioning and enabling `FutureWarning`s. It does not have to live there, but we need a way to transition. These options do not have to be handled by us; they only help here with having context managers to opt in to new behaviour, and maybe with getting an idea of what transitions can look like. Alternatively, we could leave it to projects to create their own project-specific context managers to do the same, and avoid possible scoping issues even more.

- Sebastian
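For illustration, the three categories could be told apart during resolution roughly like this. This is a sketch, not code from the numpy_dispatch package, with `None` meaning "all" as described above:

```
import warnings

def classify(module_name, modules, future_modules):
    if modules is None or module_name in modules:
        return "dispatch"  # 1. supported, dispatched to now
    if future_modules is None or module_name in future_modules:
        warnings.warn(
            f"{module_name} will be dispatched to by default in the future",
            FutureWarning)
        return "future"    # 2. supported, but not enabled by default
    raise TypeError(f"module {module_name} is not supported")  # 3. error
```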
On Fri, Apr 10, 2020 at 3:03 PM Sebastian Berg <sebastian@sipsolutions.net> wrote:
I think we only have modules that implement __array_module__, and ones that don't.
Sorry, I still don't get it - transition what?

You seem to be operating on the assumption that the users of get_array_module want or need to control which numpy-like libraries they allow and which they don't. That seems fundamentally wrong. How would you treat, for example, an array library that is developed privately inside some company?

Cheers,
Ralf
On Fri, 2020-04-10 at 18:19 +0200, Ralf Gommers wrote:
Well, you still need to transition from NumPy -> allow everything, so for now please just ignore that part if you like, and use/assume:

    get_array_module(..., modules="numpy", future_modules=None, fallback="warn")

during the transition, and:

    get_array_module(...)

after it. After all, this is a draft project right now, so it is just as much about trying out what can be done. It is not unlikely that this transition burden will be put more on the library in any case, but it shows that it can be done.

As to my "fundamentally wrong" assumption: should libraries' goal be to support everything? Definitely! But... I do not want to make that decision for libraries, so if library authors tell me that they have no interest in it, all the better. Until then I am more than happy to keep that option on the table, even if just as a thought for library authors to consider their options. Possible (brainstorming) reasons could be:

1. Say I currently heavily use Cython code, so I am limited to NumPy (or at least arrays that can expose a buffer/`__array_interface__`). Now if someone adds a CUDA implementation, I would support cupy arrays, but not distributed arrays. I admit maybe checking that at function entry like this is the wrong approach there.
2. To limit to certain types is to say "we know (and test) that our library works with xarray, Dask, NumPy, and CuPy". Now you can say that is also a misconception, because if you stick to just the NumPy API you should know that it will "just work" with everything. But in practice it seems like it might happen? In that case you may want to actually allow any odd array and just issue a warning, a bit like the transition warnings I put in for testing.

---

There are two other things I am wondering about:

1. Subclasses may want to return their superclass's module (even by default?), in which case their behaviour depends on the superclass's module behaviour. Further, a library would need to use `np.asanyarray()` to prevent the subclass from being cast to the superclass.
2. There is one transition that does not quite exist. What if an array-like starts implementing or expands `__array_module__`? That seems fine, but in that case the array-like will have to provide the opt-in context manager with a FutureWarning. The transition from no `__array_module__` to implementing it may need some thought, but I expect it is fine: the array-like simply always gives a FutureWarning, although it cannot know what will actually happen in the future (no change, error, or the array-like takes control).

- Sebastian
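Point 1 might look like the sketch below (illustrative only; the subclass is hypothetical, and `get_array_module` is the function proposed by NEP 37):

```
import numpy as np

class UnitArray(np.ndarray):
    # A subclass that defers to its superclass's module, plain numpy.
    def __array_module__(self, types):
        if all(issubclass(t, np.ndarray) for t in types):
            # Returning numpy means a library calling module.asarray()
            # may cast UnitArray back down to a plain ndarray, hence
            # the np.asanyarray() caveat above.
            return np
        return NotImplemented
```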
Thanks. Maybe, to start discussion, floating the actual usage here:

```
def add_noise(array_like):
    module = np.get_array_module(array_like)
    noise = module.random.randn(*array_like.shape)
    return array_like + noise
```

The above function could also include `module.asarray(array_like)` to support non-array inputs. Importantly, the random functions, and especially array creation functions such as `empty` and `ones`, can work.

To summarize, I think there are two main things that this NEP can address:

1. Some libraries are reluctant to adopt `__array_function__`, but they could adopt this NEP.
2. Libraries written for numpy (scipy, sklearn, etc.) often use `np.asarray`, and `__array_function__` does not help them easily. This NEP hopefully gives them a way forward.

We may need to prototype some examples, but right now it feels like this should be a step forward, especially for libraries. Of course there are other similar design options, so discussions (or criticism of this idea) are welcome.

I believe this can help libraries, i.e. if skimage only feels confident that they support Dask, they can still do:

```
module = np.get_array_module(*input_arrays)
if module not in {np, dask.array}:
    raise TypeError("This function only supports numpy and Dask.")
```

I do not think this is as cleanly possible with `__array_function__`.

Best,
Sebastian

On Mon, 2020-01-06 at 20:29 -0800, Stephan Hoyer wrote:
A bit late to the NEP 37 party. I just wanted to say that, at least from my perspective, it seems a great solution that will help sklearn move towards more flexible compute engines. I think one of the biggest issues is array creation (including random arrays), and that's handled quite nicely with NEP 37.

There's some discussion on the scikit-learn side here:
https://github.com/scikit-learn/scikit-learn/pull/14963
https://github.com/scikit-learn/scikit-learn/issues/11447

Two different groups of people tried to use __array_function__ to delegate to MxNet and CuPy respectively in scikit-learn, and ran into the same issues.

There are some remaining issues in sklearn that will not be handled by NEP 37, but they go beyond NumPy in some sense. Just to briefly bring them up:

- We use scipy.linalg in many places, and we would need to do a separate dispatch to check whether we can use module.linalg instead (that might be an issue for many libraries, but I'm not sure).
- Some models have several possible optimization algorithms, some of which are pure NumPy and some of which are Cython. If someone provides a different array module, we might want to choose an algorithm that is actually supported by that module (see the sketch after this message). While this exact issue is maybe sklearn-specific, a similar issue could appear for most downstream libs that use Cython in some places. Many Cython algorithms could be implemented in pure NumPy with a potential slowdown, but once we have NEP 37 there might be a benefit to having a pure NumPy implementation as an alternative code path.

Anyway, NEP 37 seems a great step in the right direction and would enable sklearn to actually dispatch in some places. Dispatching just based on __array_function__ has not really seemed feasible so far.

Best,
Andreas Mueller
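A minimal sketch of that algorithm-selection idea. The helper names here are hypothetical, not sklearn's actual API; the point is that the module returned by `get_array_module` decides whether the compiled fast path is safe:

```
import numpy as np

def fit(X):
    module = np.get_array_module(X)
    if module is np:
        # real ndarrays: use the compiled (Cython) implementation
        return _fit_cython(np.asarray(X))
    # any other conforming module: use the pure "NumPy API" code path
    return _fit_generic(module, X)

def _fit_generic(module, X):
    # works for any module implementing the array module contract
    return module.mean(X, axis=0)

def _fit_cython(X):
    # stand-in for a compiled routine that requires a real buffer
    return X.mean(axis=0)
```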
On Wed, Feb 5, 2020 at 10:01 AM Andreas Mueller <t3kcit@gmail.com> wrote:

> - We use scipy.linalg in many places, and we would need to do a separate
> dispatching to check whether we can use module.linalg instead
> (that might be an issue for many libraries but I'm not sure).

That is an issue, and it goes in the opposite direction from what we need - scipy.linalg is a superset of numpy.linalg, so we'd like to encourage using scipy. This is something we may want to consider fixing by making the dispatch decorator public in NumPy and adopting it in scipy.

Cheers,
Ralf
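A rough sketch of what that could look like on the scipy side. NumPy's dispatch decorator is currently private (it lives in `numpy.core.overrides`), so both the import path and its use outside NumPy are assumptions here:

```
from numpy.core.overrides import array_function_dispatch  # private today

def _inv_dispatcher(a, overwrite_a=None, check_finite=None):
    # tell the dispatch machinery which arguments can carry
    # __array_function__ implementations
    return (a,)

@array_function_dispatch(_inv_dispatcher, module='scipy.linalg')
def inv(a, overwrite_a=False, check_finite=True):
    ...  # existing scipy.linalg.inv implementation, unchanged
```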
> scipy.linalg is a superset of numpy.linalg

This isn't completely accurate - numpy.linalg supports almost all operations* over stacks of matrices via gufuncs, but scipy.linalg does not appear to.

Eric

*: not lstsq, due to an ungeneralizable public API
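A quick illustration of that stacking difference (behaviour at the time of writing): numpy.linalg broadcasts over leading "stack" dimensions, while the scipy.linalg counterpart expects a single 2-D matrix.

```
import numpy as np
import scipy.linalg

stacked = np.stack([2 * np.eye(3), 4 * np.eye(3)])  # shape (2, 3, 3)

np.linalg.inv(stacked)     # fine: inverts each 3x3 matrix in the stack
scipy.linalg.inv(stacked)  # ValueError: expects a single square 2-D matrix
```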
On Wed, Feb 5, 2020 at 12:14 PM Eric Wieser <wieser.eric+numpy@gmail.com> wrote:
>
> > scipy.linalg is a superset of numpy.linalg
>
> This isn't completely accurate - numpy.linalg supports almost all
> operations* over stacks of matrices via gufuncs, but scipy.linalg does
> not appear to.
>
> Eric
>
> *: not lstsq, due to an ungeneralizable public API

That's true for `qr` as well, I believe. Indeed, some functions have
diverged slightly, but that's not on purpose - more a lack of time to
coordinate. We would like to fix that, so everything is in sync and fully
API-compatible again.

Ralf

> On Wed, 5 Feb 2020 at 17:38, Ralf Gommers <ralf.gommers@gmail.com> wrote:
>>
>> On Wed, Feb 5, 2020 at 10:01 AM Andreas Mueller <t3kcit@gmail.com> wrote:
>>>
>>> A bit late to the NEP 37 party.
>>> I just wanted to say that, at least from my perspective, it seems a
>>> great solution that will help sklearn move towards more flexible
>>> compute engines. I think one of the biggest issues is array creation
>>> (including random arrays), and that's handled quite nicely with NEP 37.
>>>
>>> There's some discussion on the scikit-learn side here:
>>> https://github.com/scikit-learn/scikit-learn/pull/14963
>>> https://github.com/scikit-learn/scikit-learn/issues/11447
>>>
>>> Two different groups of people tried to use __array_function__ to
>>> delegate to MxNet and CuPy respectively in scikit-learn, and ran into
>>> the same issues.
>>>
>>> There are some remaining issues in sklearn that will not be handled by
>>> NEP 37, but they go beyond NumPy in some sense. Just to briefly bring
>>> them up:
>>>
>>> - We use scipy.linalg in many places, and we would need to do a
>>>   separate dispatch to check whether we can use module.linalg instead
>>>   (that might be an issue for many libraries, but I'm not sure).
>>
>> That is an issue, and it goes in the opposite direction we need -
>> scipy.linalg is a superset of numpy.linalg, so we'd like to encourage
>> using scipy. This is something we may want to consider fixing by making
>> the dispatch decorator public in numpy and adopting it in scipy.
>>
>> Cheers,
>> Ralf
>>
>>> - Some models have several possible optimization algorithms, some of
>>>   which are pure numpy and some of which are Cython. If someone
>>>   provides a different array module, we might want to choose an
>>>   algorithm that is actually supported by that module. While this exact
>>>   issue is maybe sklearn-specific, a similar issue could appear for
>>>   most downstream libs that use Cython in some places. Many Cython
>>>   algorithms could be implemented in pure numpy with a potential
>>>   slowdown, but once we have NEP 37 there might be a benefit to having
>>>   a pure NumPy implementation as an alternative code path.
>>>
>>> Anyway, NEP 37 seems a great step in the right direction and would
>>> enable sklearn to actually dispatch in some places. Dispatching just
>>> based on __array_function__ seems not really feasible so far.
>>>
>>> Best,
>>> Andreas Mueller
>>>
>>> On 1/6/20 11:29 PM, Stephan Hoyer wrote:
>>> [full text of NEP 37 snipped]
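The separate ``scipy.linalg`` dispatch discussed above could plausibly be spelled as a small helper along the following lines. This is only a sketch under the NEP's proposed ``np.get_array_module``; the helper name ``linalg_module`` is invented for illustration, and the silent fallback to NumPy coercion may not be acceptable for every library.

.. code:: python

    import numpy as np
    import scipy.linalg

    def linalg_module(*arrays):
        # Prefer the duck-array module's own ``linalg`` namespace if it
        # provides one; otherwise fall back to scipy.linalg on coerced
        # NumPy arrays.
        module = np.get_array_module(*arrays)  # proposed by NEP 37
        linalg = getattr(module, 'linalg', None)
        if linalg is not None:
            return linalg, arrays
        return scipy.linalg, tuple(np.asarray(a) for a in arrays)

    def solve(a, b):
        linalg, (a, b) = linalg_module(a, b)
        return linalg.solve(a, b)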
On Wed, Feb 5, 2020 at 8:02 AM Andreas Mueller <t3kcit@gmail.com> wrote:
>
> A bit late to the NEP 37 party.
> I just wanted to say that, at least from my perspective, it seems a great
> solution that will help sklearn move towards more flexible compute
> engines. I think one of the biggest issues is array creation (including
> random arrays), and that's handled quite nicely with NEP 37.

Andreas, thanks for sharing your feedback here! Your perspective is really
appreciated.

> - We use scipy.linalg in many places, and we would need to do a separate
>   dispatch to check whether we can use module.linalg instead (that might
>   be an issue for many libraries, but I'm not sure).

This brings up a good question -- obviously the final decision here is up
to SciPy maintainers, but how should we encourage SciPy to support
dispatching?

We could pretty easily make __array_function__ cover SciPy by simply
exposing NumPy's internal utilities. SciPy could use the
np.array_function_dispatch decorator internally, and that would be enough.

It is less clear how this could work for __array_module__, because
__array_module__ and get_array_module() are not generic -- they refer
explicitly to a NumPy-like module. If we want to extend it to SciPy (for
which I agree there are good use cases), what should that look like?

The obvious choices would be either to add a new protocol, e.g.,
__scipy_module__ (but then NumPy needs to know about SciPy), or to add
some sort of "module request" parameter to np.get_array_module() to
indicate the requested API, e.g.,
np.get_array_module(*arrays, matching='scipy'). This is pretty similar to
the "default" argument, but it would need to get passed into the
__array_module__ protocol, too.

> - Some models have several possible optimization algorithms, some of
>   which are pure numpy and some of which are Cython. If someone provides
>   a different array module, we might want to choose an algorithm that is
>   actually supported by that module. While this exact issue is maybe
>   sklearn-specific, a similar issue could appear for most downstream libs
>   that use Cython in some places. Many Cython algorithms could be
>   implemented in pure numpy with a potential slowdown, but once we have
>   NEP 37 there might be a benefit to having a pure NumPy implementation
>   as an alternative code path.
>
> Anyway, NEP 37 seems a great step in the right direction and would enable
> sklearn to actually dispatch in some places. Dispatching just based on
> __array_function__ seems not really feasible so far.
>
> Best,
> Andreas Mueller
>
> On 1/6/20 11:29 PM, Stephan Hoyer wrote:
> [full text of NEP 37 snipped]
On Thu, 2020-02-06 at 09:35 -0800, Stephan Hoyer wrote:
Hmmm, in NumPy we can easily force basically 100% of (desired) coverage, i.e. JAX can return a namespace that implements everything. With SciPy that is already much less feasible, and as you go to domain specific tools it seems implausible. `get_array_module` solves the issue of a library that wants to support all array-likes, as long as:

* most functions rely only on the NumPy API
* the domain specific library is expected to implement support for specific array objects if necessary. E.g. sklearn can include special code for Dask support. Dask does not replace sklearn code.
I suppose the question is here, where should the code reside? For SciPy, I agree there is a good reason why you may want to "reverse" the implementation: the code to support JAX arrays should live inside JAX. One, probably silly, option is to return a "global" namespace, so that::

    np = get_array_module(*arrays).numpy

We have two distinct issues: Where should e.g. SciPy put a generic implementation (assuming they want to provide implementations that only require NumPy-API support, so that overriding is not required)? And, if a library provides generic support, should we define a standard for how the context/namespace may be passed in/provided? sklearn's main namespace is expected to support many array objects/types, but it could be nice to pass in an already known context/namespace (say scikit-image already found it, and then calls scikit-learn internally). A "generic" namespace may even require this to infer the correct output array object. (A rough sketch of what passing a namespace around could look like follows below.)

Another thing about backward compatibility: What is our vision there actually? This NEP will *not* give the *end user* the option to opt-in! Here, opt-in is really reserved to the *library user* (e.g. sklearn). (I did not realize this clearly before.) Thinking about that for a bit now, that seems like the right choice. But it also means that the library requires an easy way of giving a FutureWarning, to notify the end-user of the upcoming change. The end-user will easily be able to convert to a NumPy array to keep the old behaviour. Once this warning is given (maybe during `get_array_module()`), the array module object/context would preferably be passed around, hopefully even between libraries. That provides a reasonable way to opt-in to the new behaviour without a warning (mainly for library users; end-users can silence the warning if they wish).

- Sebastian
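For concreteness, a minimal sketch of a library function that can either dispatch itself or reuse a namespace the caller already resolved (names are illustrative and assume the NEP's proposed `get_array_module` helper)::

    import numpy as np

    def center(x, module=None):
        # Hypothetical library function: dispatch once via the NEP 37
        # helper, or reuse a namespace passed in by the caller (e.g.,
        # scikit-image forwarding the module it already found).
        if module is None:
            module = np.get_array_module(x)
        x = module.asarray(x)
        return x - module.mean(x)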
On Thu, Feb 6, 2020 at 12:20 PM Sebastian Berg <sebastian@sipsolutions.net> wrote:
My main concern with a "global namespace" is that it adds boilerplate to the typical usage of fetching a duck-array version of NumPy.

I think the simplest proposal is to add a "module" argument to both get_array_module and __array_module__, with a default value of "numpy". This adds flexibility with minimal additional complexity. The main question is what the type of arguments for "module" should be:

1. Modules could be specified as strings, e.g., "numpy"
2. Modules could be specified as actual namespaces, e.g., numpy from `import numpy`.

The advantage of (1) is that in theory you could write np.get_array_module(*arrays, module='scipy.linalg') without the overhead of actually importing scipy.linalg, or without even needing scipy to be installed, if all the arrays use a different scipy.linalg implementation. But in practice, this seems a little far-fetched. All alternative implementations of scipy that I know of (e.g., in JAX or conceivably in Dask) import the original library.

The main downside of (1) is that it would mean that NumPy's ndarray.__array_module__ would need to use importlib.import_module() to dynamically import modules. It also adds a potentially awkward asymmetry between the "module" and "default" arguments, unless we also switched "default" to specify modules with strings.

Either way, the "default" argument will probably need to be adjusted so that by default it matches whatever value is passed into "module", instead of always defaulting to "numpy".

Any thoughts on which of these options makes most sense? We could also put off making any changes to the protocol now, but this change seems pretty safe and appears to have real use-cases (e.g., for sklearn), so I am inclined to go ahead with it now before finalizing the NEP.
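For illustration, under option (1) NumPy's ndarray.__array_module__ might look roughly like the following sketch (the signature and names are not settled API; this is just what the string-based dispatch could imply)::

    import importlib
    import numpy as np

    def ndarray_array_module(self, types, module="numpy"):
        # Sketch of ndarray.__array_module__ under option (1): only claim
        # the dispatch when every participating type is a plain ndarray,
        # then resolve the requested module name dynamically.
        if not all(issubclass(t, np.ndarray) for t in types):
            return NotImplemented
        return importlib.import_module(module)

With this, `np.get_array_module(x, module="scipy.linalg")` would return the real `scipy.linalg` for plain NumPy arrays, at the cost of a dynamic import.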
I don't think NumPy needs to do anything about warnings. It is straightforward for libraries that want to use get_array_module() to issue their own warnings before calling get_array_module(), if desired. Or alternatively, if a library is about to add a new __array_module__ method, it is straightforward to issue a warning inside the new __array_module__ method before returning the NumPy functions.
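A sketch of that second approach, for a hypothetical duck array adding the method (the class name and warning text are made up)::

    import warnings
    import numpy as np

    class MyDuckArray:
        def __array_module__(self, types):
            # Hypothetical transition shim: warn that dispatch behaviour
            # will change, then fall back to plain NumPy for now.
            warnings.warn(
                "NumPy functions called via get_array_module() on "
                "MyDuckArray will return MyDuckArray instead of "
                "np.ndarray in a future release",
                FutureWarning,
            )
            return np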
On Sun, Feb 23, 2020 at 3:31 PM Stephan Hoyer <shoyer@gmail.com> wrote:
I don't think this is quite enough. Sebastian points out a fairly important issue. One of the main rationales for the whole NEP, and the argument in multiple places ( https://numpy.org/neps/nep-0037-array-module.html#opt-in-vs-opt-out-for-user...), is that it's now opt-in while __array_function__ was opt-out. This isn't really true - the problem is simply *moved*, from the duck array libraries to the array-consuming libraries. The end user will still see the backwards incompatible change, with no way to turn it off. It will be easier with __array_module__ to warn users, but this should be expanded on in the NEP.

Also, I'm still not sure I agree with the tone of the discussion on this topic. It's very heavily inspired by what the JAX devs are telling you (the NEP still says PyTorch and scipy.sparse as well, but that's no longer true for either of them). If you ask Dask and CuPy, for example, they're quite happy with __array_function__ and there haven't been many complaints about backwards compat breakage.

Cheers, Ralf
On Sun, Feb 23, 2020 at 3:59 PM Ralf Gommers <ralf.gommers@gmail.com> wrote:
Ralf, thanks for sharing your thoughts. I'm not quite sure I understand the concerns about backwards incompatibility:

1. The intention is that implementing a __array_module__ method should be backwards compatible with all current uses of NumPy. This satisfies backwards compatibility concerns for an array-implementing library like JAX.
2. In contrast, calling get_array_module() offers no guarantees about backwards compatibility. This seems nearly impossible, because the entire point of the protocol is to make it possible to opt-in to new behavior. So backwards compatibility isn't solved for Scikit-Learn switching to use get_array_module(), and after Scikit-Learn does so, adding __array_module__ to new types of arrays could potentially have backwards incompatible consequences for Scikit-Learn (unless sklearn uses default=None).

Are you suggesting just adding something like what I'm writing here into the NEP? Perhaps along with advice to consider issuing warnings inside __array_module__ and falling back to legacy behavior when first implementing it on a new type?

We could also potentially make a few changes to make backwards compatibility even easier, by making the protocol less aggressive about assuming that NumPy is a safe fallback. Some non-exclusive options:

a. We could switch the default value of "default" on get_array_module() to None, so an exception is raised if nothing implements __array_module__ (sketched after this message).
b. We could include *all* argument types in "types", not just types that implement __array_module__. NumPy's ndarray.__array_module__ could then recognize and refuse to return an implementation if there are other arguments that might implement __array_module__ in the future (e.g., anything outside the standard library?).

The downside of making either of these choices is that it would potentially make get_array_module() a bit less usable, because it is more likely to fail, e.g., if called on a float, or some custom type that should be treated as a scalar.

> Also, I'm still not sure I agree with the tone of the discussion on this topic.
I'm linking to comments you wrote in reference to PyTorch and scipy.sparse in the current draft of the NEP, so I certainly want to make sure that you agree with my characterization :).

Would it be fair to say:

- JAX is reluctant to implement __array_function__ because of concerns about breaking existing code. JAX developers think that when users use NumPy functions on JAX arrays, they are explicitly choosing to convert from JAX to NumPy. This model is fundamentally incompatible with __array_function__, which we chose to override the existing numpy namespace.
- PyTorch and scipy.sparse are not yet in a position to implement __array_function__ (due to the lack of a direct implementation of NumPy's API), but these projects take backwards compatibility seriously.

Does "take backwards compatibility seriously" sound about right to you? I'm very open to specific suggestions here. (TensorFlow could probably also be safely added to this second list.)

Best, Stephan
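Coming back to option (a) above, a hypothetical sketch of the behaviour change (the exact exception type is illustrative, not settled)::

    import numpy as np

    # Under the current draft, plain NumPy is a silent fallback, e.g.
    # for scalars that implement no __array_module__:
    module = np.get_array_module(3.0)              # returns numpy

    # Under option (a), the fallback must be requested explicitly:
    module = np.get_array_module(3.0, default=np)  # returns numpy

    try:
        np.get_array_module(3.0)  # default=None: nothing implements it
    except TypeError:
        print("no __array_module__ implementation found")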
On Sun, 2020-02-23 at 22:44 -0800, Stephan Hoyer wrote:
Just to be clear, the way scikit-learn would probably be handling backward compatibility concerns is by adding it to their configuration context manager, see: https://github.com/scikit-learn/scikit-learn/pull/16574 So the backward compat is in a sense solved (but there are project specific context managers involved, which is not perfect maybe, but OK).

I am willing to consider pushing this off into its own namespace (and package, preferably in the NumPy org though) if necessary, the idea being that we keep it super minimal, and expand it as we go to keep up with scikit-learn needs. Possibly even with a function registration approach, so that you could have import-time checks on function availability and signature mismatches more easily. I still do not like the idea of context managers much though; I think I prefer the returned (bound) namespace a lot.

Also I think we should *not* do implicit dispatching. Consider this case::

    def numpy_only(x):
        x = np.asarray(x)
        return x + _helper(len(x))

    def generic(x):
        module = np.get_array_module(x)
        x = module.asarray(x)
        return x + _helper(len(x), module=module)

    def _helper(n, module=np):
        return module.random.uniform(size=n)

If you try to make the above work with context managers, you _still_ need to pass in the module to _helper [1], because otherwise you would have to change the `numpy_only` function to ensure an outside context does not change its behaviour.

- Sebastian

[1] If "module" had a `module.set_backend()` and was a global instead, `_helper` using the global module would do the wrong thing for `numpy_only`. This is of course also a bit of an issue with the sklearn context manager as well, but it seems to me _much_ less so, and probably not if most libraries slowly switch over and currently use `np.asarray`.
On Sun, 2020-02-23 at 22:44 -0800, Stephan Hoyer wrote:
I think that should be sufficient, personally. We could mention that scikit-learn will likely use a context manager to do this. We can also think about providing a global default (which sklearn can use as its own default if they wish so, but that is reserved to the end-user). That would be a small amendment, and I think we could add it even after accepting the NEP as it is.
I am not sure that I feel switching the default to None makes much of a difference to be honest. Unless we use it to signal a super strict mode similar to b. below.
That is a good point: anything that NumPy does not recognize could simply be rejected. It does mean that you have to call `module.asarray()` manually more often though. For `list`, it could also make sense to just add np.ndarray to the types. If we want to be conservative, maybe we could also just error before calling `__array_module__`: whenever there is something that we do not know how to interpret, force the user to clarify?
Right, although we could relax it later if it seems overly annoying.
This will need input from Ralf; my personal main concern is backward compatibility in libraries: I am pretty sure sklearn would only use a potential `np.asduckarray` when the user opted in. But in that case my personal feeling is that the `get_array_module` solution is cleaner and makes it easier to expand functionality slowly (for libraries).

Two other points:

First, I am wondering if we should add something like a `__qualname__` to the contract. I.e. a returned module must have a well defined `module.__name__` (that is usually already correct), so that sklearn could do::

    module = np.get_array_module(*arrays)
    if module.__name__ not in ("numpy", "sparse"):
        raise TypeError("Currently only numpy and sparse are supported")

if they wish (that is trivial, but if you return a class acting as a module it may be important).

Second, we have to make progress on whether or not the "restricted" namespace idea should have priority. My personal opinion tends strongly towards no. The NumPy version should normally be older than the other libraries, and if NumPy updates the API, so do the downstream implementers. E.g. dask may have to provide multiple versions of the same function depending on the installed NumPy version, but that seems OK to me? It is just as downstream libraries currently have to support multiple NumPy versions. We could add a contract that the first time `get_array_module` is used to e.g. get the dask namespace and the NumPy version is too new, a warning should be given.

The practical thing seems to me that we ignore this for the moment (as something we can do later on)? If there is missing API, in most cases an AttributeError will be raised, which could provide some additional information to the user. The only alternative seems to be the complete opposite: create a new module, and make even NumPy only one of the implementers of that new (restricted) module. That may be cleaner, but I fear that it is impractical to be honest.

I will put this on the agenda for tomorrow, even if we discuss it only very briefly. My feeling (and hope) is that we are nearing a point where we can make a final decision.

Best, Sebastian
On Wed, Mar 4, 2020 at 1:22 AM Sebastian Berg <sebastian@sipsolutions.net> wrote:
Sorry, this never made it back to the top of my todo list.
Indeed, it is nearly impossible. Except if there's a context manager or some other control mechanism exposed to the end user. Hence that should be a part of the design, I think. Otherwise you're just solving something for the JAX devs, but not for the scikit-learn/scipy/etc. devs, who will then each have to reinvent the wheel for backwards compat.

> So backwards compatibility isn't solved for Scikit-Learn
+1

> That would be a small amendment, and I think we could add it even after
I agree, that doesn't make a difference.
Interesting point. Not accepting sequences could be considered here. It may help a lot with robustness and typing to only accept ndarray, other objects with __array__, and scalars.
agreed
True. I would say though that scipy.sparse will never implement either __array_function__ or __array_module__ due to semantic incompatibilities (it acts like np.matrix). So it's kind of irrelevant. And if PyTorch gets around to adding a numpy-compatible API, they're fine with __array_function__.
I think it's quite important, and __array_module__ gives a chance to introduce it. However, it's not ready, so I'd say that if the __array_module__ implementation is ready and there's no well-defined restricted API proposal (I expect to have that in August), then we can move ahead without it.

> The NumPy version should normally be older than other libraries, and if
That seems unworkable, and I don't think any libraries do this. Coupling the semantics of a single Dask function to the installed numpy version is odd.

> It is just as downstream libraries currently have to support multiple
I think we can't solve this until we have a well-defined API, which is the restricted API + API versioning. Until then it just remains with the current status: compatibility is implementation-defined. Cheers, Ralf
On Thu, 2020-04-09 at 13:52 +0200, Ralf Gommers wrote:
Is it all that odd? Libraries (not array providers) already need to test for the NumPy version occasionally due to API changes, so they also have two versions of the same thing around (e.g. a fallback). This would simply move the burden to the array-object implementer to some degree.

Assume that we have a versioned API in some form or another; it seems to me we either require::

    module = np.get_array_module(..., api_version=2)

or define `module.__api_version__`. Where the latter means that sklearn/SciPy may have to check `__api_version__` on every function call (a rough sketch of such a check follows below), while currently such checks usually happen at import time. On the other hand, the former means that sklearn/scipy can only easily opt-in to new API after 3+ years? Saying that the NumPy version is what pins the api-version is not much more than assuming/requiring that NumPy will be the least up-to-date package?

Of course it is unworkable to get 100% right in practice, but are you saying that because it seems like an impractical approach, or because the API surface is currently so large that, of course, we will never get it 100% right (but that is generally true, nobody will be able to implement NumPy 100% compatibly)?

`__array_function__` has the same issue? If we change our API, Dask has to catch up. If SciPy expects it to be the old version though (based on the NumPy import) it will incorrectly assume the old API will be used.

- Sebastian
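As a sketch of that per-call check (all names hypothetical; `__api_version__` is just a strawman attribute from this discussion)::

    import numpy as np

    def solve(a, b):
        module = np.get_array_module(a, b)
        # Hypothetical per-call version gate; modules that predate
        # versioning are assumed to provide version 1.
        api_version = getattr(module, "__api_version__", 1)
        if api_version < 1:
            raise RuntimeError("solve() needs array-API version >= 1")
        return module.linalg.solve(module.asarray(a), module.asarray(b))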
On Thu, Apr 9, 2020 at 6:54 PM Sebastian Berg <sebastian@sipsolutions.net> wrote:
That's completely different, it's internal to a library and not visible to end users via different signatures/behavior.

> This simply would move the burden to the array-object implementer to
Yes, this is the version I was thinking about.
That's anyway the case; it has very little to do with API versioning I think - it's simply determined by the minimum NumPy version supported.
Yes this: impractical and undesired.

> or because
That's true too, we *don't want* anyone to start adding compat features for outdated or "wish we could deprecate" NumPy features.
> `__array_function__` has the same issue? If we change our API, Dask has to catch up.
Yes, that's true. The restricted API should be more stable than the whole NumPy API, otherwise no one will be able to be fully compatible.

> If SciPy expects it to be the old version though (based on the NumPy import) it will incorrectly assume the old-api will be used.
That's not incorrect unless it's a backwards-incompatible change, which should be rare. Cheers, Ralf
On 2/23/20 6:59 PM, Ralf Gommers wrote:
Might it be possible to flip this NEP back to opt-out while keeping the nice simplifications and configurable array-creation routines, relative to __array_function__?

That is, what if we define two modules, "numpy" and "numpy_strict". "numpy_strict" would raise an exception on duck-arrays defining __array_module__ (as numpy currently does). "numpy" would be a wrapper around "numpy_strict" that decorates all numpy methods with a call to "get_array_module(inputs).func(inputs)", roughly like the sketch below. Then end-user code that did "import numpy as np" would accept ducktypes by default, while library developers who want to signal they don't support ducktypes can opt out by doing "import numpy_strict as np". Issues with `np.asarray` seem mitigated compared to __array_function__, since that method would now be ducktype-aware.

Cheers, -Allan
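A minimal sketch of such a wrapper (assuming the hypothetical `numpy_strict` module from this proposal also exposes the NEP's `get_array_module` helper)::

    import functools
    import numpy_strict  # hypothetical strict module from this proposal

    def _dispatching(name):
        # Wrap a strict function so each call dispatches to the namespace
        # chosen by the input arrays.
        strict_func = getattr(numpy_strict, name)

        @functools.wraps(strict_func)
        def wrapper(*args, **kwargs):
            # Simplification: dispatch on all positional arguments.
            module = numpy_strict.get_array_module(*args)
            return getattr(module, name)(*args, **kwargs)

        return wrapper

    # The wrapping "numpy" module would expose one such wrapper per function:
    concatenate = _dispatching("concatenate")
    stack = _dispatching("stack")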
On Fri, 2020-02-28 at 11:28 -0500, Allan Haldane wrote:
This would be possible, but I think we strongly leaned against the idea. Basically, if you have to opt out, from a library perspective there may be `np.asarray` calls which, for example, later call into C and expect arrays. So I have large doubts that an opt-out solution works easily for library authors. `__array_function__` is opt-out, but effectively most clean library code already opted out... We had previously discussed the opposite, of having a namespace of implicit dispatching based on get_array_module, but if we keep `__array_function__` around, I am not sure there is much reason for it.
My tendency is that if we want to go there, we would need to push ahead with the `np.duckarray()` idea instead. To be clear: I currently very much prefer the get_array_module() idea. It just seems much cleaner for library authors, and they are the primary issue at the moment in my opinion. - Sebastian
On Wed, 2020-04-08 at 17:04 -0400, Andreas Mueller wrote:
Hey, thanks for the ping. Things are a bit stuck right now. I think what we need is some clarity on the implications and alternatives. I was thinking about organizing a small conference call with the main people interested in the next weeks. There are also still some alternatives to this NEP in the race, and we may need to clarify which ones actually remain in the race...

Maybe to see some of the possible sticking points:

1. What do we do about SciPy, have it under this umbrella? And how would we want to design that?
2. Context managers have some composition issues, maybe less so if they are in the downstream package. Or should we have global defaults as well?
3. How do we ensure safe transitions for users as much as possible?
   * If you use this, can functions suddenly return a different type in the future?
   * Should we force you to cast to NumPy arrays in a transition period, or force you to somehow silence a transition warning?
4. Is there a serious push to have a "reduced" API or even a versioned API?

But I am probably forgetting some other things. In my personal opinion, I think NEP 37 with minor modifications is still the best duck in the race. I feel we should be able to find a reasonable solution for SciPy. Point 2 about context managers may be true, but this is much smaller in scope than the ones uarray proposed IIRC, and I could not figure out major scoping issues with it yet (the sklearn draft).

About the safe transition, that may be the stickiest point. But e.g. if you enable `get_array_module`, sklearn could limit a certain function to error out if it finds something other than NumPy? The main problem is how to opt-in to future behaviour. A context manager can do that, although the danger is that someone just uses it everywhere...

On the reduced/versioned API front, I would hope that we can defer that as a semi-orthogonal issue, basically saying that for now you have to provide a NumPy API that faithfully reproduces whatever NumPy version is installed on the system.

Cheers, Sebastian
On Thu, Apr 9, 2020 at 12:02 AM Sebastian Berg <sebastian@sipsolutions.net> wrote:
Current feeling: best to ignore it for now. It's quite a bit of work to fix API incompatibilities for linalg that no one currently seems interested in tackling. We can revisit once that's done.
+1 for adding this right next to get_array_module().
There is, it'll take a few months.
I think it would be nice to have a separate NEP 37 implementation outside of NumPy to play with. Unlike __array_function__, I don't think it has to go into NumPy immediately. This avoids the whole "experimental API" issue, and it would be quite useful to test this with, e.g., CuPy + scikit-learn without being stuck with any decisions in a released NumPy version. It also makes switching on/off very easy for users: just (don't) `pip install numpy-array-module`. Cheers, Ralf
On Thu, 2020-04-09 at 13:52 +0200, Ralf Gommers wrote:
<snip>
Fair enough, I have created a hopefully working start here: https://github.com/seberg/numpy_dispatch (this is not tested much at all yet, so it could be very buggy).

There are a couple of additional features that I added:

1. A global opt-in (it is impossible to opt-out once opted in!)
2. A local opt-in (to guarantee opt-in if the global flag is not set)
3. I added features to allow transitioning::

       get_array_module(*arrays, modules="numpy",
                        future_modules=("dask.array", "cupy"),
                        fallback="warn")

   This will give a FutureWarning/DeprecationWarning where necessary. In the above, "numpy" is supported; dask and cupy are supported but not enabled by default. `None` works to say "all modules". Once the transition is done, just move dask and cupy into `modules` and remove `fallback="warn"`.
4. If there are FutureWarnings/DeprecationWarnings, the user needs to be able to opt-in to future behaviour. Opting out can be done by casting inputs. Opting in is done using::

       with future_dispatch_behavior():
           call_library_function()

Obviously, we may not want these features, but I was curious how we could provide the tools to allow clean transitions. Both context managers should be thread-safe, but I did not test that.

The best try would probably be cupy and sklearn again, so I will give a ping on the sklearn PR. To make that easier, I tried to hack a bit of a "util" to allow testing (please scroll down on the readme on github).

Best, Sebastian
On Thu, 2020-04-09 at 22:11 -0500, Sebastian Berg wrote:
There is no immediate need to put modules and future_modules and fallback in there. The main convenience it gives is that we can more easily provide the user with an opt-in context manager for the new behaviour. Without that, libraries will have to do these checks themselves; that is not difficult. But if we wish to provide a context manager to opt all of that in, the library will need additional API to query our context manager state. Or every library needs their own solution, which does not seem desirable (although it means you cannot accidentally opt internal functions into newer behaviour).

- Sebastian
On Fri, Apr 10, 2020 at 5:17 AM Sebastian Berg <sebastian@sipsolutions.net> wrote:
Thanks!
So future_modules explicitly excludes compatible libraries that are not listed. Why would you want anyone to do that? I don't understand "supported but not enabled", and it looks undesirable to me to special-case any library in this mechanism.

Cheers, Ralf

> 4. If there are FutureWarnings/DeprecationWarnings, the user needs to be
On Fri, 2020-04-10 at 12:27 +0200, Ralf Gommers wrote:
We have two (or three) types of modules (either could be "all"):

1. Supported modules that we dispatch to.
2. Modules that are supported but will be dispatched to by default only in the future. So if the user got a future-module, they will get a FutureWarning. They have to choose to cast the inputs or opt-in to the future behaviour.
3. Unsupported modules: if this is resolved, it is an error. I currently assume that this does not need to be a negative list.

You need to distinguish those somehow, since you need a way to transition. Even if you expect that modules would always be *all* modules, `numpy` is still the only accepted module originally. So, as I said, `future_modules` is only about transitioning and enabling `FutureWarning`s. It does not have to live there, but we need a way to transition. These options do not have to be handled by us; it only helps here with having context managers to opt-in to new behaviour, and maybe to get an idea of how transitions can look. Alternatively, we could let projects create their own project-specific context managers to do the same and avoid possible scoping issues even more.

- Sebastian
On Fri, Apr 10, 2020 at 3:03 PM Sebastian Berg <sebastian@sipsolutions.net> wrote:
I think we only have modules that implement __array_module__, and ones that don't.
Sorry, I still don't get it - transition what? You seem to be operating on the assumption that the users of get_array_module want or need to control which numpy-like libraries they allow and which they don't. That seems fundamentally wrong. How would you treat, for example, an array library that is developed privately inside some company? Cheers, Ralf
On Fri, 2020-04-10 at 18:19 +0200, Ralf Gommers wrote:
Well, you still need to transition from NumPy -> allow everything, so for now please just ignore that part if you like and use/assume::

    get_array_module(..., modules="numpy", future_modules=None, fallback="warn")

during the transition, and::

    get_array_module(...)

after it. After all, this is a draft project right now, so it is just as much about trying out what can be done. It is not unlikely that this transition burden will be put more on the library in any case, but it shows that it can be done.

As to my "fundamentally wrong" assumption: Should libraries' goal be to support everything? Definitely! But... I do not want to make that decision for libraries, so if library authors tell me that they have no interest in it, all the better. Until then I am more than happy to keep that option on the table, if just as a thought for library authors to consider their options. Possible, brainstorming, reasons could be:

1. Say I currently heavily use cython code, so I am limited to NumPy (or at least arrays that can expose a buffer/`__array_interface__`). Now if someone adds a CUDA implementation, I would support cupy arrays, but not distributed arrays. I admit maybe checking that at function entry like this is the wrong approach there.
2. To limit to certain types is to say "We know (and test) that our library works with xarray, Dask, NumPy, and CuPy". Now you can say that is also a misconception, because if you stick to just the NumPy API you should know that it will "just work" with everything. But in practice it seems like it might happen? In that case you may want to actually allow any odd array and just put a warning, a bit like the transition warnings I put in for testing.

---

There are two other things I am wondering about:

1. Subclasses may want to return their superclass's module (even by default?), in which case their behaviour depends on the superclass module behaviour (a small sketch follows below). Further, a library would need to use `np.asanyarray()` to prevent the subclass from being cast to the superclass.
2. There is one transition that does not quite exist. What if an array-like starts implementing or expands `__array_module__`? That seems fine, but in that case the array-like will have to provide the opt-in context manager with a FutureWarning. The transition from no `__array_module__` to implementing it may need some thought, but I expect it is fine: the array-like simply always gives a FutureWarning, although it cannot know what will actually happen in the future (no change, error, or array-like takes control).

- Sebastian
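A rough illustration of the subclass point (purely hypothetical; the class name is made up)::

    import numpy as np

    class MyArray(np.ndarray):
        def __array_module__(self, types):
            # Defer to the superclass's module, i.e. plain NumPy. Functions
            # obtained this way may silently drop the MyArray subclass
            # unless the library uses np.asanyarray() instead of asarray().
            return np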
participants (6)
- Allan Haldane
- Andreas Mueller
- Eric Wieser
- Ralf Gommers
- Sebastian Berg
- Stephan Hoyer