[Numpy-discussion] new NEP: np.AbstractArray and np.asabstractarray

Thu Mar 22 06:35:46 EDT 2018

I think that with your comments in mind, it may just be best to embrace
duck typing, like Matthew suggested. I propose the following workflow:

   - __array_concatenate__ and similar "protocol" functions return
   NotImplemented if they won't work.
   - "Base functions" that can be called directly like __getitem__ raise
   NotImplementedError if they won't work.
   - Duck-array classes set __arrayish__ = True.

Then, something like np.concatenate would do the following:

   - Call __array_concatenate__ on each argument in turn, following the
   same order as for ufunc arguments.
   - If everything fails, raise NotImplementedError (or convert everything
   to ndarray).
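
As a rough sketch of the two steps above (not NumPy code; the
concatenate function and the DuckList class below are made up for
illustration, and plain left-to-right order stands in for the ufunc
ordering), the dispatch could look like this. It takes the "raise
NotImplementedError" branch rather than converting to ndarray:

```python
class DuckList:
    """Toy duck array backed by a Python list (hypothetical, for illustration)."""

    __arrayish__ = True

    def __init__(self, items):
        self.items = list(items)

    def __array_concatenate__(self, arrays, axis=0):
        # Protocol function: decline by *returning* NotImplemented.
        if axis != 0 or not all(isinstance(a, DuckList) for a in arrays):
            return NotImplemented
        return DuckList(item for a in arrays for item in a.items)


def concatenate(arrays, axis=0):
    """Hypothetical dispatcher: ask each argument in turn, left to right."""
    arrays = list(arrays)
    for arr in arrays:
        method = getattr(type(arr), "__array_concatenate__", None)
        if method is None:
            continue  # this argument does not implement the protocol
        result = method(arr, arrays, axis=axis)
        if result is not NotImplemented:
            return result
    # Everything declined (or nothing implemented the protocol).
    raise NotImplementedError("no argument could handle concatenate")
```

Here concatenate([DuckList([1, 2]), DuckList([3])]) is handled by the
first argument, while mixing a DuckList with a plain list makes every
candidate decline and ends in NotImplementedError.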

Overloaded functions would do something like this (perhaps a simple
decorator will do for the repetitive work?):

   - Try with np.arrayish
   - Catch NotImplementedError
      - Try with np.array
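
A minimal sketch of such a decorator, under some assumptions:
np.arrayish does not exist, so "try with np.arrayish" is modelled here
as simply calling the function on the objects unchanged; ndarray_fallback,
NoSlicing and first_row are hypothetical names; and only positional
arguments are converted on the fallback path.

```python
import functools

import numpy as np


def ndarray_fallback(func):
    """Try func on the duck arrays as given; on NotImplementedError retry with ndarrays."""

    @functools.wraps(func)
    def wrapper(*arrays, **kwargs):
        try:
            return func(*arrays, **kwargs)
        except NotImplementedError:
            # Fallback: coerce every positional argument to ndarray.
            converted = [np.asarray(a) for a in arrays]
            return func(*converted, **kwargs)

    return wrapper


class NoSlicing:
    """Toy duck array whose __getitem__ ("base function") refuses to work."""

    __arrayish__ = True

    def __init__(self, data):
        self.data = np.asarray(data)

    def __getitem__(self, index):
        raise NotImplementedError("indexing is not supported")

    def __array__(self, dtype=None, copy=None):
        # Lets np.asarray() convert this object on the fallback path.
        return self.data if dtype is None else self.data.astype(dtype)


@ndarray_fallback
def first_row(arr):
    return arr[0]
```

Calling first_row(NoSlicing([[1, 2], [3, 4]])) fails on the duck path and
then succeeds on the converted ndarray. One caveat with this design: a
NotImplementedError raised for an unrelated reason deep inside func would
also trigger the fallback.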

Then, we use abstract classes just to overload functionality or implement
things in terms of others. If something fails, we have a decent fallback.
We don't need to do anything special in order to "check" functionality.

Feel free to propose changes, but this is the best I could come up with
that would require the smallest incremental changes to Numpy while also
supporting everything right from the start.

On Thu, Mar 22, 2018 at 9:14 AM, Nathaniel Smith <njs at pobox.com> wrote:

> On Sat, Mar 10, 2018 at 4:27 AM, Matthew Rocklin <mrocklin at gmail.com>
> wrote:
> > I'm very glad to see this discussion.
> >
> > I think that coming up with a single definition of array-like may be
> > difficult, and that we might end up wanting to embrace duck typing
> > instead.
> >
> > It seems to me that different array-like classes will implement different
> > mixtures of features.  It may be difficult to pin down a single definition
> > that includes anything except for the most basic attributes (shape and
> > dtype?).  Consider two extreme cases of restrictive functionality:
> >
> > LinearOperators (support dot in a numpy-like way)
> > Storage objects like h5py (support getitem in a numpy-like way)
> >
> > I can imagine authors of both groups saying that they should qualify as
> > array-like because downstream projects that consume them should not convert
> > them to numpy arrays in important contexts.
>
> I think this is an important point -- there are a lot of subtleties in
> the interfaces that different objects might want to provide. Some
> interesting ones that haven't been mentioned:
>
> - a "duck array" that has everything except fancy indexing
> - xarray's arrays are just like numpy arrays in most ways, but they
> have incompatible broadcasting semantics
> - immutable vs. mutable arrays
>
> When faced with this kind of situation, it's always tempting to try to
> write down some classification system to capture every possible
> configuration of interesting behavior. In fact, this is one of the
> most classic nerd snipes; it's been catching people for literally
> thousands of years [1]. Most of these attempts fail though :-).
>
> So let's back up -- I probably erred in not making this more clear in
> the NEP, but I actually have a fairly concrete use case in mind here.
> What happened is, I started working on a NEP for
> __array_concatenate__, and my thought pattern went as follows:
>
> 1) Cool, this should work for np.concatenate.
> 2) But what about all the other variants, like np.row_stack? We don't
> want __array_row_stack__; we want to express row_stack in terms of
> concatenate.
> 3) Ok, what's row_stack? It's:
>   np.concatenate([np.atleast_2d(arr) for arr in arrs], axis=0)
> 4) So I need to make atleast_2d work on duck arrays. What's
> atleast_2d? It's: asarray + some shape checks and indexing with
> newaxis
> 5) Okay, so I need something atleast_2d can call instead of asarray [2].
>
> And this kind of pattern shows up everywhere inside numpy, e.g. it's
> the first thing inside lots of functions in np.linalg b/c they do some
> futzing with dtypes and shape before delegating to ufuncs, it's the
> first thing the mean() function does b/c it needs to check arr.dtype
> before proceeding, etc. etc.
>
> So, we need something we can use in these functions as a first step
> towards unlocking the use of duck arrays in general. But we can't
> realistically go through each of these functions, make an exact list
> of all the operations/attributes it cares about, and then come up with
> exactly the right type constraint for it to impose at the top. And
> these functions aren't generally going to work on LinearOperators or
> h5py datasets anyway.
>
> We also don't want to go through every function in numpy and add new
> arguments to control this coercion behavior.
>
> What we can do, at least to start, is to have a mechanism that passes
> through objects that aspire to be "complete" duck arrays, like dask
> arrays or sparse arrays or astropy's unit arrays, and then if it turns
> out that in practice people find uses for finer-grained distinctions,
> we can iteratively add those as a second pass. Notice that if a
> function starts out requiring a "complete" duck array, and then later
> relaxes that to accept "partial" duck arrays, that's actually
> increasing the domain of objects that it can act on, so it's a
> backwards-compatible change that we can do later.
>
> So I think we should start out with a concept of "duck array" that's
> fairly strong but a bit vague on the exact details (e.g.,
> dask.array.Array is currently missing some weird things like arr.ptp()
> and arr.tolist(), I guess because no-one has ever noticed or cared?).
>
> ------------
>
> Thinking things through like this, I also realized that this proposal
> jumps through hoops to avoid changing np.asarray itself, because I was
> nervous about changing the rule that its output is always an
> ndarray... but actually, this is currently the rule for most functions
> in numpy, and the whole point of this proposal is to relax that rule
> for most functions, in cases where the user is explicitly passing in a
> duck-array object. So maybe I'm being overparanoid? I'm genuinely
> unsure here.
>
> Instead of messing about with ABCs, an alternative mechanism would be
> to add a new method __arrayish__ (hat tip to Tom Caswell for the name
> :-)), that essentially acts as an override for Python-level calls to
> np.array / np.asarray, in much the same way that __array_ufunc__
> overrides ufuncs, etc. (C level calls to PyArray_FromAny and similar
> would of course continue to return ndarray objects, and I assume we'd
> add some argument like require_ndarray= that you could pass to
> explicitly indicate whether you needed C-level compatibility.)
>
> This would also allow objects like h5py datasets to *produce* an
> arrayish object on demand, even if they aren't one themselves. (E.g.,
> imagine some hdf5-like storage that holds sparse arrays instead of
> regular arrays.)
>
> I'm thinking I may write this option up as a second NEP, to compete
> with my first one.
>
> -n
>
> [1] See: https://www.wiley.com/en-us/The+Search+for+the+Perfect+Language-p-9780631205104
> [2] Actually atleast_2d calls asanyarray, not asarray, but that's just
> a detail; the way to solve this problem for asanyarray is to first
> solve it for asarray.
>
> --
> Nathaniel J. Smith -- https://vorpus.org