
On Sun, Aug 30, 2015 at 9:12 PM, Marten van Kerkwijk <m.h.vankerkwijk@gmail.com> wrote:
> Hi Nathaniel, others,
> I read the discussion of plans with interest. One item that struck me is that while there are great plans to have a proper extensible and presumably subclassable dtype, it is discouraged to subclass ndarray itself (rather, it is encouraged to use a broader array interface).
>
> From my experience in astropy with Quantity (an ndarray subclass), Time (a separate class containing high-precision times as two float64 ndarrays), and Table (initially holding structured arrays, but now sets of Columns, which themselves are ndarray subclasses), I'm not convinced the broader, new-containers approach is all that preferable. Rather, it leads to a lot of boilerplate code to reimplement things ndarray already does (since one is effectively just calling the methods on the underlying arrays).
> I also think the idea that a dtype becomes something that also contains a unit is a bit odd. Shouldn't a dtype just be about how data is stored? Why include metadata such as units?
> Instead, I think a quantity is most logically seen as numbers with a unit, just as masked arrays are numbers with masks, and variables are numbers with uncertainties. Each of these cases adds extra information in a different form, and all are quite easily thought of as subclasses of ndarray where all operations do the normal operation, plus some extra work to keep the extra information up to date.
The intuition behind the array/dtype split is that an array is just a container: it knows how to shuffle bytes around, be reshaped, indexed, etc., but it knows nothing about the meaning of the items it holds -- as far as it's concerned, each entry is just an opaque binary blob. If it wants to actually do anything with these blobs, it has to ask the dtype for help. The dtype, OTOH, knows how to interpret these blobs and (in cooperation with ufuncs) to perform operations on them, but it doesn't need to know how they're stored, or about slicing or anything like that -- all that's the container's job. Think about it this way: does it make sense to have a sparse array of numbers-with-units? How about a blosc-style compressed array of numbers-with-units? If yes, then numbers-with-units are a special kind of dtype, not a special kind of array.

Another way of getting this intuition: if I have 8 bytes, that could be an int64, or it could be a float64. Which one it is doesn't affect how it's stored at all -- either way it's stored as a chunk of 8 arbitrary bytes. What it affects is how we *interpret* these bytes -- e.g. there is one function called "int64 addition" which takes two 8-byte chunks and returns a new 8-byte chunk as the result, and a second function called "float64 addition" which takes those same two 8-byte chunks and returns a different one. The dtype tells you which of these operations should be used for a particular array.

What's special about a float64-with-units? Well, it's 8 bytes, but the addition operation is different from regular float64 addition: it has to do some extra checks and possibly unit conversions. This is exactly what the ufunc dtype dispatch and casting system is there for. It also solves your problem of having to write lots of boilerplate code, b/c if this is a dtype then you can just use the actual ndarray class directly, without subclassing or anything :-).
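To make the "same bytes, different interpretation" point concrete, here's a quick illustration you can run with plain NumPy today (just the int64/float64 case -- no units involved):

    import numpy as np

    # One 8-byte chunk, stored exactly once; only the interpretation changes.
    x = np.array([1.5], dtype=np.float64)

    as_float = x.view(np.float64)   # read the 8 bytes as a float64
    as_int = x.view(np.int64)       # reinterpret the *same* bytes as an int64

    print(as_float)   # [1.5]
    print(as_int)     # [4609434218613702656]  (the bit pattern of 1.5)

    # The ufunc picks "float64 addition" vs "int64 addition" based purely
    # on the dtype; the container and its bytes are the same either way.
    print(as_float + as_float)   # [3.]
    print(as_int + as_int)       # [9218868437227405312]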
> Anyway, my suggestion would be to *encourage* rather than discourage ndarray subclassing, and help this by making ndarray (even) better.
So, we very much need robust support for objects-that-quack-like-an-array that are *not* ndarrays, because ndarray subclasses are forced to use ndarray-style strided in-memory storage, and there's huge demand for objects that expose an array-like interface but use a different storage strategy underneath: sparse arrays, compressed arrays (like blosc), out-of-core arrays, computed-on-demand arrays (like dask), distributed arrays, etc. etc. And once we have solid support for duck-arrays and for user-defined dtypes (as discussed above), those two things remove a huge amount of the motivation for subclassing ndarray.

At the same time, ndarray subclassing is... nearly unmaintainable, AFAICT. The problem with subclassing is that you're basically taking some interface, making a copy of it, and then monkeypatching the copy. As you would expect, this is intrinsically very fragile, because it breaks abstraction barriers: suddenly things that used to be implementation details -- like which methods are implemented in terms of which other methods -- become part of the public API. And there's never been any coherent, documentable theory of how ndarray subclassing is *supposed* to work, so in practice it's just a bunch of ad hoc hooks designed around the needs of np.matrix and np.ma. We get a regular stream of bug reports asking us to tweak things one way or another, and it feels like trying to cover the floor with a too-small carpet -- we end up with an API that covers the needs of whoever complained most recently.

And then there's the fact that, as far as we can tell, 99% of the people who have ever sat down and tried to subclass ndarray ended up regretting it :-). Seriously, you are literally the only person I've ever heard say positive things about the experience, and I can't really see why, given how often I see you in the bug tracker complaining about some weird breakage :-). So there aren't many people motivated to work on it... If someone has a good plan for how to fix all this then by all means, speak up :-).

But IMO it's better to write some boilerplate that you can control than to import + monkeypatch, even if the latter seems easier in the short run. And there's a lot we can do to reduce that boilerplate -- e.g. when you want to implement a new sequence type in Python, you can write your __getitem__ and __len__ and then use collections.abc.Sequence to fill in the rest of the interface; we've been talking about adding something similar for arrays as part of the __numpy_ufunc__ work (a toy sketch of the stdlib pattern is below my sig).

-n

--
Nathaniel J. Smith -- http://vorpus.org
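P.S. Since I mentioned it above: here's what that collections.abc.Sequence pattern looks like in the stdlib today. The Squares class is just a toy stand-in -- the ndarray-side equivalent of this mixin is the part that doesn't exist yet:

    import collections.abc

    class Squares(collections.abc.Sequence):
        # Implement only the two abstract methods; the Sequence mixin
        # fills in __contains__, __iter__, __reversed__, index, and count.
        def __init__(self, n):
            self._n = n

        def __getitem__(self, i):
            if not 0 <= i < self._n:
                raise IndexError(i)
            return i * i

        def __len__(self):
            return self._n

    s = Squares(5)
    print(list(s))      # [0, 1, 4, 9, 16]
    print(16 in s)      # True -- __contains__ came from the mixin
    print(s.index(9))   # 3    -- index() came from the mixin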