
On Sun, Aug 30, 2015 at 9:12 PM, Marten van Kerkwijk <m.h.vankerkwijk@gmail.com> wrote:
> Hi Nathaniel, others,
> I read the discussion of plans with interest. One item that struck me is that while there are great plans to have a proper extensible and presumably subclassable dtype, it is discouraged to subclass ndarray itself (rather, it is encouraged to use a broader array interface).
>
> From my experience in astropy with Quantity (an ndarray subclass), Time (a separate class containing high-precision times as two float64 ndarrays), and Table (initially holding structured arrays, but now sets of Columns, which themselves are ndarray subclasses), I'm not convinced the broader, new-containers approach is all that preferable. Rather, it leads to a lot of boilerplate code to reimplement things ndarray already does (since one is effectively just calling the methods on the underlying arrays).
> I also think the idea that a dtype becomes something that also contains a unit is a bit odd. Shouldn't a dtype just be about how data is stored? Why include metadata such as units?
> Instead, I think a quantity is most logically seen as numbers with a unit, just as masked arrays are numbers with masks, and variables are numbers with uncertainties. Each of these cases adds extra information in a different form, and all are quite easily thought of as subclasses of ndarray where all operations do the normal operation, plus some extra work to keep the extra information up to date.
The intuition behind the array/dtype split is that an array is just a container: it knows how to shuffle bytes around, be reshaped, indexed, etc., but it knows nothing about the meaning of the items it holds -- as far as it's concerned, each entry is just an opaque binary blob. If it wants to actually do anything with these blobs, it has to ask the dtype for help. The dtype, OTOH, knows how to interpret these blobs and (in cooperation with ufuncs) to perform operations on them, but it doesn't need to know how they're stored, or about slicing or anything like that -- all that's the container's job. Think about it this way: does it make sense to have a sparse array of numbers-with-units? How about a blosc-style compressed array of numbers-with-units? If yes, then numbers-with-units are a special kind of dtype, not a special kind of array.

Another way of getting this intuition: if I have 8 bytes, that could be an int64, or it could be a float64. Which one it is doesn't affect how it's stored at all -- either way it's stored as a chunk of 8 arbitrary bytes. What it affects is how we *interpret* these bytes -- e.g. there is one function called "int64 addition" which takes two 8-byte chunks and returns a new 8-byte chunk as the result, and a second function called "float64 addition" which takes those same two 8-byte chunks and returns a different one. The dtype tells you which of these operations should be used for a particular array.

What's special about a float64-with-units? Well, it's 8 bytes, but the addition operation is different from regular float64 addition: it has to do some extra checks and possibly unit conversions. This is exactly what the ufunc dtype dispatch and casting system is there for. It also solves your problem of having to write lots of boilerplate code, b/c if this is a dtype then you can just use the actual ndarray class directly, without subclassing or anything :-).
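To make the "same bytes, different interpretation" point concrete, here's a quick illustration you can run with plain NumPy today (just the int64/float64 case -- no units involved):

    import numpy as np

    # One 8-byte chunk, stored exactly once; only the interpretation changes.
    x = np.array([1.5], dtype=np.float64)

    as_float = x.view(np.float64)   # read the 8 bytes as a float64
    as_int = x.view(np.int64)       # reinterpret the *same* bytes as an int64

    print(as_float)   # [1.5]
    print(as_int)     # [4609434218613702656]  (the bit pattern of 1.5)

    # The ufunc picks "float64 addition" vs "int64 addition" based purely
    # on the dtype; the container and its bytes are the same either way.
    print(as_float + as_float)   # [3.]
    print(as_int + as_int)       # [9218868437227405312]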
> Anyway, my suggestion would be to *encourage* rather than discourage ndarray subclassing, and help this by making ndarray (even) better.
So, we very much need robust support for objects-that-quack-like-an-array that are *not* ndarrays, because ndarray subclasses are forced to use ndarray-style strided in-memory storage, and there's huge demand for objects that expose an array-like interface but use a different storage strategy underneath: sparse arrays, compressed arrays (like blosc), out-of-core arrays, computed-on-demand arrays (like dask), distributed arrays, etc. etc. And once we have solid support for duck-arrays and for user-defined dtypes (as discussed above), those two things remove a huge amount of the motivation for subclassing ndarray.

At the same time, ndarray subclassing is... nearly unmaintainable, AFAICT. The problem with subclassing is that you're basically taking some interface, making a copy of it, and then monkeypatching the copy. As you would expect, this is intrinsically very fragile, because it breaks abstraction barriers: suddenly things that used to be implementation details -- like which methods are implemented in terms of which other methods -- become part of the public API. And there's never been any coherent, documentable theory of how ndarray subclassing is *supposed* to work, so in practice it's just a bunch of ad hoc hooks designed around the needs of np.matrix and np.ma. We get a regular stream of bug reports asking us to tweak things one way or another, and it feels like trying to cover the floor with a too-small carpet -- we end up with an API that covers the needs of whoever complained most recently.

And then there's the fact that, as far as we can tell, 99% of the people who have ever sat down and tried to subclass ndarray ended up regretting it :-). Seriously, you are literally the only person I've ever heard say positive things about the experience, and I can't really see why, given how often I see you in the bug tracker complaining about some weird breakage :-). So there aren't many people motivated to work on it... If someone has a good plan for how to fix all this then by all means, speak up :-).

But IMO it's better to write some boilerplate that you can control than to import + monkeypatch, even if the latter seems easier in the short run. And there's a lot we can do to reduce that boilerplate -- e.g. when you want to implement a new sequence type in Python, you can write your __getitem__ and __len__ and then use collections.abc.Sequence to fill in the rest of the interface; we've been talking about adding something similar for arrays as part of the __numpy_ufunc__ work (a toy sketch of the stdlib pattern is below my sig).

-n

--
Nathaniel J. Smith -- http://vorpus.org
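P.S. Since I mentioned it above: here's what that collections.abc.Sequence pattern looks like in the stdlib today. The Squares class is just a toy stand-in -- the ndarray-side equivalent of this mixin is the part that doesn't exist yet:

    import collections.abc

    class Squares(collections.abc.Sequence):
        # Implement only the two abstract methods; the Sequence mixin
        # fills in __contains__, __iter__, __reversed__, index, and count.
        def __init__(self, n):
            self._n = n

        def __getitem__(self, i):
            if not 0 <= i < self._n:
                raise IndexError(i)
            return i * i

        def __len__(self):
            return self._n

    s = Squares(5)
    print(list(s))      # [0, 1, 4, 9, 16]
    print(16 in s)      # True -- __contains__ came from the mixin
    print(s.index(9))   # 3    -- index() came from the mixin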